Advances in Bioinformatics and Big Data Analytics

The book will play a vital role in improvising knowledge on the practical application of information science in the biol


English Pages 404 [406] Year 2023


Table of contents :
Contents
Preface
Overview
Objective
Organization
Audiences for the Publication
Abbreviations
Part I: Application and Analysis of Omics Data
Chapter 1
The Value of Next-Generation Sequencing and Multi-Omics Data for Clinical Diagnosis: Future Perspectives on Breast Cancer
Abstract
Introduction
NGS Types, Workflow, and Its Potency in Cancer Research and Clinical Routine
Driver and Passenger Mutations in Cancer, and Methods for Differentiating In-between
Strategies Relying on Differences in Biological Feature
Strategies Relying on Machine-Learning Methods
Methods Relating to the Effect of Mutations on Their Functional Properties
Pathway-Based and Network-Based Analysis
Next-Generation Sequencing in the Era of Breast Cancer Researches
Multi-omics Data in Cancer Research
Omics Data Repositories Databases
Catalogue of Somatic Mutations in Cancer (COSMIC)
Proteomics Identification Database (PRIDE)
Genomic Expression Omnibus (GEO)
Conclusion
References
Chapter 2
Next-Generation Sequencing and Omics Data Analysis Techniques
Abstract
Introduction
Recent Advances in the Application of Next-Generation Sequencing and Omics Data Analysis Techniques
Conclusion
References
Chapter 3
In silico Approaches to Vaccine Design
Abstract
Introduction
In silico Approaches for Vaccine Designing
Numerous In silico Approaches for Vaccine Designing
Conclusion
References
Chapter 4
Evolution of Genomic Medicine
Abstract
Introduction
Methodological Advancement
DNA, the Genetic Material
High Throughput Sequencing Technology
Pre-Genomic Era
First Generation Sequencing
Second Generation Sequencing
Third Generation Sequencing
Computational Tools Development
Whole Genome Analysis Tools
Disease and Drug Target Tools
Evolutionary Studies
Structural Modelling
Advent of Human Genome Projects
A Brief History
Significant Outcomes
ENCODE
GENCODE
Development of Genomic Database
Primary Databases
Secondary Databases
Composite Databases
Genome Annotation and Comparison
Paradigm Shift: Traditional to Genome Medicine
Need of Genome Medicine: Broader Vision
Genetic Diversity
Shortcomings of Traditional Medicine
GWAS and QTL Mapping
Implementation of Genome Medicine
Success, Challenges and Opportunities
Cancer
Sickle Cell Anaemia
Lactose Intolerance
Genomic Medicine and Its Financial Impact
Ethical and Legal Issues
Genome Medicine: Hope or Hype
Conclusion
References
Chapter 5
Bio-Inspired Computing
Abstract
Introduction
P-System
Informational Description
Components of the P-System
The Environment
Membranes
Symbols
Catalysts
Rules
Computation Process
Rule Application
Non-Deterministic Application
Maximally Parallel Application
As a Computation Model
Computation
Step 1
Step 2
Step 3
Step 4
Step 5
Step 6
Computation Halts
Bio-Inspired Swarm Optimization Algorithms
Genetic Bee Colony (GBC) Algorithm
Setting ABC Parameter
Initialization of the Population of Solutions
Evaluation of the Population Solutions
Employer Bee
Onlooker Bee
Scout Bee
Genetic Operators
Cat Swarm Optimization (CSO)
Seeking Mode
The Tracing Mode
Artificial Algae Algorithm (AAA)
Helical Movement Phase
Evolutionary Process Phase
Adaptation Phase
Elephant Search Algorithm (ESA)
Chicken Swarm Optimization (CSO)
Behavioral Understanding
Mathematical Understanding
Grey Wolf Optimization (GWO) Algorithm
Mathematical Understanding
Moth–Flame Optimization (MFO) Algorithm
Algorithm
Generating the Initial Population of Moths
Updating the Positions of Moths
Updating the Number of Flames
Different Variants of MFO
Multi-objective
Binary
Hybridization
Applications
Whale Optimization Algorithm (WOA)
Mathematical Model
Bubble-net Attacking Method (Exploitation Phase)
Shrinking Encircling Mechanism
Spiral Updating Position
Search for Prey (Exploration Phase)
Fish Swarm Optimization Algorithm (FSOA)
Concept and Algorithm
Individual Movement Operator
Food Operator
Instinctive Collective Movement Operator
Non-Instinctive Collective Movement Operator
Artificial Neural Network
Artificial Bee Colony Algorithm
Cuckoo Optimization Algorithm (COA)
Bacterial Foraging Optimization Algorithm (BFOA)
Flower Pollination Algorithm (FPA)
Neuromorphic Engineering
Neurological Inspiration
Neuromorphic Prototypes
Neuromorphic Sensors
Conclusion
References
Chapter 6
Feature Selection and Classification of Microarray Cancer Dataset: Review and Challenges
Abstract
Introduction
Microarray Technology
Feature Selection
Methods
Filter
Wrapper
Embedded
Hybrid
Classification
Logistic Regression
Naïve Bayes
K-Nearest Neighbor (KNN)
Support Vector Machine
Random Forest
Decision Trees
Dataset
Related Work
Performance Evaluation Measures
Result and Analysis
Based on the Following
Feature Selection
Dataset
Classifier
Conclusion
References
Part II: Application of Bioinformatics Tools and Databases
Chapter 7
Machine Learning Methods in Bioinformatics
Abstract
Introduction
Recent Trends in the Application of Machine Learning in Bioinformatics Techniques
Conclusion
References
Chapter 8
Molecular Biomarkers as Health and Disease Predictors
Abstract
Introduction
Specific Authors That Have Worked on Molecular Biomarkers as Health and Disease Predictors
Conclusion
References
Chapter 9
Systems Biology Applications and Bioinformatics
Abstract
Introduction
System Biology
Medicine
Agriculture
Bioremediation
Current Techniques Involved Systems Biology Application and Bioinformatics
Conclusion
References
Chapter 10
Genome Data Resources and Tools for Sequence Analysis
Abstract
Introduction to Bioinformatics
Role of Genomics
Tools for Genomics Research
FastQC
GeneWise
NCBI Prokaryotic Genomes Automatic Annotation Pipeline
GenSAS
Ori-Finder
P2RP
KAAS
Simple Synteny
MEGA11
DNA Plotter
SNP
SNP2CAPS
TASSEL
STRUCTURE
ClustalW
Bioinformatics Databases
GenBank
Phytozome
EMBL
Swiss-Prot
UniProtKB
Gramene
GrainGenes
MaizeGDB
Multiple Databases and Tools as Sources
NCBI
KEGG
Conclusion
References
Chapter 11
Bioinformatics Tools for Biomarker Discovery
Abstract
Introduction
Typical Examples of Bioinformatics Tools That Could Be Applied in the Discovery of Biomarkers
Conclusion
References
Chapter 12
A Review on Recent Advances in Different Modelling Techniques, Algorithms, and Software for Metabolic Pathways Analysis in System Biology
Abstract
Introduction
Molecular Modeling in System Biology
Concerns about Modeling and Simulation
Classification of Models
Structured Dynamical Systems
Pathway Analysis in Biological Systems
Interactions Flow through Biological Pathway
Feedback Control
Pathway Activation Measurement
Future Aspects and Challenges
Conclusion
References
Chapter 13
Bioinformatics Tools and Databases for Genomics Research
Abstract
Introduction
Relevance of Bioinformatics Tools and Databases for Genomics Research
Databases for Genomics Research
Conclusion
References
Chapter 14
Computer Viruses and Their Defences in Computer Networks: An e-Epidemiological Model
Abstract
Introduction
Simple SIVR Model
SIVR Fuzzy Model
Solution of Equilibrium Points
Basic Reproduction Number R0
Comparison R0 Versus
Local Stability for Virus Free Equilibrium
Global Stability Analysis
Control Strategies for Virus
Result and Discussion
Example 1
Conclusion
References
Glossary
Index
About the Editors
About the Contributors

Computer Science, Technology and Applications

No part of this digital document may be reproduced, stored in a retrieval system or transmitted in any form or by any means. The publisher has taken reasonable care in the preparation of this digital document, but makes no expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of information contained herein. This digital document is sold with the clear understanding that the publisher is not engaged in rendering legal, medical or any other professional services.

Computer Science, Technology and Applications

Advances in Bioinformatics and Big Data Analytics
Sujata Dash, PhD, Hrudayanath Thatoi, PhD, Subhendu Kumar Pani, PhD and Seyedamin Pouriyeh, PhD (Editors)
2023. ISBN: 979-8-88697-693-9 (Hardcover)
2023. ISBN: 979-8-88697-850-6 (eBook)

Situational Modeling: Definitions, Awareness, Simulation
Alexander Fridman, PhD (Editor)
2023. ISBN: 979-8-88697-590-1 (Hardcover)
2023. ISBN: 979-8-88697-725-7 (eBook)

Applications of Artificial Intelligence in the Healthcare Sector
Jyoti Prakash Patra, PhD and Yogesh Kumar Rathore (Editors)
2023. ISBN: 979-8-88697-502-4 (Hardcover)
2023. ISBN: 979-8-88697-541-3 (eBook)

Speech Recognition Technology and Applications
Vasile-Florian Păiș (Editor)
2022. ISBN: 978-1-68507-929-1 (Hardcover)
2022. ISBN: 979-8-88697-179-8 (eBook)

Internet of Everything: Smart Sensing Technologies
T. Kavitha, PhD, V. Ajantha Devi, PhD, S. Neelavathy Pari, PhD and Sakkaravarthi Ramanathan, PhD (Editors)
2022. ISBN: 978-1-68507-865-2 (Hardcover)
2022. ISBN: 978-1-68507-943-7 (eBook)

More information about this series can be found at https://novapublishers.com/product-category/series/computer-sciencetechnology-and-applications/

Sujata Dash, PhD, Hrudayanath Thatoi, PhD, Subhendu Kumar Pani, PhD and Seyedamin Pouriyeh, PhD
Editors

Advances in Bioinformatics and Big Data Analytics

Copyright © 2023 by Nova Science Publishers, Inc. DOI: https://doi.org/10.52305/SFPW4540 All rights reserved. No part of this book may be reproduced, stored in a retrieval system or transmitted in any form or by any means: electronic, electrostatic, magnetic, tape, mechanical photocopying, recording or otherwise without the written permission of the Publisher. We have partnered with Copyright Clearance Center to make it easy for you to obtain permissions to reuse content from this publication. Please visit copyright.com and search by Title, ISBN, or ISSN. For further questions about using the service on copyright.com, please contact:

Copyright Clearance Center
Phone: +1-(978) 750-8400
Fax: +1-(978) 750-4470
E-mail: [email protected]

NOTICE TO THE READER The Publisher has taken reasonable care in the preparation of this book but makes no expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of information contained in this book. The Publisher shall not be liable for any special, consequential, or exemplary damages resulting, in whole or in part, from the readers’ use of, or reliance upon, this material. Any parts of this book based on government reports are so indicated and copyright is claimed for those parts to the extent applicable to compilations of such works. Independent verification should be sought for any data, advice or recommendations contained in this book. In addition, no responsibility is assumed by the Publisher for any injury and/or damage to persons or property arising from any methods, products, instructions, ideas or otherwise contained in this publication. This publication is designed to provide accurate and authoritative information with regards to the subject matter covered herein. It is sold with the clear understanding that the Publisher is not engaged in rendering legal or any other professional services. If legal or any other expert assistance is required, the services of a competent person should be sought. FROM A DECLARATION OF PARTICIPANTS JOINTLY ADOPTED BY A COMMITTEE OF THE AMERICAN BAR ASSOCIATION AND A COMMITTEE OF PUBLISHERS.

Library of Congress Cataloging-in-Publication Data
ISBN: 979-8-88697-850-6 (eBook)

Published by Nova Science Publishers, Inc., New York

Contents

Preface
Abbreviations

Part I: Application and Analysis of Omics Data

Chapter 1
The Value of Next-Generation Sequencing and Multi-Omics Data for Clinical Diagnosis: Future Perspectives on Breast Cancer
Amel Elbasyouni, Leila Saadi and Abdelkarim Baha

Chapter 2
Next-Generation Sequencing and Omics Data Analysis Techniques
Charles Oluwaseun Adetunji, Frank Abimbola Ogundolie, Olugbemi Tope Olaniyan, Omosigho Omoruyi Pius, Kehinde Kazeem Kanmodi and Lawrence Achilles Nnyanzi

Chapter 3
In silico Approaches to Vaccine Design
Charles Oluwaseun Adetunji, Frank Abimbola Ogundolie, Olugbemi Tope Olaniyan, Simeon Kayowa Olatunde, Omosigho Omoruyi Pius, Kehinde Kazeem Kanmodi and Lawrence Achilles Nnyanzi

Chapter 4
Evolution of Genomic Medicine
Sujata Mohanty and Kopal Singhal

Chapter 5
Bio-Inspired Computing
Raghav Mishra, Tanishq Mandloi and Anjali Priyadarshini

Chapter 6
Feature Selection and Classification of Microarray Cancer Dataset: Review and Challenges
Santosini Bhutia and Bichitrananda Patra

Part II: Application of Bioinformatics Tools and Databases

Chapter 7
Machine Learning Methods in Bioinformatics
Charles Oluwaseun Adetunji, Frank Abimbola Ogundolie, Olugbemi Tope Olaniyan, Omosigho Omoruyi Pius, Kehinde Kazeem Kanmodi and Lawrence Achilles Nnyanzi

Chapter 8
Molecular Biomarkers as Health and Disease Predictors
Charles Oluwaseun Adetunji, Frank Abimbola Ogundolie, Olugbemi Tope Olaniyan, Omosigho Omoruyi Pius, Kehinde Kazeem Kanmodi and Lawrence Achilles Nnyanzi

Chapter 9
Systems Biology Applications and Bioinformatics
Charles Oluwaseun Adetunji, Frank Abimbola Ogundolie, Olugbemi Tope Olaniyan, Omosigho Omoruyi Pius, Kehinde Kazeem Kanmodi and Lawrence Achilles Nnyanzi

Chapter 10
Genome Data Resources and Tools for Sequence Analysis
Sandesh Behera, Tikshana Yadav, Surendra Pratap Singh and Hrudayanath Thatoi

Chapter 11
Bioinformatics Tools for Biomarker Discovery
Charles Oluwaseun Adetunji, Frank Abimbola Ogundolie, Olugbemi Tope Olaniyan, Omosigho Omoruyi Pius, Kehinde Kazeem Kanmodi and Lawrence Achilles Nnyanzi

Chapter 12
A Review on Recent Advances in Different Modelling Techniques, Algorithms, and Software for Metabolic Pathways Analysis in System Biology
Manish Paul, Saikat Chakrabarti and Amrita Banerjee

Chapter 13
Bioinformatics Tools and Databases for Genomics Research
Charles Oluwaseun Adetunji, Frank Abimbola Ogundolie, Olugbemi Tope Olaniyan, Sujata Dash, Omosigho Omoruyi Pius, Kehinde Kazeem Kanmodi and Lawrence Achilles Nnyanzi

Chapter 14
Computer Viruses and Their Defences in Computer Networks: An e-Epidemiological Model
Yerra Shankar Rao, Binayak Dihudi, Subash Chandra Mishra, Ranjita Rath and Tarini Charan Panda

Glossary
Index
About the Editors
About the Contributors

Preface

Overview

This book surveys advances in bioinformatics and big data analytics, an emerging field of research at the intersection of information science, computer science, biology, and healthcare. The ready availability of large volumes of bioinformatics data brings both tremendous opportunities and challenges. Bioinformatics and big data analytics deal with huge, growing, and complex datasets, and the current focus is on next-generation sequencing technologies. The considerable expansion of big biological data poses storage and processing challenges. Predictive data analytics that draws on the wealth of data from biomedical and natural sources, such as genetic mapping of DNA sequences, helps us understand the human condition, health, and disease. This in turn supports curing diseases and improving human health and lives by sustaining the development of precision methods for healthcare. Various methods and software have been developed in this field for storing, organizing, understanding, and interpreting the exponentially growing amount of biological data, with the ultimate aim of solving medical and biological problems.

Keywords: Big Data and NextGen sequencing; microarray-based gene expression analysis; bioimage analysis; personalized medicine; structural bioinformatics; vaccination; fuzzy basic reproduction number; stability; virus


Objective

This book aims to improve knowledge of the practical application of information science in the biological field. Researchers and practitioners working in Big Data, IoT, computational intelligence, biomedicine, and bioinformatics will all benefit from it. It offers a collection of state-of-the-art approaches to data mining for bioinformatics and health-related applications. New researchers and practitioners in the field can use it to find the best-performing methods quickly, compare different approaches, and carry their research forward in an area that directly affects human life and health. The book is also useful because no other text on the market provides a good collection of state-of-the-art methods in big data-driven bioinformatics. Emerging technology has made data entry much more accessible, but it has also created challenges for researchers and medical professionals, which the book discusses. Data sets have grown so large that extracting and analyzing data with traditional methods has become difficult. However, these data sets also offer an exciting opportunity to understand large-scale patterns and make predictions about health care. Intelligent technologies now apply big data techniques worldwide. The book also presents computational approaches to vaccine design, which will help us better understand ourselves and our environment. It covers new information processing technology that combines ideas from biology, chemistry, and medical science to manage electronic medical records productively. Specific topics covered include database management, genomics, proteomics, and scalability.
The book reports the latest advances and developments in bioinformatics, NextGen sequencing, health informatics, data mining, machine learning, and computational intelligence.

Organization

The book, “Advances in Bioinformatics and Big Data Analytics,” consists of 14 edited chapters, organized into the following two sections:

Part I: Application and Analysis of Omics Data
Part II: Application of Bioinformatics Tools and Databases

A brief summary of the chapters is provided below.

Part I: This section focuses on the application and analysis of omics data in bioinformatics. It contains six chapters. The first chapter, titled “The Value of Next-Generation Sequencing and Multi-Omics Data for Clinical Diagnosis: Future Perspectives on Breast Cancer,” offers insight into next-generation sequencing and multi-omics data. The authors explain that whole-exome sequencing can be used to find sequence variation within the exon sequences of the human genome in a more targeted manner. DNA-protein interactions, chromatin accessibility, and DNA methylation are investigated using epigenomics approaches. RNA molecules are identified and quantified using transcriptomics techniques. Next-generation sequencing is currently the backbone of targeted therapy. The technology is highly cost-effective and allows researchers to use the whole genome, exome, or transcriptome to detect previously unseen mutations in several types of cancer. Chapter 2, titled “Next-Generation Sequencing and Omics Data Analysis Techniques,” gives insights into the progress of next-generation sequencing technologies, which are revolutionizing research on genomes (single nucleotide polymorphisms, loss of heterozygosity variants, copy number variants, genomic rearrangements, rare variants), epigenomes (DNA methylation, chromatin accessibility, histone modifications, transcription factor binding), and transcriptomes (alternative splicing, gene expression, small RNAs, and long non-coding RNAs). Chapter 3, titled “In silico Approaches to Vaccine Design,” provides detailed information on some techniques that can be applied in in silico vaccine design.
In Chapter 4, titled “Evolution of Genomic Medicine,” the authors identify the significant technological advances in genomics and bioinformatics that led to the development of genomic medicine and discuss the potential challenges and opportunities of this emerging field. Technology increasingly borrows ideas from natural algorithms, as discussed in Chapter 5, titled “Bio-Inspired Computing.” Combining computing with natural and biological phenomena brings our understanding of technology closer to nature and its events. The world is full of biology, and combining it with technology holds great promise for the future of computing. Concepts drawn from Artificial Intelligence (AI), genetic algorithms, immune systems, and human life form a broad scope of ideas that will change how people live. In this chapter, the authors argue for building computing around nature and the way it operates. Cancer, a deadly disease, is caused by the abnormal growth of cells in the body. Microarray technology has become a popular way to diagnose such critical diseases. Fast and accurate methods are needed for cancer diagnosis and for drug discovery that helps eradicate the disease from the body. Chapter 6, titled “Feature Selection and Classification of Microarray Cancer Dataset: Review and Challenges,” gives a comprehensive study of microarray gene expression data with feature selection and classification algorithms and closes with the future challenges of the problem.

Part II: This section offers insights into the application of bioinformatics tools and databases through eight contributed chapters submitted by researchers in this field. Chapter 7, titled “Machine Learning Methods in Bioinformatics,” explains that machine learning, a form of artificial intelligence, enables a computer to learn from data and improve its performance using approaches such as supervised learning, semi-supervised learning, unsupervised learning, optimization and reinforcement learning, artificial neural networks, and best first tree, among others. This learning mechanism has found use in many sectors, including virtual personal assistants, speech recognition, self-driving cars and trains, biomedicine, and genomic research. In genomic research, machine learning has been effective in analyzing and interpreting data sets, and it has been applied in biomedicine to drug prediction and discovery, disease prediction and diagnosis, drug repositioning, and cancer research, among others.
Chapter 8, titled “Molecular Biomarkers as Health and Disease Predictors,” provides insight into clinical medicine, the use of biomarkers, and advances in many areas, such as diagnosis, genome study, molecular biology, and the treatment of diverse infections. Biomarkers are generated in pathogenic processes, normal physiological activity, therapeutic intervention, and pharmacological reaction. The chapter therefore provides relevant and detailed information on molecular biomarkers as health and disease predictors. Chapter 9, titled “Systems Biology Applications and Bioinformatics,” explains that systems biology is the point where computer science, engineering, and biology converge. It involves the computational understanding of the ongoing interactions within complex cells and their immediate surroundings. Applying systems biology with the aid of computational tools has transformed various fields: it has improved agricultural output, deepened the understanding of pharmacology, and advanced medicine. This has led to timely and less tedious early diagnosis of various diseases, targeted medicine, support for natural immunity, drug design and development, and the removal of contaminants or toxins from the environment through bioremediation. The authors of Chapter 10, titled “Genome Data Resources and Tools for Sequence Analysis,” explain that genomics is the study of an organism's entire genetic makeup. The introduction of high-throughput sequencing techniques in this genomic era has resulted in massive amounts of genomic data being generated regularly. The effective use of tools and databases aids in analyzing genetic data and diversity. Techniques such as Genetic Diversity, Next-Generation Sequencing (NGS) Genetic Markers, Gene Mapping, Genotyping, Genome-Wide Interaction, Genomic Selection (GS), and Multiparent Advanced Generation Inter-Cross (MAGIC) are grouped into different categories. These databases and tools are commonly used in genomics research. Chapter 11, titled “Bioinformatics Tools for Biomarker Discovery,” explains that biological markers are usually used to analyze and measure human health conditions. These markers are often used along with bio-computational tools to predict the normal or abnormal state of human health and to manage diseases effectively. Biomarkers play an important role in medical intervention. Thus, proteomics techniques, bioinformatics, and machine learning have been explored to characterise and identify important proteins in health and disease for biomarker discovery and treatment interventions. The discovery of these biological markers has eased disease diagnosis and supports the early detection of diseases.
Biomarker development begins with discovery; before certification, a candidate biomarker requires biomarker validation and clinical validation. Chapter 12, titled “A Review on Recent Advances in Different Modelling Techniques, Algorithms, and Software for Metabolic Pathways Analysis in System Biology,” presents an overview of modelling methods and parameter estimation, determining different metabolic pathways of biological systems through simulation software. The chapter also reviews the routine methods used in mathematical biology and bioinformatics to describe genetic regulatory systems. Chapter 13, titled “Bioinformatics Tools and Databases for Genomics Research,” highlights the use of bioinformatics tools to analyze and store biological data on the structural and functional processes or properties of proteins. These tools depend on databases and repositories of the DNA/RNA or amino acid sequences that code for such proteins. This chapter is a mini-review of the application of bioinformatics tools and databases to genomic research in various sectors.


The final chapter, “Computer Viruses and Their Defences in Computer Networks: An e-Epidemiological Model,” presents an e-SVIRS fuzzy model. The authors create fuzzy membership functions for two parameters, the transmission rate and the recovery rate after infection. The computer virus load is a function of infection in the network system. The authors also discuss the stability analysis of the model at the virus-free equilibrium point under malware loads. Both the classical and the fuzzy basic reproduction numbers are calculated. The fuzzy basic reproduction number helps control the virus in the network.

Audiences for the Publication

This book will be of particular interest to graduate and postgraduate students, teachers, researchers, scientists, technical resource persons, bioengineers, members of institutes, and industry personnel pursuing research in bioinformatics, biomedicine, and computer science. Researchers in fields such as computer science, medical informatics, healthcare IoT, computational intelligence, machine learning, medical image processing, and clinical big data analytics will also find it helpful. The book can be promoted in colleges, universities, research institutes, and industries. In addition, channels such as electronic and print media, digital resources, academic libraries, and online trading platforms will be used for its publication.

Editors

Dr. Sujata Dash
Department of Information Technology, Nagaland University, Dimapur, Nagaland, India

Prof. Hrudayanath Thatoi
Maharaja Srirama Chandra Bhanjadeo University, Baripada, Odisha, India

Dr. Subhendu Kumar Pani
Krupajal Computer Academy, BPUT, Odisha, India

Dr. Seyedamin Pouriyeh
Information Technology, Kennesaw State University, USA

Abbreviations

2D-DIGE       Two-Dimensional Difference Gel Electrophoresis
2DPAGE        Two-Dimensional Polyacrylamide Gel Electrophoresis
3D            Three-Dimensional
AAA           Artificial Algae Algorithm
AAO           Artificial Algae Optimization
ABC           Artificial Bee Colony
ABySS         Assembly By Short Sequencing
ADIT          Auto Dep Input Tool
ADMET         Chemical Absorption, Distribution, Metabolism, Excretion, and Toxicity
AFLPs         Amplified Fragment Length Polymorphism
AGA           Adaptive Genetic Algorithms
AI            Artificial Intelligence
ANN           Artificial Neural Network
ANNs          Artificial Neural Networks
AURKB         Aurora Kinase B
AutoSNPdb     Annotated Single Nucleotide Polymorphism database
BACs          Bacterial Artificial Chromosomes
Bandage       Bioinformatics Application for Navigating De novo Assembly Graphs Easily
BFOA          Bacterial Foraging Optimization Algorithm
BFT           Best First Tree
BLAST         Basic Local Alignment Search Tool
BLASTp        Standard protein-protein BLAST
BLAT          BLAST-like Alignment Tool
BLDC          Brushless Direct Current
BMMFOA        Binary coded Modified Moth-Flame Optimization Algorithm
BN            Bayesian Network

BP            Back Propagation
BS-seq        Bisulfite sequencing
CA            Cellular Automata
CAMPP         anti-Cancer Biomarker Prediction Pipeline
CAP3          Contig Assembly Program
CATH          Hierarchical Classification of Protein Domain Structures
CCLE          Cancer Cell Line Encyclopedia
CD 8+ T cell  Cytotoxic CD8+ T cells
CDC           Counts of Dimensions to Change
CFO           Central Force Optimization
ChIPseq       Chromatin Immunoprecipitation sequencing
cHPI          CVD-specific Host-Pathogen Interactions
CMOS          Complementary Metal-Oxide Semiconductor
CN            Chicks
CNV           Copy Number Variation
CNVs          Copy Number Variants
COA           Cuckoo Optimization Algorithm
COBALT        Constraint-Based Multiple Alignment Tool
CODIS         Combined DNA Index System
COSMIC        Catalogue of Somatic Mutations in Cancer
COVID19       Coronavirus Disease 2019
CPTAC         Clinical Proteomic Tumour Analysis Consortium
CPU           Central Processing Unit
CSO           Cat Swarm Optimization
CSOA          Chicken Swarm Optimization
CVDs          Cardiovascular Diseases
CXCL8         CXC Motif Chemokine Ligand 8
D             Dimensional space
dbGaP         database of Genotypes and Phenotypes
DBMS          Database Management Systems
dbSNP         database of Short Genetic Variation
DDBJ          DNA Data Bank of Japan
DE            Differential Evolution
DEWE          Differential Expression Workflow Executor
DLGAP5        Disks Large Associated Protein 5
DNA           Deoxyribonucleic Acid
DNAView       Paternity and Kinship Analysis
DOE           Department of Energy

DT    Decision Trees
EBI    European Bioinformatics Institute
ELISA    Enzyme Linked Immunosorbent Assay
ELM    Extreme Learning Machine
ELR    Egg Laying Radius
EMBL    European Molecular Biology Laboratory
EMT    Epithelial-to-Mesenchymal Transition
ENCODE    Encyclopedia of DNA Elements
EPD    Eukaryotic Promoter Database
ER    Oestrogen Receptor
ESA    Elephant Search Algorithm
EST    Expressed Sequence Tag
ESTs    Expressed Sequence Tags
FBFE    Fuzzy Backward Feature Elimination
FLC    Fuzzy Logic Controller
FPA    Flower Pollination Algorithm
FSOA    Fish Swarm Optimization Algorithm
GA    Genetic Algorithm
GAM-NGS    Genomic Assemblies Merger for Next-Generation Sequencing
GBC    Genetic Bee Colony
GBCO    Genetic Bee Colony Optimization
GenBank    Genetic Data Bank
GeneDB    Gene Database
GenSAS    Genome Sequence Annotation Server
GEO    Gene Expression Omnibus
GFF    General Feature Format
GHOST    Global Hepatitis Outbreak and Surveillance Technology
GKN1    Gastrokine 1
GKN2    Gastrokine 2
GLM    General Linear Model
GMaP    Geographic Management of Cancer Health Disparities Program
GS    Genomic Selection
GSDB    Genome Sequence Databases
GSEA    Gene Set Enrichment Analysis
GWAS    Genome-Wide Association Studies
GWO    Grey Wolf Optimization Algorithm

HER2/ErbB2    Human EGF-like growth factors Receptor
HGP    Human Genome Project
HISAT    Hierarchical Indexing for Spliced Alignment of Transcripts
HIV    Human Immunodeficiency Virus
HJURP    Holliday Junction Recognition Protein
HPI    Host-Pathogen Interactions
HRASLS2    HRAS-like suppressor 2
HRASLS2    phospholipase A and acyltransferase 2
HTGs    High-Throughput Genomes
HTML    Hyper Text Markup Language
HTS    High-Throughput Sequencing
ICA    Independent Component Analysis
ICAT    Isotope-Coded Affinity Tags
ICGC    International Cancer Genome Consortium
IG    Information Gain
InDels    Insertion-Deletion mutations
INSDC    International Nucleotide Sequence Database Collaboration
iPANDA    In silico Pathway Activation Network Decomposition Analysis
iTRAQ    isobaric Tags for Relative and Absolute Quantitation
KAAS    KEGG Automatic Annotation Server
KEGG    Kyoto Encyclopedia of Genes and Genomes
KNN    K Nearest Neighbors
KO    KEGG Orthology
LAPO    Lightning Attachment Procedure Optimization
LDA    Linear Discriminant Analysis
logLRT    Log-Likelihood Ratio
MAGIC    Multiparent Advanced Generation Inter-Cross
MaizeGGD    Maize Genetics and Genomics Database
MAPLE    Metabolic and Physiological Potential Evaluator
MAQC    Microarray Analysis Quality Control
MDKAP    Mass Disaster Kinship Analysis Program
MEGA    Molecular Evolutionary Genetics Analysis
METABRIC    Molecular Taxonomy of Breast Cancer International Consortium
Methyl-seq    Methylation sequencing
MFISys    Mass Fatality Identification System
MFO    Moth-Flame Optimization algorithm

MI    Mutual Information
MIM    Mutual Information Maximization
MIPS    Martinsried Institute for Protein Sequences
miRNAs    microRNAs
ML    Machine Learning
MLM    Mixed Linear Model
MLP    Multi-Layer Perceptron
MMFOA    Modified Moth-Flame Optimization Algorithm
MOMFA    Multi-Objective Moth Flame Optimization Algorithm
MOSFET    Metal Oxide Semiconductor Field Effect Transistor
MR    Mixture Ratio
MRMR    Minimal Redundancy Maximum Relevance
mRNA    messenger RNA
MudPIT    Multidimensional Protein Identification Technology
MYL9    Myosin Regulatory Light Polypeptide 9
MySQL    My Structured Query Language
NB    Naïve Bayes
NBRF    National Biomedical Research Foundation
NCAPG    Condensin complex subunit 3
NCBI    National Center for Biotechnology Information
NCI    National Cancer Institute
NEAT1    Nuclear Paraspeckle Assembly Transcript 1
NGS    Next-Generation Sequencing
NHGRI    National Human Genome Research Institute
NIH    National Institutes of Health
NLM    National Library of Medicine
ODE    Ordinary Differential Equation
OmicsDI    Omics Discovery Index
OMIM    Online Mendelian Inheritance in Man
ONT    Oxford Nanopore Technology
OriFinder    Open Reading Frame Finder
P2RP    Predicted Prokaryotic Regulatory Proteins
PCA    Principal Components Analysis
PDB    Protein Data Bank
PGAP    Prokaryotic Genome Annotation Pipeline
PGM    The Ion Personal Genome Machine
PHYLIP    PHYLogeny Inference Package
PIR    Protein Information Resource

PR    Progesterone Receptor
PRIDE    Proteomics Identification Database
PS    Population Size
PSO    Particle Swarm Optimization
PubMed    Public/Publisher MEDLINE
QMEAN    Qualitative Model Energy Analysis
QML    Qualitative Model Learning
QR    Qualitative Reasoning
QSAR    Quantitative Structure Activity Relationship
QTL    Quantitative Trait Locus
QTLs    Quantitative Trait Loci
QUAST    Quality Assessment Tool for Genome Assemblies
R&D    Research & Development
RADS    Research and Discovery Stage
RCSB    Research Collaboratory for Structural Bioinformatics
RefSeq    Reference Sequence
RELM    Regularized ELM
RF    Random Forest
Ribo-seq    Ribosome sequencing
RNA    Ribonucleic Acid
RNA-seq    RNA sequencing
SARS-CoV-2    Severe Acute Respiratory Syndrome Coronavirus 2
SB    Systems Biology
SCGB2A1    Secretoglobin Family 2A Member 1
SCOP    Structural Classification of Proteins
SFRP2    Secreted Frizzled-Related Protein 2
SGA    Standard Genetic Algorithm
SIB    Swiss Institute of Bioinformatics
SIR    Susceptible Infected Recovered
SIVR    Susceptible Infected Vaccinated Recovered
SL    Supervised Learning
SNP    Single Nucleotide Polymorphism
SNV    Single Nucleotide Variation
SOM    Self-Organizing Maps
SPC    Self-Position Consideration
SRA    Sequence Read Archive
SRD    Seeking a Range of selected Dimension

SRS    Sequence Retrieval System
SSL    Semi-Supervised Learning
SSRs    Simple Sequence Repeats
STAR    Spliced Transcripts Alignment to a Reference
SU    Symmetric Uncertainty
SVIRS    Susceptible Vaccinated Infected Recovered Susceptible
SVM    Support Vector Machines
TASSEL    Trait Analysis by aSSociation, Evolution, and Linkage
TCGA    The Cancer Genome Atlas
TIGR    The Institute for Genomic Research
TPX2    Targeting protein for Xklp2
TrEMBL    Translation of EMBL
UC    Unit Commitment
UniMES    UniProt Metagenomic and Environmental Sequences
UniRef    UniProt Reference Clusters
USL    Unsupervised Learning
VEGF    Vascular Endothelial Growth Factor
VLSI    Very-Large Scale Integration
WES    Whole-exome sequencing
WGS    Whole genome sequencing
WHO    World Health Organization
wHPI    Whole Host-Pathogen Interactions
WOA    Whale Optimization Algorithm
XRD    X-Ray Diffraction
ZMW    Zero-Mode Waveguide

Part I: Application and Analysis of Omics Data

Chapter 1

The Value of Next-Generation Sequencing and Multi-Omics Data for Clinical Diagnosis: Future Perspectives on Breast Cancer

Amel Elbasyouni1,2,*, Leila Saadi2,3 and Abdelkarim Baha4

1 Molecular Biology and Biotechnology Laboratory, Pan African University for Basic Sciences, Technology and Innovation (PAUSTI), Kenya
2 Department of Biology, SNV Faculty, Blida 1 University, Algeria
3 Animal Ecobiology Laboratory, Higher Normal School, Kouba, Algiers, Algeria
4 Anatomy Pathology Service, CHU Beni Messous, Algiers, Algeria

Abstract

Omics technologies are high-throughput assays that identify and quantify all biomolecules of a given class in a biological sample. Next-generation sequencing techniques are among the most widely used omics techniques. Typically, genomic studies examine the DNA sequence, including coding and non-coding sequences, across the entire genome. Whole-exome sequencing can be used to find sequence variation within the exons of the human genome in a more targeted manner. DNA-protein interactions, chromatin accessibility, and DNA methylation are investigated using epigenomics approaches. RNA molecules are identified and quantified using transcriptomics techniques. Currently, next-generation sequencing is the backbone of targeted therapy. This technology is highly cost-effective and allows researchers to detect mutations across the whole genome, exome, or transcriptome that could not previously be seen in several types of cancer. As the most frequently diagnosed cancer among women and one of the deadliest, breast cancer is a serious threat. Breast cancer patients have a wide range of survival times, indicating the need to find predictive indicators for customized clinical diagnosis and treatment. Therefore, we believe this technology must be used in breast cancer diagnosis and adopted into routine clinical practice to plan an effective targeted therapy suited to each patient. It may help detect resistance pathways, identify prognostic biomarkers, and evaluate predictors of tumor heterogeneity.

* Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9 © 2023 Nova Science Publishers, Inc.

Keywords: next-generation sequencing, multi-omics, breast cancer, clinical diagnosis

Introduction

Since the first human genome sequence was completed, demand for cheaper and faster sequencing technology has surged (Thankachan and Thomas, 2018). This need has fueled the development of second-generation, or next-generation, sequencing (NGS) technologies (Torri et al., 2012). Over the last decade, many researchers have used next-generation sequencing technologies to decipher the variety of the breast cancer genome, gradually revealing the breast cancer genomic landscape (Hong et al., 2020). In breast cancer research, next-generation sequencing is primarily used in three areas: 1. genome sequencing analysis (Woerner et al., 2021), which includes whole-genome, exome, and targeted gene sequencing (Bewicke-Copley et al., 2019); 2. RNA transcription sequencing, which includes whole-transcriptome, small RNA, and non-coding RNA sequencing (Hong et al., 2020); and 3. epigenetic sequencing, which includes chromatin immunoprecipitation and methylation analysis sequencing (Figure 1). NGS platforms perform massively parallel sequencing, in which millions of DNA fragments from a single sample are sequenced simultaneously (Thankachan and Thomas, 2018). This high throughput enables whole-genome sequencing in less than a day (Torri et al., 2012).

There are three primary methods of next-generation DNA sequencing that can detect genetic mutations: whole-genome sequencing, whole-exome sequencing, and targeted sequencing (Behjati and Tarpey, 2013). At this level, NGS can uncover somatic genetic abnormalities in a cancer genome by comparing the tumor sequence with the intact germline DNA (Conway et al., 2019). According to their degree of involvement in tumor development, somatic genetic alterations are of two types: driver or passenger (Bozic et al., 2010). Driver mutations are involved in tumor development, whereas the other somatic alterations, dubbed passenger mutations, are not and may result from the tumor’s genomic instability (McFarland et al., 2017). However, the distinction between these two types of mutations is dynamic and may shift over the course of the disease (Bozic et al., 2010). For instance, a passenger mutation may evolve into a driver mutation during anti-cancer treatment if the resistant clone gains a clonal advantage. Genetic alterations can also be categorized by type, including nucleotide substitutions, indels (insertions or deletions of small fragments), gains or losses of copies (copy number variation), and chromosomal rearrangements of large sequences, namely insertions, deletions, duplications, inversions, or translocations (Kumar et al., 2020).
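The tumor-versus-germline comparison and the small-variant categories described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real variant caller: the `Variant` record, the function names, and the coordinates are all hypothetical, and real pipelines work on aligned reads with quality filtering rather than on ready-made variant lists.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (names chosen for illustration)."""
    chrom: str
    pos: int   # 1-based position
    ref: str   # reference allele
    alt: str   # alternate allele


def somatic_variants(tumor, germline):
    """Return variants seen in the tumor but absent from the germline sample."""
    germline_set = set(germline)
    return [v for v in tumor if v not in germline_set]


def variant_type(v):
    """Classify a small variant by comparing allele lengths."""
    if len(v.ref) == 1 and len(v.alt) == 1:
        return "substitution"
    if len(v.alt) > len(v.ref):
        return "insertion"
    if len(v.alt) < len(v.ref):
        return "deletion"
    return "complex"


# Made-up calls, not taken from any dataset.
tumor = [
    Variant("chr1", 100, "C", "T"),
    Variant("chr1", 200, "A", "G"),   # also present in the germline
    Variant("chr2", 300, "CT", "C"),
]
germline = [Variant("chr1", 200, "A", "G")]

for v in somatic_variants(tumor, germline):
    print(v.chrom, v.pos, variant_type(v))
```

Copy number changes and large rearrangements need read-depth and split-read evidence and are outside the scope of this sketch.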

Figure 1. Heterogeneity of omics data in cancer research.

Besides genome sequencing, a cell’s transcriptome can be analyzed using high-throughput RNA sequencing (RNA-Seq) (Hong et al., 2020). RNA-Seq offers significantly greater dynamic range, coverage, and resolution of the transcriptome than Sanger sequencing and microarray-based approaches. Additionally, the generated data may uncover novel transcripts, reveal spliced genes, and detect allele-specific expression (Hong et al., 2020). In addition to polyadenylated messenger RNA (mRNA), RNA-Seq can be used to explore the diversity of RNA populations, including total RNA and pre-mRNA as well as non-coding RNAs, such as microRNAs and long non-coding RNAs (Hong et al., 2020; Li et al., 2021). Moreover, NGS technology was quickly adopted by the epigenetics field (Sarda and Hannenhalli, 2014). Epigenetic changes have been studied with this technology in numerous species and cell types (Roberti et al., 2019). DNA methylation has recently gained much attention (Liu et al., 2021) because of 5-hydroxymethylcytosine and its function in epigenetic reprogramming and in tumor development and progression (Bakhoum et al., 2021). Breast cancer cells have been identified by multi-omics profiling on the basis of their “molecular signature,” that is, their distinct biological and molecular behavior (Nguyen et al., 2020). When DNA damage remains unchecked, or repair processes fail, it can lead to oncogenic, or cancer-causing, mutations (Anstine and Keri, 2019). Damage to cells’ DNA can arise from various sources, including oxidative stress. One of DNA’s “signature” features is the variety of damage and repair mechanisms that can act on it (Feng et al., 2018). This chapter outlines the significance of multi-omics data across distinct cellular levels, including the genome, epigenome, and transcriptome, which provide exceptional opportunities to unravel the biology of cancers, and of breast cancer in particular.
In this chapter, we review research areas where multi-omics can help us understand malignant transformation, oncogenesis, metastasis, drug resistance, the microenvironment, and epithelial-to-mesenchymal transition, especially in breast cancer. Moreover, we aim to advocate the establishment and adoption of next-generation sequencing in routine clinical practice through the identification and annotation of mutations and the assessment of structural variation.

NGS Types, Workflow, and Its Potency in Cancer Research and Clinical Routine

Although various computational tools devoted to particular areas of NGS data processing have been created in recent years (Thankachan and Thomas, 2018), most have project-specific features and are difficult to operate and parameterize (Torri et al., 2012). This makes it hard for bench scientists and researchers to grasp the details of this novel subject (Conway et al., 2019). In this section, we offer a nine-step NGS-based workflow for using NGS data with the aim of customized treatment (Figure 2).

Figure 2. Pipeline and tools for genomic NGS workflow.

The NGS workflow may also serve as a springboard for collaboration between computational biologists, bioinformaticians, and clinicians (Torri et al., 2012) in order to:

• Improve the analytical methods and technologies;
• Leverage diverse sources of data and implement software and tools, and
• Serve as autonomous users, responsible for developing new clinical practice.

First, there are three interconnected aspects to the proposed NGS workflow: template preparation, sequencing, and imaging (Thankachan and Thomas, 2018). The three sequencing methods are sequencing by synthesis, sequencing by reversible terminator, and sequencing by ligation (Head et al., 2014). The combination of protocols that each NGS platform uses to connect these three components affects features of the NGS data from that platform, such as the type, the coverage, and the quality (Thankachan and Thomas, 2018). The two most powerful strategies for the subsequent steps are immobilizing the templates created in the preliminary phase at spatially separated sites and recording the sequencing reactions simultaneously (Shyr and Liu, 2013). Following the production of NGS reads, the subsequent processes can be categorized into the following four basic groups:

• Chromatin immunoprecipitation sequencing (ChIP-seq), which facilitates the profiling of transcription factors in human cancers (Singh et al., 2019);
• Ribonucleic acid sequencing (RNA-seq);
• Bisulfite sequencing (BS-seq), and
• Whole-genome and whole-exome sequencing (WGS/WES) (Torri et al., 2012).

RNA sequencing (RNA-seq) is a transcriptomics technology that uses recently developed, efficient NGS techniques to identify the presence and quantity of RNA transcripts across the entire genome (Wang et al., 2020). Because it does not depend on transcript-specific probes, RNA-seq allows the unbiased discovery of new transcripts, which is impossible with microarray technology. RNA-seq has several other advantages over microarrays, including a broad dynamic range, higher specificity and sensitivity, and the ability to detect low-abundance transcripts (Z. Wang, Gerstein, and Snyder 2009). Analysis with RNA-seq has revealed that the mammalian transcriptional landscape is far more complicated than previously thought. Aside from a varied spectrum of protein-coding RNAs and well-established regulatory RNAs such as microRNAs (Liao et al., 2019), non-coding RNAs (ncRNAs) are ubiquitously transcribed from the great majority of non-coding sections of the genome, including intergenic and intronic sequences (Atkinson, Marguerat, and Bähler 2012). Newly available, massive amounts of gene expression profiling data have shown
differences in gene expression between many types of cancer tissue and their normal counterparts, pointing to the possibility of uncovering complex molecular pathways that help explain cancer progression (Hong et al., 2020). In immunogenomics, RNA sequencing effectively probes a tumor’s transcriptome and microenvironment (Smith et al., 2020). Tumor heterogeneity is an important characteristic: bulk RNA-seq results are challenging to interpret because stromal and other cell types infiltrate the tumor (Wang et al., 2020). Deconvoluting the functionally important signals from the averaged signals of bulk RNA-seq can be complex because of the quantitative character of gene expression data (Milanez-Almeida et al., 2020). Intra-tumor transcriptome heterogeneity, which is critical for therapeutic response, can be analyzed with single-cell RNA sequencing (scRNA-seq) (Milanez-Almeida et al., 2020). An early investigation of drug resistance by Lee et al. (2014), conducted in a model of drug tolerance using a metastatic breast cancer cell line, is an excellent example of this kind of research. Drug-tolerant cells showed distinct expression of genes involved in microtubule assembly and stability, cell adhesion, and molecular pathway signaling (Lee et al., 2014). Untreated or stressed cells lacked these drug-tolerant RNA variants. Generating specific RNA variants thus increases heterogeneity and ensures the survival of a minority population (Lee et al., 2014; Ren et al., 2021). Single-cell analysis can also yield insightful clues for tumor treatment. Because of intra-tumor heterogeneity, a given targeted therapy often destroys only part of the tumor while leaving other parts unscathed (Hong et al., 2020). Therapeutic strategies that target multiple tumor subpopulations are crucial for overcoming this obstacle. 
Researchers have employed scRNA-seq in metastatic renal cell carcinoma to examine numerous drug target pathways in various cell types and to compare combination strategies with monotherapies (Kim et al., 2016). Advanced metastatic cancer poses the greatest clinical hurdles and can have genetic and cellular characteristics distinct from those of cancer’s early stages. Single-cell RNA sequencing has demonstrated the molecular and cellular reprogramming of metastatic lung cancer (Kim et al., 2020). Furthermore, single-cell RNA sequencing in mouse breast cancer models revealed unique patterns of cell state heterogeneity (Yeo et al., 2020). An RNA-sequencing study was carried out on a group of 275 women who had been diagnosed with aggressive breast cancer. Multivariate prediction models were constructed to classify tumors into high and low transcriptomic grades based
on RNA-sequencing gene- and isoform-level expression data (Wang et al., 2016). Large data archives such as TCGA, which house omics data, make it possible to examine and contrast different types of malignancies across genomic and transcriptomic landscapes (Chen et al., 2020). Up- and down-regulated genes showed remarkable consistency across cancer types, including thymic epithelial tumors (Radovich et al., 2018), cholangiocarcinoma (Chen et al., 2020), prostate cancer (Yuan et al., 2020), pediatric leukemias and solid tumors (Ma et al., 2018), thyroid cancer (Yoo et al., 2019), breast cancer (Sahu et al., 2018), and other cancers (Dash et al., 2016a; Dash et al., 2016b; Dash et al., 2020; Dash et al., 2019). The discovery of disease-associated epigenetic markers has relied heavily on the results of epigenomics research (Costa, 2010). In breast cancer, the landscape of epigenetic modifications provides biomarkers for tumor development, progression, metastasis, therapy response, and recurrence (Lo and Sukumar, 2008). Various technologies, including next-generation sequencing and microfluidics, can identify epigenetic information such as DNA methylation and histone modification. However, single-cell epigenome analysis has made less progress than single-cell genomics and single-cell transcriptomics (Davalos et al., 2017). The robust analysis capabilities of NGS should therefore be used to examine the epigenetics of breast cancer metastasis at the single-cell level, to better understand the molecular mechanisms behind metastasis and accelerate the development of new treatment techniques and diagnostics for metastatic disease. ChIP-seq assays and WGS combined with BS-seq can be used to analyze methylation (Davalos et al., 2017). 
Researchers have developed ChIP-seq, a potent approach for discovering the DNA-binding sites of transcription factors and histone proteins, to produce high-resolution maps of histone modifications across the whole genome (Adriaens et al., 2016). In brief, ChIP-seq immunoprecipitates DNA-binding proteins together with their bound DNA, which is then extracted, purified, and sequenced. ChIP-seq technology has provided profound insights into the molecular mechanisms of gene regulation and into the pathways and events responsible for many diseases and biological disorders, such as tumorigenesis and the hallmarks of cancer (Young et al., 2011). Comparing the genome-wide profiles of histone modification marks in cancer and normal tissues has made it feasible to understand how epigenetic deregulation is involved in specific malignancies, including breast cancer (Grosselin et al., 2019) and lung cancer (Atsalaki et al., 2020).

Additionally, chemical alterations to particular DNA bases, like those to histones, can have profound epigenetic consequences. For example, modifying cytosine residues in a promoter DNA sequence can alter the expression of the corresponding genes. Single methylated cytosine bases in genomic DNA can be detected using whole-genome bisulfite sequencing, also known as bisulfite sequencing (BS-seq). It enables the examination of genome-wide methylome profiles after genomic DNA is treated with sodium bisulfite, providing a genome-wide map at single-base resolution (Vidal et al., 2017). In recent studies, the methyl-CpG binding domain (MBD), a conserved eukaryotic protein domain, has been shown to contain a sequence motif involved in binding cytosine residues that are methylated in the context of the 5’-CG-3’ dinucleotide. Researchers have used the MBD-isolated sequencing technique to examine the overall methylation pattern at a genome-wide level (Sweet and Ting, 2016). This method uses the methyl-CpG binding domain of MBD2 to precipitate methylated DNA, which is then sequenced. Indeed, changes in DNA methylation patterns, such as hypermethylation of the promoters and enhancers of tumor-suppressor genes, occur during cancer’s origin, progression, and dissemination (Vidal et al., 2017). Moreover, at the molecular level, epithelial-to-mesenchymal transition (EMT), a pivotal hallmark of cancer, requires significant transcriptional reprogramming, controlled by a restricted number of transcription factors. Genome-wide mapping of DNA-binding sites with ChIP-seq in cancers such as colorectal cancer therefore identifies genes associated with stemness and metastatic capacity (Beyes et al., 2019). NGS methods have made it possible to distinguish the specific chromatin landscapes and transcription factors involved in metastatic relapse to the lung or brain in breast cancer subtypes, which is helpful in clinical routine (Cai et al., 2020). 
Integrating epigenomic and transcriptomic analysis reveals super-enhancer heterogeneity among breast cancer subtypes and provides therapeutically critical biological insights into the most aggressive breast cancers, such as triple-negative breast cancer (Huang et al., 2021). On the other hand, numerous read-alignment (mapping) tools have been developed that differ in error rate, speed, memory use, sensitivity, and alignment accuracy. These tools include Bowtie (Langmead et al., 2009; Langmead, 2010), TopHat, BWA, CUSHAW, Genome Multitool (GEM), and the Genomic Short-read Nucleotide Alignment Program (GSNAP). The full set of sequence reads can then be assembled into a whole genome through de novo assembly; hence, genome analysis becomes more efficient, accurate, and cost-effective (Li et al., 2015).
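The bisulfite sequencing readout discussed earlier in this section reduces to a simple per-site ratio once reads are aligned: bisulfite treatment converts unmethylated cytosines so that they are read as thymine, while methylated cytosines are still read as cytosine. The sketch below computes that ratio; the function name, the site labels, and the read counts are hypothetical, and a real analysis would also account for incomplete conversion and sequencing error.

```python
def methylation_level(c_reads, t_reads):
    """Fraction of reads supporting methylation at one cytosine position.

    c_reads: reads showing C (protected from conversion, i.e. methylated)
    t_reads: reads showing T (converted, i.e. unmethylated)
    Returns None when the site has no coverage.
    """
    total = c_reads + t_reads
    if total == 0:
        return None
    return c_reads / total


# Made-up read counts for three CpG sites in a promoter.
sites = {"cpg_1": (18, 2), "cpg_2": (1, 19), "cpg_3": (0, 0)}

for name, (c, t) in sites.items():
    print(name, methylation_level(c, t))
```

Averaging such per-site levels over a promoter is one simple way to compare tumor and normal samples for the hypermethylation described above.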

For downstream NGS analysis, finding single nucleotide variations (SNVs) and small insertions and deletions (indels) has become essential, because SNVs are the most common form of disease-causing mutation (Serratì et al., 2016). SNV identification methods are crucial in whole-genome and whole-exome sequencing investigations, where they are applied after reads have been mapped to a reference genome. Many critical somatic mutations occur at low frequencies across various cancer samples. As a result, a significant issue for clinical diagnostic techniques is accurately identifying rare, low-frequency single nucleotide changes and genome structural variation in cancer samples (Mu, 2019). A critical first step in managing cancer is to find, understand, and characterize small genomic variants (Gómez-Romero et al., 2018).
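To make the low-frequency problem concrete, the following sketch computes a variant allele frequency (VAF) per site and applies simple depth and read-support filters. It is an illustration under stated assumptions only: the thresholds (`min_vaf`, `min_depth`, `min_alt_reads`), the function names, and the counts are hypothetical and are not taken from any of the cited tools, which use statistical error models rather than fixed cut-offs.

```python
def variant_allele_frequency(alt_reads, depth):
    """Fraction of reads at a site that carry the alternate allele."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_reads / depth


def call_low_frequency_sites(sites, min_vaf=0.01, min_depth=100, min_alt_reads=5):
    """Keep sites whose evidence passes all three (hypothetical) thresholds."""
    kept = []
    for site in sites:
        if site["depth"] < min_depth:
            continue  # too little coverage to trust a rare allele
        if site["alt_reads"] < min_alt_reads:
            continue  # a handful of reads may be sequencing error
        if variant_allele_frequency(site["alt_reads"], site["depth"]) >= min_vaf:
            kept.append(site["id"])
    return kept


# Made-up pileup summaries.
sites = [
    {"id": "site_a", "alt_reads": 12, "depth": 800},  # VAF 1.5%, well supported
    {"id": "site_b", "alt_reads": 2, "depth": 900},   # too few supporting reads
    {"id": "site_c", "alt_reads": 30, "depth": 60},   # coverage too low
]

print(call_low_frequency_sites(sites))
```

The design point is that a low VAF alone is not disqualifying; what matters is whether enough independent reads support it at adequate depth.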

Driver and Passenger Mutations in Cancer, and Methods for Differentiating In-between

Driver mutations give tumor cells a growth advantage. Because they are advantageous to tumor development, they are more likely to be positively selected, and therefore found, in a mixed population of cells. In contrast, passenger mutations give tumor cells no proliferative advantage and therefore do not contribute to cancer development (Stratton, Campbell, and Futreal 2009). To better understand the molecular underpinnings of carcinogenesis and to discover diagnostic and therapeutic targets, it is imperative to separate driver mutations from passenger mutations, which can be achieved using several methods and software tools. These strategies may rely on biological, statistical, or functional features (Zhang et al., 2014). Machine learning is another trend that could be used in this field (Biswas and Chakrabarti, 2020).

Strategies Relying on Differences in Biological Feature

Most approaches assume that the frequency (or score) of mutations in a driver gene will be higher than that in passenger genes. The probability that each gene containing somatic mutations is a driver is therefore estimated with statistical scores, which generally consider the gene’s size, nucleotide composition, and background non-synonymous mutation rate (Zhang et al., 2014).
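One simple way to turn this assumption into a score is a one-sided binomial test: given a gene's length, the number of samples, and a background mutation rate, how surprising is the observed mutation count? The sketch below is a generic statistical illustration, not the scoring scheme of any cited method; the function names and all numbers are hypothetical, and it ignores nucleotide composition, which the approaches above also take into account.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k mutated sites out of n, computed in log space."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def excess_mutation_p_value(observed, gene_length_bp, n_samples, background_rate):
    """P(X >= observed) if the gene mutated only at the background rate."""
    if not 0.0 < background_rate < 1.0:
        raise ValueError("background_rate must be strictly between 0 and 1")
    n_sites = gene_length_bp * n_samples  # total opportunities for a mutation
    if observed > n_sites:
        return 0.0
    below = sum(binomial_pmf(i, n_sites, background_rate) for i in range(observed))
    return max(0.0, 1.0 - below)


# Made-up example: a 1,000 bp gene in 100 tumors with a background rate of
# 1e-5 per base is expected to carry about one mutation; six were observed.
print(excess_mutation_p_value(6, 1000, 100, 1e-5))
```

Because the tail is obtained as one minus a partial sum, very small p-values lose precision; a production implementation would sum the upper tail directly or use a statistics library.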

In earlier high-profile papers, the Cancer Mutation Analysis program was used effectively to uncover driver mutations in the implicated genes. First, the algorithm estimates the passenger mutation rate, which can be done in a variety of ways, before analyzing mutations at the gene level. The procedure proposed by Carter et al. (2009) for calculating the passenger mutation rate is as follows:

i. Determine the number of non-silent single-base variations.
ii. Filter out known driver mutations. The COSMIC database is an excellent place to start, because it contains information on frequently occurring driver mutations, such as those in TP53, PIK3CA, and PTEN.
iii. Distribute the non-silent somatic single-base changes that are not in the driver mutation set across the categories of a base substitution matrix, by single-base substitution and dinucleotide context. This yields 24 categories.
iv. Calculate the frequency of each of the 24 categories from its count and the total number of mutations left after filtering out the known driver mutations.
v. Once the gene-level mutation scores have been determined, determine the overall mutation score. The log-likelihood ratio (logLRT) can be used to score each gene against the hypothesis that it is a passenger. Higher scores imply that the mutation rate exceeds the passenger rate, indicating cancer gene candidates.
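The steps above can be sketched as follows. This is a simplified stand-in and not the Cancer Mutation Analysis implementation: it assumes that a category frequency is the category count divided by the number of mutations remaining after filtering, it uses plain substitution labels instead of the 24 context-aware categories, and it scores genes with a one-sided Poisson log-likelihood ratio chosen here for illustration. Function names, gene names other than the drivers quoted in the text, and all counts are hypothetical.

```python
import math
from collections import Counter


def category_frequencies(mutations, known_driver_genes):
    """Steps i-iv: filter known drivers, then compute per-category frequencies.

    mutations: list of (gene, category) pairs for non-silent single-base changes.
    """
    remaining = [cat for gene, cat in mutations if gene not in known_driver_genes]
    total = len(remaining)
    if total == 0:
        return {}
    return {cat: n / total for cat, n in Counter(remaining).items()}


def poisson_log_likelihood_ratio(observed, expected):
    """Step v (assumed form): log-likelihood of the best-fitting rate versus
    the passenger rate; zero when the gene is not mutated above expectation."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    if observed <= expected:
        return 0.0
    return observed * math.log(observed / expected) - (observed - expected)


# Made-up input; only the driver gene names come from the text.
mutations = [
    ("TP53", "C>T"), ("GENE_A", "C>T"), ("GENE_A", "A>G"),
    ("GENE_B", "C>T"), ("PIK3CA", "A>G"),
]
drivers = {"TP53", "PIK3CA", "PTEN"}

print(category_frequencies(mutations, drivers))
print(poisson_log_likelihood_ratio(observed=10, expected=2.0))
```

Here `expected` would be derived from the passenger rate, the gene's length, and the number of samples; a higher ratio marks the gene as a stronger cancer gene candidate.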

Strategies Relying on Machine-Learning Methods

These approaches build a classifier on a predefined dataset comprising driver and passenger mutations and then apply the trained classifier, with its parameters, to a novel dataset. Tan et al. created driver mutation identification (DMI), a method for identifying driver mutations based on features of missense mutations (Tan et al., 2012). For each missense mutation, DMI includes the following features:

• Substitution score matrix;
• Changes in the physicochemical properties of the residues as a result of mutation;
• Characteristics of amino acids in a protein sequence, and
• Other UniProt, SwissProt, or COSMIC-derived annotated characteristics.
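A feature vector of the kind listed above might be assembled as in the sketch below. It is not the DMI feature set: it covers only a substitution score, one physicochemical change (Kyte-Doolittle hydropathy), and one sequence characteristic (relative position), and the function name, the tiny score table, and the example mutation are illustrative assumptions. The resulting dictionary would be passed to whichever classifier is trained on labelled driver and passenger mutations.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Illustrative substitution scores only; a real analysis would load a full
# matrix such as BLOSUM62 from a trusted source.
TOY_SUBSTITUTION_SCORES = {("R", "H"): 0, ("G", "D"): -1}


def missense_features(ref_aa, alt_aa, position, protein_length, scores):
    """Build a small, hypothetical feature vector for one missense mutation."""
    return {
        "substitution_score": scores[(ref_aa, alt_aa)],
        "hydropathy_change": HYDROPATHY[alt_aa] - HYDROPATHY[ref_aa],
        "relative_position": position / protein_length,
    }


# Made-up example: arginine to histidine at residue 175 of a 393-residue protein.
features = missense_features("R", "H", 175, 393, TOY_SUBSTITUTION_SCORES)
print(features)
```

Annotation-derived features, such as whether the residue lies in a known functional domain, would be added as further keys in the same dictionary.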

Methods Relating to the Effect of Mutations on Their Functional Properties

These methods assume that mutations with significant functional implications are more likely to be driver mutations than variants with minor functional effects (Zhang et al., 2014).

Pathway-Based and Network-Based Analysis

Pathway and network analysis is a valuable tool in genomics for analyzing large amounts of genomic data, especially when studying gene expression (PCAWG Drivers and Functional Interpretation Working Group et al. 2020). SNP data from genome-wide association studies can be used to interpret variants in terms of the biological processes in which the affected genes and proteins are involved, which offers promise for the future (Zheng et al., 2018). Biological pathways are expressive, descriptive, and informative, sometimes elaborate, representations used to organize and summarize fundamental processes. These pathways indicate the potential interactions connecting genes, proteins, and metabolites. Pathway diagrams are typically used to depict the information available on such biological processes (Kutmon et al., 2015). We believe that data visualization, such as interactive pathway diagrams and gene-gene interaction networks, enhances the interpretation of scientific data, the conclusions derived, and the discussion of follow-up concerns (Habermann et al., 2015). Gene set enrichment analysis (GSEA) (Subramanian et al., 2005), DAVID (Huang et al., 2009), and g:Profiler (Reimand et al., 2016) now present pathway-analysis results via lists, graphs, or links to the pathway diagram. We believe it is beneficial to include an
interactive pathway diagram and network visualizations with metadata from additional sources to aid in understanding the hallmarks of cancer, its development and progression, and the cellular and molecular involved pathways at the gene or molecular level. Data from many omics sources, such as gene expression and genetic variation, can be analyzed functionally using MetaCore’s software package, which provides an accurate p-values enriched in gene ontology, pathways, and related genes’ networks from literature. The sets have been curated to include only those genes that have been proven to be involved in tumorigenesis (Schultz et al., 2018). PathVisio 3, a computational tool, is software that allows researchers to edit, visualize, and analyze pathways (Burnett et al., 2021). A central panel in PathVisio enables the construction of pathway diagrams and displays pathway entities in various ways depending on the advanced data visualization settings (Kutmon et al., 2015). Moreover, With Caleydo, data visualization is easy with StratomeX, enRoute, and Entourage (Turkay et al., 2014). StratomeX organizes cancer patient data, and TCGA databases are queried for disease information (Marx, 2013). The Entourage view and the enRoute view focus on the interdependencies between pathways’ maps, and experimental data at the pathway/network level, respectively, are beneficial when conducting pathway analysis. In addition, Entourage provides useful capabilities for a wide range of molecular biology applications, including the visualization of biological route maps (Lex et al., 2013). This feature enables a more profound understanding, illustrating the functions that a gene identified in one pathway may play throughout a complex, interrelated process. Each pathway’s interconnections are shown with colored lines, making it easy to see which genes are linked together. 
EnRoute also selects a pathway’s genes and links them to experimental data from the TCGA, where CNVs can be seen (Partl et al., 2013). Genomic research is crucial to progress in cancer research. Research on cancer genomes has shown that gene anomalies are responsible for the formation and growth of numerous types of cancer (Marx, 2013). Genomic and cancer data visualization tools can aid our understanding of cancer biology and pave the way for novel techniques for diagnosing and treating the disease, allowing for more informed decision-making on treatment options or tailored therapy (Qu et al., 2019). Recent years have seen the development of novel technologies that enable the exploration of complex genomic data using visualization tools. Additional efforts are required to build new tools to satisfy the field’s evolving needs.
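
One statistic that commonly underlies pathway-enrichment results of the kind discussed in this section is the hypergeometric over-representation test: given a list of selected genes, how surprising is its overlap with a pathway's gene set? The sketch below is a minimal, assumption-laden illustration rather than the procedure of any specific tool named above; the function name `enrichment_p_value` and the example counts are hypothetical, and real analyses also correct for multiple testing across pathways.

```python
from math import comb


def enrichment_p_value(universe: int, pathway: int, selected: int, overlap: int) -> float:
    """One-sided hypergeometric p-value P(X >= overlap).

    `selected` genes are drawn without replacement from `universe` genes,
    of which `pathway` genes belong to the pathway of interest.
    """
    if not (0 <= pathway <= universe and 0 <= selected <= universe):
        raise ValueError("pathway and selected must lie between 0 and universe")
    max_overlap = min(pathway, selected)
    total = comb(universe, selected)
    # Sum the probability mass of every overlap at least as large as observed.
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, selected - k)
        for k in range(max(overlap, 0), max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 20 genes measured, 5 in the pathway,
    # 4 genes selected as altered, 2 of them in the pathway.
    print(enrichment_p_value(universe=20, pathway=5, selected=4, overlap=2))
```

A small p-value suggests that the pathway contains more of the selected genes than chance alone would explain, which is the signal that the list and graph views of enrichment tools summarize.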

Next-Generation Sequencing in the Era of Breast Cancer Researches

Though breast cancer may arise from several factors, it is thought to begin with an unbalanced activity of the pathways involved in the patterning and morphogenesis of mammary gland development (Feng et al., 2018). Several categories have been proposed according to the invasiveness, incidence, histology, or molecular profiling of breast cancer (Elbasyouni et al., 2021). The location of a tumor determines whether it is lobular or ductal. Immunohistochemical identification of signature receptors linked with specific cellular activities has traditionally been used to classify breast cancer (Elbasyouni et al., 2021). Luminal cell-derived breast cancers are the most prevalent and commonly express receptors such as the Estrogen Receptor (ER), the Progesterone Receptor (PR), or the Human Epidermal Growth Factor Receptor 2 (HER2/ErbB2) (Hynes and Watson, 2010). Cell proliferation, survival, differentiation, and angiogenesis are regulated by HER2, a transmembrane glycoprotein of the ErbB receptor family. Activation of the receptor and its subsequent phosphorylation trigger signaling pathways including PI3K, MAPK, and PKC, leading to pathogenic activities. These tumors may also have a specific TP53 mutation. Each category has a distinct pharmacological therapy (Feng et al., 2018). Triple-negative breast tumors lack expression of ER, PR, and HER2, and are most typically derived from cells of basal origin. In addition, a new class of breast tumors known as claudin-low has been discovered recently (Shinde et al., 2010). Indeed, based on molecular gene expression patterns, breast cancer has been reclassified into luminal A, luminal B, HER2-enriched, triple-negative, claudin-low, and basal-like subtypes in order to provide a more accurate prediction of clinical outcomes than conventional clinical and pathological markers (Arpino et al., 2013; Guler, 2017).
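
The receptor-based grouping described above can be sketched as a simple lookup on ER, PR, and HER2 status. This is a deliberately simplified illustration, not a clinical algorithm: the function name `receptor_group` and the label strings are assumptions, "HR" is used here as shorthand for hormone receptor (ER or PR), and the expression-based intrinsic subtypes (luminal A, luminal B, claudin-low, and so on) require gene expression data that this sketch does not model.

```python
def receptor_group(er_positive: bool, pr_positive: bool, her2_positive: bool) -> str:
    """Map ER/PR/HER2 immunohistochemistry calls to a descriptive receptor group."""
    hormone_receptor_positive = er_positive or pr_positive
    # Tumors lacking ER, PR, and HER2 expression are triple-negative.
    if not hormone_receptor_positive and not her2_positive:
        return "triple-negative"
    hr = "HR-positive" if hormone_receptor_positive else "HR-negative"
    her2 = "HER2-positive" if her2_positive else "HER2-negative"
    return f"{hr}/{her2}"


if __name__ == "__main__":
    # Hypothetical tumors described only by their receptor status.
    for status in [(True, True, False), (False, False, True), (False, False, False)]:
        print(status, "->", receptor_group(*status))
```
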
In addition to pathogenic mutations in the breast cancer susceptibility genes BRCA1 and BRCA2 (Lalloo et al., 2006), germline variants in ATM (Marouf et al., 2017), BARD1 (Toh et al., 2019), BLM (Bononi et al., 2020), CDH1 (Corso et al., 2020), CHEK2 (Stolarova et al., 2020), PALB2 (De Angelis et al., 2021), PMS2 and MSH6 (Roberts et al., 2018), ABRAXAS (FAM175A) (Renault et al., 2016), FANCC (ABCTB Investigators et al., 2019), and RAD51 and its paralog genes (B, C, and D) (Setton et al., 2021), as well as other genes, including TP53 (Lalloo et al., 2006) and PTEN (Heikkinen et al., 2011), may enhance breast cancer susceptibility. The genetic aspects of germline variant-positive breast cancers, especially their relationship with bi-allelic inactivation via loss of heterozygosity and second somatic mutations, remain unknown. These germline predispositions are strongly associated with the genetic risk of mammary carcinoma, and the management and prevention of the disease might benefit from increased research into, and understanding of, the genetic basis for familial risk (Inagaki-Kawata et al., 2020). Aside from identifying germline alterations and clarifying tumor origin, next-generation sequencing can impact clinical therapy (Guan et al., 2012). Current genetic testing criteria can miss individual germline changes in BRCA1/2, and some people opt out of genetic testing altogether; tumor genomics can then reveal a previously unknown germline mutation. For genes that are frequently altered in breast cancer, the somatic mutation rate found by WGS correlates with germline relevance across pools of identical WGS samples and panel testing. WGS may be used to compare mutation frequencies, but it can also be used to see whether mutation types and locations have changed over time (Medical Genome Initiative et al., 2020). Mutations in BRCA1/2 have a high germline transmission rate, and patients should be referred for genetic counseling if these mutations are found in malignancies. ATM, PMS2, CHEK2, ATR, and CDH1 mutations can also be inherited; genetic counseling in conjunction with a clinical and family history evaluation should therefore be considered. Although PTEN and TP53 hereditary disorders are well documented, somatic mutations in these genes rarely reflect germline changes.
Overall, the availability of large panels of breast cancer susceptibility genes has increased. Identifying pathogenic mutations in the BRCA1 and BRCA2 genes is now common practice to forecast the risk of developing breast or ovarian cancer, to investigate the need for risk-reducing ablation/mastectomy, and to plan for more effective treatment and personalized medicine (Sukumar et al., 2021). According to the genomic analysis by Kan and colleagues, Asian breast cancer patients had an increased prevalence of many mutation signatures (Kan et al., 2018). For example, these individuals have problems with DNA repair mechanisms, which is supported by research showing that younger Asian patients had higher rates of mutations that affect the ability of the BRCA1 and BRCA2 genes to repair DNA damage. Furthermore, BRCA1 deficiency may result in an insufficient response to replication stress, raising cancer risk (Sadeghi et al., 2020). As a result, patients are offered upfront screening for BRCA1 and BRCA2 mutations at the time of breast cancer diagnosis. A consensus on breast cancer panel testing is developing, clarifying the clinical significance of a variety of germline variants in order to establish an earlier diagnosis and a better prognosis (Taylor et al., 2018). Indeed, additional breast cancer-predisposing genes will inevitably be discovered (LaDuca et al., 2020). Moreover, an international screening program has been formed to compile sequencing data and to discover and validate novel breast cancer genes (Catana, Apostu, and Antemie, 2019). Exome sequencing, in turn, recently revealed that significant novel putative hereditary breast cancer susceptibility genes are involved in pathways that limit genome instability, including the homologous recombination repair (HRR) and mismatch repair (MMR) mechanisms (Kwei et al., 2010; Pećina-Šlaus et al., 2020) (Figure 3A). While DNA repair is critical for hereditary breast cancer suppression, cell cycle checkpoints (Oh et al., 2017) and programmed cell death/apoptosis pathways are also involved in hereditary breast cancer (Thu et al., 2018). Typically, these processes remove DNA-damaged cells, but haploinsufficiency of checkpoint regulator genes results in a buildup of mutant cells (Inoue and Fry, 2017). The prototypical checkpoint regulator is TP53, which coordinates many factors involved in the molecular pathways of genomic stability. Taken together, ATM, CHK2, TP53, and BRCA1–BARD1 can all act as checkpoints in cell cycle phases (G1, S, and G2) (Bassi et al., 2021); cells with impaired DNA repair pathways and non-functional cell cycle checkpoints are therefore expected to have a selective advantage in proliferating.
Identifying cancers deficient in HRR, whether due to germline or somatic BRCA1 and BRCA2 inactivation, is clinically critical, as these malignancies are selectively responsive to PARP inhibitors (PARPi) (Santana dos Santos et al., 2020). The mutational signature is thought to be a direct pathophysiological reflection of MMR pathway abrogation (Volkova et al., 2020), which in turn increases sensitivity to immune therapies; thus, WGS will soon be required for routine clinical analysis, enabling optimized precision medicine for each molecular type of breast cancer (Figure 3B). These conclusions have been made possible by improvements in whole-genome sequencing (Wyrick and Roberts, 2015; Davies, Morganella, et al., 2017; Davies, Glodzik, et al., 2017). Understanding the broader role of DNA damage repair in cancer has become a fundamental and appealing strategy for targeted cancer therapy (Cheng et al., 2020). In particular, developing novel hypotheses in this field based on previous findings is critical for identifying promising emerging druggable targets (Huang and Zhou, 2021). To better understand the genetic landscape of cancer, targeted panels have been developed in oncology to screen for hereditary cancer and to monitor somatic changes in patients with progressing cancer (Gulilat et al., 2019). This information can be used to find new therapies or repurpose existing ones. The development of targeted gene panels is ongoing (Rybin et al., 2021), and this developing technology holds promise for monitoring cancer initiation or relapse, tumor promotion, and evolution, including the emergence of treatment resistance (Bewicke-Copley et al., 2019) (Figure 3C).

Figure 3. Multi-omics markers to be introduced to routine panels for breast tumor progression prediction.

One of the primary goals of precision cancer treatment is to customize clinical management to the unique events associated with tumor genesis, progression, and the hallmark features of enhanced genomic instability associated with tumor aggressiveness (Bewicke-Copley et al., 2019). When breast cancer is sequenced with targeted NGS, the high frequency of clinically significant genetic mutations suggests that targeted therapeutics may be created for patients with breast cancer. From a practical standpoint, the most significant contribution of microarray-based technology to high-throughput molecular diagnostic tests was the discovery of prognostic markers (Glas et al., 2006). Numerous commercially available gene expression-based platforms have been implemented in the clinical setting, including Oncotype DX (Paik et al., 2004), MammaPrint (Tian et al., 2010), Breast Cancer Index (BCI) (‘Breast Cancer Index for Breast Cancer Prognosis,’ n.d.), PAM50 (Ohnstad et al., 2017) and EndoPredict (Kondo et al., 2011). They assist oncologists in determining which patients have a prognosis good enough to avoid chemotherapy. Additionally, these signatures have the greatest discriminatory strength in ER-positive cancer; their prognostic utility in ER-negative tumors is limited (Pu et al., 2020), as ER-negative tumors express significant quantities of proliferation-related genes (Yersal, 2014).

Multi-omics Data in Cancer Research

Omics technologies allow large data sets to be generated and examined quickly and easily (Murugesan and Premkumar, 2021). Examining several types of omics data together rather than individually, for example transcriptome, epigenome, and proteome data simultaneously, allows for more comprehensive conclusions (Horak, Fröhling, and Glimm, 2016). Das and colleagues conducted a comprehensive assessment of the utilization of multi-omics data in cancer research and oncology (Das et al., 2020). A single omics data type investigates one level in depth and can provide helpful information (for example, mutation detection), even though the complexity of cancer increases from the genome to the proteome (Chakraborty et al., 2018). As a result, integrating several omics data types provides a more comprehensive picture of cancer (Amar et al., 2017). A 2016 study demonstrating the usefulness of proteomic screening of breast cancer tissues revealed imperfect concordance between proteomic and genomic data, illustrating the challenges of relying on just one type of omics data to make crucial decisions (Ma et al., 2016).
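
To make the idea of concordance between omics layers concrete, the sketch below matches two layers by gene and computes the Pearson correlation between them. Everything in it is illustrative: the gene identifiers and abundance values are invented, the helper names `matched_layers` and `pearson` are assumptions, and real integration pipelines involve normalization, batch correction, and far more genes.

```python
from math import sqrt


def matched_layers(layer_a: dict, layer_b: dict) -> tuple:
    """Return values from both layers for the genes measured in both."""
    shared = sorted(set(layer_a) & set(layer_b))
    return [layer_a[g] for g in shared], [layer_b[g] for g in shared]


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-gene abundances; GENE_D and GENE_E lack a partner
    # measurement and are dropped by the matching step.
    mrna = {"GENE_A": 1.0, "GENE_B": 2.0, "GENE_C": 3.0, "GENE_D": 4.0}
    protein = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 4.0, "GENE_E": 5.0}
    x, y = matched_layers(mrna, protein)
    print(f"mRNA-protein correlation over {len(x)} shared genes: {pearson(x, y):.3f}")
```

A correlation well below 1 in such a comparison is one simple way of seeing why conclusions drawn from a single omics layer may not carry over to another.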

Omics Data Repositories Databases

Catalogue of Somatic Mutations in Cancer (COSMIC)

COSMIC has expanded from a four-gene study in 2004 (Bamford et al., 2004) to include all human genes, and it now describes 5 977 977 mutations in 1 391 372 samples (Forbes et al., 2017). Expert curators have culled information from 26 251 papers to compile a comprehensive list of 223 essential cancer genes. Together with genome-wide annotations from 466 large-scale systematic screens, whole-genome and Cancer Genome Atlas data, and open-access data from ICGC, these create a comprehensive picture of cancer (The International Cancer Genome Consortium, 2010). Data are displayed in tabular form or in interactive visualizations, such as the gene histogram, which collects and presents a plethora of data (Tate et al., 2019), and are available at https://cancer.sanger.uk/cosmic.

Proteomics Identification Database (PRIDE)

The PRIDE database is an open, standards-compliant repository for proteomics data such as protein and peptide identifications, post-translational modifications, and supporting spectral evidence.

Gene Expression Omnibus (GEO)

The GEO database maintains molecular abundance data for a wide range of high-throughput measurement techniques. These include comparative genomic hybridization, gene expression microarrays, genomic tiling arrays, single-nucleotide polymorphism detection, and chromatin immunoprecipitation sequencing (ChIP-seq), which is used to find protein-binding genomic regions. Other omics data repositories are presented in Table 1.

Table 1. Omics data repositories, their utility, and access links

Data repository: The Cancer Genome Atlas (TCGA)
Utility: Data on genomic sequencing, expression, methylation, and copy number variation have been generated, evaluated, and made available for more than 20,000 people with more than 30 different types of tumors.
Link: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga

Data repository: Clinical Proteomic Tumor Analysis Consortium (CPTAC)
Utility: Genomic, transcriptomic, and proteomic data of more than 1100 cancer patients, including those with various forms of cancer such as endometrial, renal, lung, breast, colon, ovarian, brain, head and neck, and pancreatic.
Link: https://proteomics.cancer.gov/programs/cptac

Data repository: International Cancer Genomics Consortium (ICGC)
Utility: Genomic, transcriptomic, and epigenomic alterations in 50 distinct tumour types and/or subtypes that are of clinical and social consequence throughout the world.
Link: https://dcc.icgc.org/

Data repository: Cancer Cell Line Encyclopedia (CCLE)
Utility: Sequencing data compiled from 947 different human cancer cell lines, combined with the pharmacological characteristics of 24 anticancer drugs from 479 different lines. Permitted the identification of drug susceptibility determinants and predictions based on genetic, lineage, and gene expression.
Link: https://sites.broadinstitute.org/ccle/

Data repository: Molecular Taxonomy of Breast Cancer International Consortium (METABRIC)
Utility: Contains clinical traits, expression, CNV profiles, SNP genotypes, and genomic and transcriptomic data derived from patients diagnosed with breast tumors.
Link: https://www.mercuriolab.umassmed.edu/metabric

Data repository: Omics Discovery Index (OmicsDI)
Utility: An open-source platform for accessing, discovering, and disseminating omics data sets. OmicsDI can combine data sets from proteomics, genomics, metabolomics, and transcriptomics.
Link: https://www.omicsdi.org/

Conclusion

Recent improvements in genomic sequencing have revealed genetic variation across individuals and within the same patient, which may develop from primary tumor sites to metastatic locations and during therapy. It is becoming easier to discover genetic abnormalities in tumors and blood samples, but turning that information into better clinical care remains a barrier. Omics-guided cancer medicine has increased the therapeutic impact of NGS, but it has also created new obstacles. While the primary focus is on predicting treatment success, NGS can also help with pathology diagnosis and prognosis and reveal germline mutations that increase a person’s risk of cancer. Compared to quantitative methods, NGS detects the presence or absence of genetic changes and permits detection even when samples include only a tiny fraction of tumor cells. With the advent of next-generation sequencing, oncologists face new problems in interpreting and applying genomic discoveries in the clinic. For each application, a balance must be struck between the depth of coverage and the breadth of the molecular NGS assay. Depth is the number of sequence reads covering a specific region of the genome. Whole-genome sequencing or whole-exome sequencing is frequently used in research applications, where it can uncover new genes or noncoding regions that control cancer biology. The good news is that many of these genes are well known, making it possible to run more focused tests on panels that specifically target them. Targeted NGS provides high sequencing depth, which compensates for non-tumor DNA present in the sample. As a result, hundreds of gene changes can be reliably detected even when tumor cells make up less than 20% of a sample. Focused panels are therefore crucial for incorporating NGS technology into routine clinical applications. In breast cancer, genomics faces a significant challenge in translating the vast amounts of NGS data into clinically useful information.
These data can be used to assess current therapeutic schemes and develop new ones that target newly identified druggable genes or minor tumor sub-clones. Furthermore, single-cell RNA sequencing (scRNA-seq) and next-generation RNA sequencing have been beneficial for studying cancer stem cells and predicting the epithelial-to-mesenchymal transition, metastasis, and organotropism. Every advancement in cancer understanding is aimed at improving the care of patients. Although NGS-based breakthroughs are significant from a scientific standpoint, the clinical advantages of these advancements are still under discussion. It is clear that to study cancer progression, uncover new therapeutic approaches, and create novel cancer biomarkers, we must use a variety of omics methodologies at various levels. Multidimensional omics techniques are required to portray the whole picture of cancer-host interactions, whereas a single type of omics investigation can disclose much information, but only at a single level. Instead of looking at just one omics level, multi-omics research can uncover how the malignant change affects the flow of information from that level up to the next, and it can link the malignant genotype to phenotypic traits.

In conclusion, computational tools must keep pace with advances in sequencing technology to meet new technical problems and support new applications. For example, as sequencing systems become capable of producing longer reads, new mapping methods are needed to align those long reads reliably and efficiently. Thanks to these advances in analyzing multi-omics data (genome, epigenome, and transcriptome), researchers will be able to reconstruct the biological networks functioning at the cellular level in unusual cell types and cell states across different cancer types, stages, and grades, including breast cancer. These developments will also allow computational biologists and bioinformaticians to move into clinical diagnostics, thereby helping to solve the biggest enigma in oncological multi-omics.
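
The relationship between sequencing depth and tumor content noted in this conclusion can be illustrated with a simple binomial model. The sketch below rests on strong simplifying assumptions that are not taken from the chapter: the mutation is clonal and heterozygous in a diploid region (so its expected variant allele fraction is half the tumor fraction), reads are independent, and sequencing errors are ignored. The function names and the threshold of five supporting reads are likewise hypothetical.

```python
from math import comb


def expected_vaf(tumor_fraction: float) -> float:
    """Expected variant allele fraction under the simplifying assumption of a
    clonal heterozygous somatic mutation in a diploid region."""
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    return tumor_fraction / 2.0


def detection_probability(depth: int, vaf: float, min_variant_reads: int) -> float:
    """Probability of seeing at least `min_variant_reads` variant reads among
    `depth` reads, modelling each read as an independent Bernoulli draw."""
    if depth < 0 or min_variant_reads < 0:
        raise ValueError("depth and min_variant_reads must be non-negative")
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("vaf must be between 0 and 1")
    # Probability of observing fewer variant reads than the threshold.
    missed = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min(min_variant_reads, depth + 1))
    )
    return max(0.0, 1.0 - missed)


if __name__ == "__main__":
    vaf = expected_vaf(0.2)  # sample with 20% tumor cells
    for depth in (30, 100, 500):
        p = detection_probability(depth, vaf, min_variant_reads=5)
        print(f"depth {depth:>3}: P(at least 5 variant reads) = {p:.3f}")
```

Under these assumptions the detection probability rises with depth at a fixed tumor fraction, which is consistent with the argument that deeply sequenced targeted panels compensate for non-tumor DNA in the sample.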

References

ABCTB Investigators, NBCS Collaborators, Thilo Dörk, Paolo Peterlongo, Arto Mannermaa, Manjeet K. Bolla, Qin Wang, Joe Dennis, Thomas Ahearn, Irene L. Andrulis, Hoda Anton-Culver, Volker Arndt, Kristan J. Aronson, Annelie Augustinsson, Laura E. Beane Freeman, Matthias W. Beckmann, Alicia Beeghly-Fadiel, Sabine Behrens, Marina Bermisheva, Carl Blomqvist, Natalia V. Bogdanova, Stig E. Bojesen, Hiltrud Brauch, Hermann Brenner, Barbara Burwinkel, Federico Canzian, Tsun L. Chan, Jenny Chang-Claude, Stephen J. Chanock, Ji-Yeob Choi, Hans Christiansen, Christine L. Clarke, Fergus J. Couch, Kamila Czene, Mary B. Daly, Isabel dos-Santos-Silva, Miriam Dwek, Diana M. Eccles, Arif B. Ekici, Mikael Eriksson, D. Gareth Evans, Peter A. Fasching, Jonine Figueroa, Henrik Flyger, Lin Fritschi, Marike Gabrielson, Manuela Gago-Dominguez, Chi Gao, Susan M. Gapstur, Montserrat García-Closas, José A. García-Sáenz, Mia M. Gaudet, Graham G. Giles, Mark S. Goldberg, David E. Goldgar, Pascal Guénel, Lothar Haeberle, Christopher A. Haiman, Niclas Håkansson, Per Hall, Ute Hamann, Mikael Hartman, Jan Hauke, Alexander Hein, Peter Hillemanns, Frans B. L. Hogervorst, Maartje J. Hooning, John L. Hopper, Tony Howell, Dezheng Huo, Hidemi Ito, Motoki Iwasaki, Anna Jakubowska, Wolfgang Janni, Esther M. John, Audrey Jung, Rudolf Kaaks, Daehee Kang, Pooja Middha Kapoor, Elza Khusnutdinova, Sung-Won Kim, Cari M. Kitahara, Stella Koutros, Peter Kraft, Vessela N. Kristensen, Ava Kwong, Diether Lambrechts, Loic Le Marchand, Jingmei Li, Sara Lindström, Martha Linet, Wing-Yee Lo, Jirong Long, Artitaya Lophatananon, Jan Lubiński, Mehdi Manoochehri, Siranoush Manoukian, Sara Margolin, Elena Martinez, Keitaro Matsuo, Dimitris Mavroudis, Alfons Meindl, Usha Menon, Roger L. Milne, Nur Aishah Mohd Taib, Kenneth Muir, Anna Marie Mulligan, Susan L. Neuhausen, Heli Nevanlinna, Patrick Neven, William G. Newman, Kenneth Offit, Olufunmilayo I. Olopade, Andrew F. Olshan, Janet E. Olson, Håkan Olsson, Sue K. Park, Tjoung-Won Park-Simon, Julian Peto, Dijana Plaseska-Karanfilska, Esther Pohl-Rescigno, Nadege Presneau, Brigitte Rack, Paolo Radice, Muhammad U. Rashid, Gad Rennert, Hedy S. Rennert, Atocha Romero, Matthias Ruebner, Emmanouil Saloustros, Marjanka K. Schmidt, Rita K. Schmutzler, Michael O. Schneider, Minouk J. Schoemaker, Christopher Scott, Chen-Yang Shen, Xiao-Ou Shu, Jacques Simard, Susan Slager, Snezhana Smichkoska, Melissa C. Southey, John J. Spinelli, Jennifer Stone, Harald Surowy, Anthony J. Swerdlow, Rulla M. Tamimi, William J. Tapper, Soo H. Teo, Mary Beth Terry, Amanda E. Toland, Rob A. E. M. Tollenaar, Diana Torres, Gabriela Torres-Mejía, Melissa A. Troester, Thérèse Truong, Shoichiro Tsugane, Michael Untch, Celine M. Vachon, Ans M. W. van den Ouweland, Elke M. van Veen, Joseph Vijai, Camilla Wendt, Alicja Wolk, Jyh-Cherng Yu, Wei Zheng, Argyrios Ziogas, Elad Ziv, Alison M. Dunning, Paul D. P. Pharoah, Detlev Schindler, Peter Devilee, and Douglas F. Easton. 2019. ‘Two Truncating Variants in FANCC and Breast Cancer Risk’. Scientific Reports 9 (1): 12524. https://doi.org/10.1038/s41598-019-48804-y.
Adriaens, Michiel E., Peggy Prickaerts, Michelle Chan-Seng-Yue, Twan van den Beucken, Vivian E. H. Dahlmans, Lars M. Eijssen, Timothy Beck, Bradly G. Wouters, Jan Willem Voncken, and Chris T. A. Evelo. 2016. ‘Quantitative Analysis of ChIP-Seq Data Uncovers Dynamic and Sustained H3K4me3 and H3K27me3 Modulation in Cancer Cells under Hypoxia’. Epigenetics and Chromatin 9 (1): 48. https://doi.org/10.1186/s13072-016-0090-4.
Amar, D., S. Izraeli, and R.
Shamir. 2017. ‘Utilizing Somatic Mutation Data from Numerous Studies for Cancer Research: Proof of Concept and Applications’. Oncogene 36 (24): 3375–83. https://doi.org/10.1038/onc.2016.489.
Anstine, Lindsey J., and Ruth Keri. 2019. ‘A New View of the Mammary Epithelial Hierarchy and Its Implications for Breast Cancer Initiation and Metastasis’. Journal of Cancer Metastasis and Treatment 2019 (June). https://doi.org/10.20517/2394-4722.2019.24.
Arpino, Grazia, Daniele Generali, Anna Sapino, Lucia Del Matro, Antonio Frassoldati, Michelino de Laurentis, Paolo Pronzato, Giorgio Mustacchi, Marina Cazzaniga, Sabino De Placido, Pierfranco Conte, Mariarosa Cappelletti, Vanessa Zanoni, Andrea Antonelli, Mario Martinotti, Fabio Puglisi, Alfredo Berruti, Alberto Bottini, and Luigi Dogliotti. 2013. ‘Gene Expression Profiling in Breast Cancer: A Clinical Perspective’. The Breast 22 (2): 109–20. https://doi.org/10.1016/j.breast.2013.01.016.
Atkinson, Sophie R., Samuel Marguerat, and Jürg Bähler. 2012. ‘Exploring Long Non-Coding RNAs through Sequencing’. Seminars in Cell and Developmental Biology 23 (2): 200–205. https://doi.org/10.1016/j.semcdb.2011.12.003.
Atsalaki, Xanthoula, Lefteris Koumakis, George Potamias, and Manolis Tsiknakis. 2020. ‘Chip-Seq and Gene Expression Data for the Identification of Functional Sub-Pathways: A Proof of Concept in Lung Cancer’. Preprint. Systems Biology. https://doi.org/10.1101/2020.06.15.151712.
Bakhoum, Mathieu F., Ellis J. Curtis, Michael H. Goldbaum, and Paul S. Mischel. 2021. ‘BAP1 Methylation: A Prognostic Marker of Uveal Melanoma Metastasis’. NPJ Precision Oncology 5 (1): 89. https://doi.org/10.1038/s41698-021-00226-8.
Bamford, S., E. Dawson, S. Forbes, J. Clements, R. Pettett, A. Dogan, A. Flanagan, J. Teague, P. A. Futreal, M. R. Stratton, and R. Wooster. 2004. ‘The COSMIC (Catalogue of Somatic Mutations in Cancer) Database and Website’. British Journal of Cancer 91 (2): 355–58. https://doi.org/10.1038/sj.bjc.6601894.
Bassi, Christian, Jerome Fortin, Bryan E. Snow, Andrew Wakeham, Jason Ho, Jillian Haight, Annick You-Ten, Emily Cianci, Luke Buckler, Chiara Gorrini, Vuk Stambolic, and Tak W. Mak. 2021. ‘The PTEN and ATM Axis Controls the G1/S Cell Cycle Checkpoint and Tumorigenesis in HER2-Positive Breast Cancer’. Cell Death & Differentiation, May. https://doi.org/10.1038/s41418-021-00799-8.
Behjati, Sam, and Patrick S Tarpey. 2013. ‘What Is next Generation Sequencing?’ Archives of Disease in Childhood - Education & Practice Edition 98 (6): 236–38. https://doi.org/10.1136/archdischild-2013-304340.
Bewicke-Copley, Findlay, Emil Arjun Kumar, Giuseppe Palladino, Koorosh Korfi, and Jun Wang. 2019. ‘Applications and Analysis of Targeted Genomic Sequencing in Cancer Studies’. Computational and Structural Biotechnology Journal 17: 1348–59. https://doi.org/10.1016/j.csbj.2019.10.004.
Beyes, Sven, Geoffroy Andrieux, Monika Schrempp, David Aicher, Janna Wenzel, Pablo Antón-García, Melanie Boerries, and Andreas Hecht. 2019. ‘Genome-Wide Mapping of DNA-Binding Sites Identifies Stemness-Related Genes as Directly Repressed Targets of SNAIL1 in Colorectal Cancer Cells’. Oncogene 38 (40): 6647–61. https://doi.org/10.1038/s41388-019-0905-4.
Biswas, Nupur, and Saikat Chakrabarti. 2020.
‘Artificial Intelligence (AI)-Based Systems Biology Approaches in Multi-Omics Data Analysis of Cancer’. Frontiers in Oncology 10 (October): 588221. https://doi.org/10.3389/fonc.2020.588221.
Bononi, Angela, Keisuke Goto, Guntulu Ak, Yoshie Yoshikawa, Mitsuru Emi, Sandra Pastorino, Lorenzo Carparelli, Angelica Ferro, Masaki Nasu, Jin-Hee Kim, Joelle S. Suarez, Ronghui Xu, Mika Tanji, Yasutaka Takinishi, Michael Minaai, Flavia Novelli, Ian Pagano, Giovanni Gaudino, Harvey I. Pass, Joanna Groden, Joseph J. Grzymski, Muzaffer Metintas, Muhittin Akarsu, Betsy Morrow, Raffit Hassan, Haining Yang, and Michele Carbone. 2020. ‘Heterozygous Germline BLM Mutations Increase Susceptibility to Asbestos and Mesothelioma’. Proceedings of the National Academy of Sciences 117 (52): 33466–73. https://doi.org/10.1073/pnas.2019652117.
Bozic, I., T. Antal, H. Ohtsuki, H. Carter, D. Kim, S. Chen, R. Karchin, K. W. Kinzler, B. Vogelstein, and M. A. Nowak. 2010. ‘Accumulation of Driver and Passenger Mutations during Tumor Progression’. Proceedings of the National Academy of Sciences 107 (43): 18545–50. https://doi.org/10.1073/pnas.1010978107.
‘Breast Cancer Index for Breast Cancer Prognosis’. n.d., 6.
Burnett, Mervin, Vito Rodolico, Fan Shen, Roger Leng, Mingyong Zhang, David D. Eisenstat, and Consolato Sergi. 2021. ‘PathVisio Analysis: An Application Targeting the MiRNA Network Associated with the P53 Signaling Pathway in Osteosarcoma’. BIOCELL 45 (1): 17–26. https://doi.org/10.32604/biocell.2021.013973.
Cai, Wesley L., Celeste B. Greer, Jocelyn F. Chen, Anna Arnal-Estapé, Jian Cao, Qin Yan, and Don X. Nguyen. 2020. ‘Specific Chromatin Landscapes and Transcription Factors Couple Breast Cancer Subtype with Metastatic Relapse to Lung or Brain’. BMC Medical Genomics 13 (1): 33. https://doi.org/10.1186/s12920-020-0695-0.
Carter, Hannah, Sining Chen, Leyla Isik, Svitlana Tyekucheva, Victor E. Velculescu, Kenneth W. Kinzler, Bert Vogelstein, and Rachel Karchin. 2009. ‘Cancer-Specific High-Throughput Annotation of Somatic Mutations: Computational Prediction of Driver Missense Mutations’. Cancer Research 69 (16): 6660–67. https://doi.org/10.1158/0008-5472.CAN-09-1133.
Catana, Andreea, Adina Patricia Apostu, and Razvan-Geo Antemie. 2019. ‘Multi Gene Panel Testing for Hereditary Breast Cancer - Is It Ready to Be Used?’ Medicine and Pharmacy Reports, July. https://doi.org/10.15386/mpr-1083.
Chakraborty, Sajib, Md. Ismail Hosen, Musaddeque Ahmed, and Hossain Uddin Shekhar. 2018. ‘Onco-Multi-OMICS Approach: A New Frontier in Cancer Research’. BioMed Research International 2018 (October): 1–14. https://doi.org/10.1155/2018/9836256.
Chappell, Kevin, Kanishka Manna, Charity L. Washam, Stefan Graw, Duah Alkam, Matthew D. Thompson, Maroof Khan Zafar, Lindsey Hazeslip, Christopher Randolph, Allen Gies, Jordan T. Bird, Alicia K Byrd, Sayem Miah, and Stephanie D. Byrum. 2021. ‘Multi-Omics Data Integration Reveals Correlated Regulatory Features of Triple Negative Breast Cancer’. Molecular Omics 17 (5): 677–91. https://doi.org/10.1039/D1MO00117E.
Chen, Geng, Zhixiong Cai, Xiuqing Dong, Jing Zhao, Song Lin, Xi Hu, Fang-E Liu, Xiaolong Liu, and Huqing Zhang. 2020. ‘Genomic and Transcriptomic Landscape of Tumor Clonal Evolution in Cholangiocarcinoma’. Frontiers in Genetics 11 (March): 195. https://doi.org/10.3389/fgene.2020.00195.
Chen, Yi-Xiao, Hao Chen, Yu Rong, Feng Jiang, Jia-Bin Chen, Yuan-Yuan Duan, DongLi Zhu, Tie-Lin Yang, Zhijun Dai, Shan-Shan Dong, and Yan Guo. 2020. ‘An Integrative Multi-Omics Network-Based Approach Identifies Key Regulators for Breast Cancer’. Computational and Structural Biotechnology Journal 18: 2826–35. https://doi.org/10.1016/j.csbj.2020.10.001. Cheng, Angela S., Samuel C. Y. Leung, Dongxia Gao, Samantha Burugu, Meenakshi Anurag, Matthew J. Ellis, and Torsten O. Nielsen. 2020. ‘Mismatch Repair Protein Loss in Breast Cancer: Clinicopathological Associations in a Large British Columbia Cohort’. Breast Cancer Research and Treatment 179 (1): 3–10. https://doi.org/10. 1007/s10549-019-05438-y. Conway, Jake R., Jeremy L. Warner, Wendy S. Rubinstein, and Robert S. Miller. 2019. ‘Next-Generation Sequencing and the Clinical Oncology Workflow: Data Challenges, Proposed Solutions, and a Call to Action’. JCO Precision Oncology, no. 3 (December): 1–10. https://doi.org/10.1200/PO.19.00232. Corso, Giovanni, Giacomo Montagna, Joana Figueiredo, Carlo La Vecchia, Uberto Fumagalli Romario, Maria Sofia Fernandes, Susana Seixas, Franco Roviello, Cristina Trovato, Elena Guerini-Rocco, Nicola Fusco, Gabriella Pravettoni, Serena Petrocchi, Anna Rotili, Giulia Massari, Francesca Magnoni, Francesca De Lorenzi, Manuela

Bottoni, Viviana Galimberti, João Miguel Sanches, Mariarosaria Calvello, Raquel Seruca, and Bernardo Bonanni. 2020. ‘Hereditary Gastric and Breast Cancer Syndromes Related to CDH1 Germline Mutation: A Multidisciplinary Clinical Review’. Cancers 12 (6): 1598. https://doi.org/10.3390/cancers12061598.
Costa, Fabricio. 2010. ‘Epigenomics in Cancer Management’. Cancer Management and Research, October, 255. https://doi.org/10.2147/CMAR.S7280.
Das, Tonmoy, Geoffroy Andrieux, Musaddeque Ahmed, and Sajib Chakraborty. 2020. ‘Integration of Online Omics-Data Resources for Cancer Research’. Frontiers in Genetics 11 (October): 578345. https://doi.org/10.3389/fgene.2020.578345.
Dash, Sujata, and Bichitrananda Patra. “Genetic diagnosis of cancer by evolutionary fuzzy-rough based neural-network ensemble.” In Data Analytics in Medicine: Concepts, Methodologies, Tools, and Applications, pp. 645-662. IGI Global, 2020.
Dash, Sujata, Ruppa Thulasiram, and Parimala Thulasiraman. “Modified firefly algorithm with chaos theory for feature selection: A predictive model for medical data.” International Journal of Swarm Intelligence Research (IJSIR) 10, no. 2 (2019): 1-20.
Dash, Sujata. “A diverse meta learning ensemble technique to handle imbalanced microarray dataset.” In Advances in Nature and Biologically Inspired Computing, pp. 1-13. Springer, Cham, 2016a.
Dash S., Subudhi B., Computational Intelligence Applications in Bioinformatics, 2016, pp.1-514, IGI Global, USA. DOI: 10.4018/978-1-5225-0427-6 USA, 2016b.
Davalos, Veronica, Anna Martinez-Cardus, and Manel Esteller. 2017. ‘The Epigenomic Revolution in Breast Cancer’. The American Journal of Pathology 187 (10): 2163–74. https://doi.org/10.1016/j.ajpath.2017.07.002.
Davies, Helen, Dominik Glodzik, Sandro Morganella, Lucy R Yates, Johan Staaf, Xueqing Zou, Manasa Ramakrishna, Sancha Martin, Sandrine Boyault, Anieta M Sieuwerts, Peter T Simpson, Tari A King, Keiran Raine, Jorunn E Eyfjord, Gu Kong, Åke Borg, Ewan Birney, Hendrik G Stunnenberg, Marc J van de Vijver, Anne-Lise Børresen-Dale, John W M Martens, Paul N Span, Sunil R Lakhani, Anne Vincent-Salomon, Christos Sotiriou, Andrew Tutt, Alastair M Thompson, Steven Van Laere, Andrea L Richardson, Alain Viari, Peter J Campbell, Michael R Stratton, and Serena Nik-Zainal. 2017. ‘HRDetect Is a Predictor of BRCA1 and BRCA2 Deficiency Based on Mutational Signatures’. Nature Medicine 23 (4): 517–25. https://doi.org/10.1038/nm.4292.
Davies, Helen, Sandro Morganella, Colin A. Purdie, Se Jin Jang, Elin Borgen, Hege Russnes, Dominik Glodzik, Xueqing Zou, Alain Viari, Andrea L. Richardson, Anne-Lise Børresen-Dale, Alastair Thompson, Jorunn E. Eyfjord, Gu Kong, Michael R. Stratton, and Serena Nik-Zainal. 2017. ‘Whole-Genome Sequencing Reveals Breast Cancers with Mismatch Repair Deficiency’. Cancer Research 77 (18): 4755–62. https://doi.org/10.1158/0008-5472.CAN-17-1083.
De Angelis, Carmine, Carmela Nardelli, Paola Concolino, Martina Pagliuca, Mario Setaro, Elisa De Paolis, Pietro De Placido, Valeria Forestieri, Giovanni Luca Scaglione, Annalisa Ranieri, Barbara Lombardo, Lucio Pastore, Sabino De Placido, and Ettore Capoluongo. 2021. ‘Case Report: Detection of a Novel Germline PALB2 Deletion in a Young Woman with Hereditary Breast Cancer: When the Patient’s Phenotype

History Doesn’t Lie’. Frontiers in Oncology 11 (February): 602523. https://doi.org/10.3389/fonc.2021.602523.
Elbasyouni, Amel, Leila Saadi, and AbdelKarim Baha. 2021. “Epidemiological Profile and Distribution of Prognostic Factors in Invasive Breast Cancer among Algerian Women”. Onco Review, September. https://doi.org/10.24292/01.OR.124220921.
Feng, Yixiao, Mia Spezia, Shifeng Huang, Chengfu Yuan, Zongyue Zeng, Linghuan Zhang, Xiaojuan Ji, Wei Liu, Bo Huang, Wenping Luo, Bo Liu, Yan Lei, Scott Du, Akhila Vuppalapati, Hue H. Luu, Rex C. Haydon, Tong-Chuan He, and Guosheng Ren. 2018. ‘Breast Cancer Development and Progression: Risk Factors, Cancer Stem Cells, Signaling Pathways, Genomics, and Molecular Pathogenesis’. Genes & Diseases 5 (2): 77–106. https://doi.org/10.1016/j.gendis.2018.05.001.
Forbes, Simon A., David Beare, Harry Boutselakis, Sally Bamford, Nidhi Bindal, John Tate, Charlotte G. Cole, Sari Ward, Elisabeth Dawson, Laura Ponting, Raymund Stefancsik, Bhavana Harsha, Chai Yin Kok, Mingming Jia, Harry Jubb, Zbyslaw Sondka, Sam Thompson, Tisham De, and Peter J. Campbell. 2017. ‘COSMIC: Somatic Cancer Genetics at High-Resolution’. Nucleic Acids Research 45 (D1): D777–83. https://doi.org/10.1093/nar/gkw1121.
Glas, Annuska M., Arno Floore, Leonie J. M. J. Delahaye, Anke T. Witteveen, Rob C. F. Pover, Niels Bakx, Jaana S. T. Lahti-Domenici, Tako J. Bruinsma, Marc O. Warmoes, René Bernards, Lodewyk F. A. Wessels, and Laura J. Van ‘t Veer. 2006. ‘Converting a Breast Cancer Microarray Signature into a High-Throughput Diagnostic Test’. BMC Genomics 7 (1): 278. https://doi.org/10.1186/1471-2164-7-278.
Gómez-Romero, Laura, Kim Palacios-Flores, José Reyes, Delfino García, Margareta Boege, Guillermo Dávila, Margarita Flores, Michael C. Schatz, and Rafael Palacios. 2018. ‘Precise Detection of de Novo Single Nucleotide Variants in Human Genomes’. Proceedings of the National Academy of Sciences 115 (21): 5516–21. https://doi.org/10.1073/pnas.1802244115.
Grosselin, Kevin, Adeline Durand, Justine Marsolier, Adeline Poitou, Elisabetta Marangoni, Fariba Nemati, Ahmed Dahmani, Sonia Lameiras, Fabien Reyal, Olivia Frenoy, Yannick Pousse, Marcel Reichen, Adam Woolfe, Colin Brenan, Andrew D. Griffiths, Céline Vallot, and Annabelle Gérard. 2019. ‘High-Throughput Single-Cell ChIP-Seq Identifies Heterogeneity of Chromatin States in Breast Cancer’. Nature Genetics 51 (6): 1060–66. https://doi.org/10.1038/s41588-019-0424-9.
Guan, Yan-Fang, Gai-Rui Li, Rong-Jiao Wang, Yu-Ting Yi, Ling Yang, Dan Jiang, Xiao-Ping Zhang, and Yin Peng. 2012. ‘Application of Next-Generation Sequencing in Clinical Oncology to Advance Personalized Treatment of Cancer’. Chinese Journal of Cancer 31 (10): 463–70. https://doi.org/10.5732/cjc.012.10216.
Guler, E. Nilufer. 2017. ‘Gene Expression Profiling in Breast Cancer and Its Effect on Therapy Selection in Early-Stage Breast Cancer’. European Journal of Breast Health 13 (4): 168–74. https://doi.org/10.5152/ejbh.2017.3636.
Gulilat, Markus, Tyler Lamb, Wendy A. Teft, Jian Wang, Jacqueline S. Dron, John F. Robinson, Rommel G. Tirona, Robert A. Hegele, Richard B. Kim, and Ute I. Schwarz. 2019. ‘Targeted next Generation Sequencing as a Tool for Precision Medicine’. BMC Medical Genomics 12 (1): 81. https://doi.org/10.1186/s12920-019-0527-2.

Habermann, Bianca, Jose Villaveces, and Prasanna Koti. 2015. ‘Tools for Visualization and Analysis of Molecular Networks, Pathways, and -Omics Data’. Advances and Applications in Bioinformatics and Chemistry, June, 11. https://doi.org/10.2147/AABC.S63534.
Head, Steven R., H. Kiyomi Komori, Sarah A. LaMere, Thomas Whisenant, Filip Van Nieuwerburgh, Daniel R. Salomon, and Phillip Ordoukhanian. 2014. ‘Library Construction for Next-Generation Sequencing: Overviews and Challenges’. BioTechniques 56 (2): 61–77. https://doi.org/10.2144/000114133.
Heikkinen, Tuomas, Dario Greco, Liisa M Pelttari, Johanna Tommiska, Pia Vahteristo, Päivi Heikkilä, Carl Blomqvist, Kristiina Aittomäki, and Heli Nevanlinna. 2011. ‘Variants on the Promoter Region of PTEN Affect Breast Cancer Progression and Patient Survival’. Breast Cancer Research 13 (6): R130. https://doi.org/10.1186/bcr3076.
Hong, Mingye, Shuang Tao, Ling Zhang, Li-Ting Diao, Xuanmei Huang, Shaohui Huang, Shu-Juan Xie, Zhen-Dong Xiao, and Hua Zhang. 2020. ‘RNA Sequencing: New Technologies and Applications in Cancer Research’. Journal of Hematology & Oncology 13 (1): 166. https://doi.org/10.1186/s13045-020-01005-x.
Horak, Peter, Stefan Fröhling, and Hanno Glimm. 2016. ‘Integrating Next-Generation Sequencing into Clinical Oncology: Strategies, Promises and Pitfalls’. ESMO Open 1 (5): e000094. https://doi.org/10.1136/esmoopen-2016-000094.
Huang, Da Wei, Brad T. Sherman, and Richard A. Lempicki. 2009. ‘Bioinformatics Enrichment Tools: Paths toward the Comprehensive Functional Analysis of Large Gene Lists’. Nucleic Acids Research 37 (1): 1–13. https://doi.org/10.1093/nar/gkn923.
Huang, Hao, Jianyang Hu, Alishba Maryam, Qinghua Huang, Yuchen Zhang, Saravanan Ramakrishnan, Jingyu Li, Haiying Ma, Victor W. S. Ma, Wah Cheuk, Grace Y. K. So, Wei Wang, William C. S. Cho, Liang Zhang, Kui Ming Chan, Xin Wang, and Y. Rebecca Chin. 2021. ‘Defining Super-Enhancer Landscape in Triple-Negative Breast Cancer by Multiomic Profiling’. Nature Communications 12 (1): 2242. https://doi.org/10.1038/s41467-021-22445-0.
Huang, Ruixue, and Ping-Kun Zhou. 2021. ‘DNA Damage Repair: Historical Perspectives, Mechanistic Pathways and Clinical Translation for Targeted Cancer Therapy’. Signal Transduction and Targeted Therapy 6 (1): 254. https://doi.org/10.1038/s41392-021-00648-7.
Hynes, N. E., and C. J. Watson. 2010. ‘Mammary Gland Growth Factors: Roles in Normal Development and in Cancer’. Cold Spring Harbor Perspectives in Biology 2 (8): a003186–a003186. https://doi.org/10.1101/cshperspect.a003186.
Inagaki-Kawata, Yukiko, Kenichi Yoshida, Nobuko Kawaguchi-Sakita, Masahiro Kawashima, Tomomi Nishimura, Noriko Senda, Yusuke Shiozawa, Yasuhide Takeuchi, Yoshikage Inoue, Aiko Sato-Otsubo, Yoichi Fujii, Yasuhito Nannya, Eiji Suzuki, Masahiro Takada, Hiroko Tanaka, Yuichi Shiraishi, Kenichi Chiba, Yuki Kataoka, Masae Torii, Hiroshi Yoshibayashi, Kazuhiko Yamagami, Ryuji Okamura, Yoshio Moriguchi, Hironori Kato, Shigeru Tsuyuki, Akira Yamauchi, Hirofumi Suwa, Takashi Inamoto, Satoru Miyano, Seishi Ogawa, and Masakazu Toi. 2020. ‘Genetic

and Clinical Landscape of Breast Cancers with Germline BRCA1/2 Variants’. Communications Biology 3 (1): 578. https://doi.org/10.1038/s42003-020-01301-9.
Inoue, Kazushi, and Elizabeth A Fry. 2017. ‘Haploinsufficient Tumor Suppressor Genes’, 30.
Kan, Zhengyan, Ying Ding, Jinho Kim, Hae Hyun Jung, Woosung Chung, Samir Lal, Soonweng Cho, Julio Fernandez-Banet, Se Kyung Lee, Seok Won Kim, Jeong Eon Lee, Yoon-La Choi, Shibing Deng, Ji-Yeon Kim, Jin Seok Ahn, Ying Sha, Xinmeng Jasmine Mu, Jae-Yong Nam, Young-Hyuck Im, Soohyeon Lee, Woong-Yang Park, Seok Jin Nam, and Yeon Hee Park. 2018. ‘Multi-Omics Profiling of Younger Asian Breast Cancers Reveals Distinctive Molecular Signatures’. Nature Communications 9 (1): 1725. https://doi.org/10.1038/s41467-018-04129-4.
Kim, Kyu-Tae, Hye Won Lee, Hae-Ock Lee, Hye Jin Song, Da Eun Jeong, Sang Shin, Hyunho Kim, Yoojin Shin, Do-Hyun Nam, Byong Chang Jeong, David G. Kirsch, Kyeung Min Joo, and Woong-Yang Park. 2016. ‘Application of Single-Cell RNA Sequencing in Optimizing a Combinatorial Therapeutic Strategy in Metastatic Renal Cell Carcinoma’. Genome Biology 17 (1): 80. https://doi.org/10.1186/s13059-016-0945-9.
Kim, Nayoung, Hong Kwan Kim, Kyungjong Lee, Yourae Hong, Jong Ho Cho, Jung Won Choi, Jung-Il Lee, Yeon-Lim Suh, Bo Mi Ku, Hye Hyeon Eum, Soyean Choi, Yoon-La Choi, Je-Gun Joung, Woong-Yang Park, Hyun Ae Jung, Jong-Mu Sun, Se-Hoon Lee, Jin Seok Ahn, Keunchil Park, Myung-Ju Ahn, and Hae-Ock Lee. 2020. ‘Single-Cell RNA Sequencing Demonstrates the Molecular and Cellular Reprogramming of Metastatic Lung Adenocarcinoma’. Nature Communications 11 (1): 2285. https://doi.org/10.1038/s41467-020-16164-1.
Kondo, Masahide, Shu-Ling Hoshi, Takeharu Yamanaka, Hiroshi Ishiguro, and Masakazu Toi. 2011. ‘Economic Evaluation of the 21-Gene Signature (Oncotype DX®) in Lymph Node-Negative/Positive, Hormone Receptor-Positive Early-Stage Breast Cancer Based on Japanese Validation Study (JBCRG-TR03)’. Breast Cancer Research and Treatment 127 (3): 739–49. https://doi.org/10.1007/s10549-010-1243-y.
Kumar, Sushant, Jonathan Warrell, Shantao Li, Patrick D. McGillivray, William Meyerson, Leonidas Salichos, Arif Harmanci, Alexander Martinez-Fundichely, Calvin W. Y. Chan, Morten Muhlig Nielsen, Lucas Lochovsky, Yan Zhang, Xiaotong Li, Shaoke Lou, Jakob Skou Pedersen, Carl Herrmann, Gad Getz, Ekta Khurana, and Mark B. Gerstein. 2020. ‘Passenger Mutations in More Than 2,500 Cancer Genomes: Overall Molecular Functional Impact and Consequences’. Cell 180 (5): 915-927.e16. https://doi.org/10.1016/j.cell.2020.01.032.
Kutmon, Martina, Martijn P. van Iersel, Anwesha Bohler, Thomas Kelder, Nuno Nunes, Alexander R. Pico, and Chris T. Evelo. 2015. ‘PathVisio 3: An Extendable Pathway Analysis Toolbox’. Edited by Robert F. Murphy. PLOS Computational Biology 11 (2): e1004085. https://doi.org/10.1371/journal.pcbi.1004085.
Kwei, Kevin A., Yvonne Kung, Keyan Salari, Ilona N. Holcomb, and Jonathan R. Pollack. 2010. ‘Genomic Instability in Breast Cancer: Pathogenesis and Clinical Implications’. Molecular Oncology 4 (3): 255–66. https://doi.org/10.1016/j.molonc.2010.04.001.

LaDuca, Holly, Eric C. Polley, Amal Yussuf, Lily Hoang, Stephanie Gutierrez, Steven N. Hart, Siddhartha Yadav, Chunling Hu, Jie Na, David E. Goldgar, Kelly Fulk, Laura Panos Smith, Carolyn Horton, Jessica Profato, Tina Pesaran, Chia-Ling Gau, Melissa Pronold, Brigette Tippin Davis, Elizabeth C. Chao, Fergus J. Couch, and Jill S. Dolinsky. 2020. ‘A Clinical Guide to Hereditary Cancer Panel Testing: Evaluation of Gene-Specific Cancer Associations and Sensitivity of Genetic Testing Criteria in a Cohort of 165,000 High-Risk Patients’. Genetics in Medicine 22 (2): 407–15. https://doi.org/10.1038/s41436-019-0633-8.
Lalloo, Fiona, Jennifer Varley, Anthony Moran, David Ellis, Lindsay O’Dair, Paul Pharoah, Antonis Antoniou, Roger Hartley, Andrew Shenton, Sheila Seal, Barbara Bulman, Anthony Howell, and D. Gareth R. Evans. 2006. ‘BRCA1, BRCA2 and TP53 Mutations in Very Early-Onset Breast Cancer with Associated Risks to Relatives’. European Journal of Cancer 42 (8): 1143–50. https://doi.org/10.1016/j.ejca.2005.11.032.
Langmead, Ben. 2010. ‘Aligning Short Sequencing Reads with Bowtie’. Current Protocols in Bioinformatics 32 (1). https://doi.org/10.1002/0471250953.bi1107s32.
Langmead, Ben, Cole Trapnell, Mihai Pop, and Steven L Salzberg. 2009. ‘Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome’. Genome Biology 10 (3): R25. https://doi.org/10.1186/gb-2009-10-3-r25.
Lee, M.-C. W., F. J. Lopez-Diaz, S. Y. Khan, M. A. Tariq, Y. Dayn, C. J. Vaske, A. J. Radenbaugh, H. J. Kim, B. M. Emerson, and N. Pourmand. 2014. ‘Single-Cell Analyses of Transcriptional Heterogeneity during Drug Tolerance Transition in Cancer Cells by RNA Sequencing’. Proceedings of the National Academy of Sciences 111 (44): E4726–35. https://doi.org/10.1073/pnas.1404656111.
Lex, Alexander, Christian Partl, Denis Kalkofen, Marc Streit, Samuel Gratzl, Anne Mai Wassermann, Dieter Schmalstieg, and Hanspeter Pfister. 2013. ‘Entourage: Visualizing Relationships between Biological Pathways Using Contextual Subsets’. IEEE Transactions on Visualization and Computer Graphics 19 (12): 2536–45. https://doi.org/10.1109/TVCG.2013.154.
Li, Jian, Aarif Mohamed Nazeer Batcha, Björn Gaining, and Ulrich R. Mansmann. 2015. ‘An NGS Workflow Blueprint for DNA Sequencing Data and Its Application in Individualized Molecular Oncology’. Cancer Informatics 14s5 (January): CIN.S30793. https://doi.org/10.4137/CIN.S30793.
Li, Xizhe, Xianyu Liu, Deze Zhao, Weifang Cui, Yingfang Wu, Chunfang Zhang, and Chaojun Duan. 2021. ‘TRNA-Derived Small RNAs: Novel Regulators of Cancer Hallmarks and Targets of Clinical Application’. Cell Death Discovery 7 (1): 249. https://doi.org/10.1038/s41420-021-00647-1.
Liao, Jiangquan, Jie Wang, Yongmei Liu, Jun Li, and Lian Duan. 2019. ‘Transcriptome Sequencing of LncRNA, MiRNA, MRNA and Interaction Network Constructing in Coronary Heart Disease’. BMC Medical Genomics 12 (1): 124. https://doi.org/10.1186/s12920-019-0570-z.
Liu, Sijia, Lina Gu, Nan Wu, Jiayu Song, Jiazhuo Yan, Shanshan Yang, Yue Feng, Zhao Wang, Le Wang, Yunyan Zhang, and Yan Jin. 2021. ‘Overexpression of DTL Enhances Cell Motility and Promotes Tumor Metastasis in Cervical Adenocarcinoma

by Inducing RAC1-JNK-FOXO1 Axis’. Cell Death & Disease 12 (10): 929. https://doi.org/10.1038/s41419-021-04179-5.
Lo, Pang-Kuo, and Saraswati Sukumar. 2008. ‘Epigenomics and Breast Cancer’. Pharmacogenomics 9 (12): 1879–1902. https://doi.org/10.2217/14622416.9.12.1879.
Ma, Sisi, Jiwen Ren, and David Fenyö. 2016. ‘Breast Cancer Prognostics Using Multi-Omics Data’, 8.
Ma, Xiaotu, Yu Liu, Yanling Liu, Ludmil B. Alexandrov, Michael N. Edmonson, Charles Gawad, Xin Zhou, Yongjin Li, Michael C. Rusch, John Easton, Robert Huether, Veronica Gonzalez-Pena, Mark R. Wilkinson, Leandro C. Hermida, Sean Davis, Edgar Sioson, Stanley Pounds, Xueyuan Cao, Rhonda E. Ries, Zhaoming Wang, Xiang Chen, Li Dong, Sharon J. Diskin, Malcolm A. Smith, Jaime M. Guidry Auvil, Paul S. Meltzer, Ching C. Lau, Elizabeth J. Perlman, John M. Maris, Soheil Meshinchi, Stephen P. Hunger, Daniela S. Gerhard, and Jinghui Zhang. 2018. ‘Pan-Cancer Genome and Transcriptome Analyses of 1,699 Paediatric Leukaemias and Solid Tumours’. Nature 555 (7696): 371–76. https://doi.org/10.1038/nature25795.
Marouf, Chaymaa, Omar Hajji, Amal Tazzite, Hassan Jouhadi, Abdellatif Benider, and Sellama Nadifi. 2017. ‘Germline Variants in the ATM Gene and Breast Cancer Susceptibility in Moroccan Women: A Meta-Analysis’. Egyptian Journal of Medical Human Genetics 18 (4): 329–34. https://doi.org/10.1016/j.ejmhg.2017.02.002.
Marx, Vivien. 2013. ‘Data Visualization: Ambiguity as a Fellow Traveler’. Nature Methods 10 (7): 613–15. https://doi.org/10.1038/nmeth.2530.
Mccain, Jack. n.d. ‘The Cancer Genome Atlas: New Weapon in Old War?’, 8.
McFarland, Christopher D., Julia A. Yaglom, Jonathan W. Wojtkowiak, Jacob G. Scott, David L. Morse, Michael Y. Sherman, and Leonid A. Mirny. 2017. ‘The Damaging Effect of Passenger Mutations on Cancer Progression’. Cancer Research 77 (18): 4763–72. https://doi.org/10.1158/0008-5472.CAN-15-3283-T.
Medical Genome Initiative, Christian R. Marshall, Shimul Chowdhury, Ryan J. Taft, Mathew S. Lebo, Jillian G. Buchan, Steven M. Harrison, Ross Rowsey, Eric W. Klee, Pengfei Liu, Elizabeth A. Worthey, Vaidehi Jobanputra, David Dimmock, Hutton M. Kearney, David Bick, Shashikant Kulkarni, Stacie L. Taylor, John W. Belmont, Dimitri J. Stavropoulos, and Niall J. Lennon. 2020. ‘Best Practices for the Analytical Validation of Clinical Whole-Genome Sequencing Intended for the Diagnosis of Germline Disease’. NPJ Genomic Medicine 5 (1): 47. https://doi.org/10.1038/s41525-020-00154-9.
Menyhárt, Otília, and Balázs Győrffy. 2021. ‘Multi-Omics Approaches in Cancer Research with Applications in Tumor Subtyping, Prognosis, and Diagnosis’. Computational and Structural Biotechnology Journal 19: 949–60. https://doi.org/10.1016/j.csbj.2021.01.009.
Milanez-Almeida, Pedro, Andrew J. Martins, Ronald N. Germain, and John S. Tsang. 2020. ‘Cancer Prognosis with Shallow Tumor RNA Sequencing’. Nature Medicine 26 (2): 188–92. https://doi.org/10.1038/s41591-019-0729-3.
Mu, Wenbo. 2019. ‘Detection of Structural Variation Using Target Captured Next-Generation Sequencing Data for Genetic Diagnostic Testing’. Genetics in Medicine 21 (7): 8.

Murugesan, Manikandan, and Kumpati Premkumar. 2021. ‘Systemic Multi-Omics Analysis Reveals Amplified P4HA1 Gene Associated With Prognostic and Hypoxic Regulation in Breast Cancer’. Frontiers in Genetics 12 (February): 632626. https://doi.org/10.3389/fgene.2021.632626.
Nguyen, Quang-Huy, Hung Nguyen, Tin Nguyen, and Duc-Hau Le. 2020. ‘Multi-Omics Analysis Detects Novel Prognostic Subgroups of Breast Cancer’. Frontiers in Genetics 11 (October): 574661. https://doi.org/10.3389/fgene.2020.574661.
Oh, Ji Hoon, Ho Hur, Ji-Yeon Lee, Yeejeong Kim, Younsoo Seo, and Myoung Hee Kim. 2017. ‘The Mitotic Checkpoint Regulator RAE1 Induces Aggressive Breast Cancer Cell Phenotypes by Mediating Epithelial-Mesenchymal Transition’. Scientific Reports 7 (1): 42256. https://doi.org/10.1038/srep42256.
Ohnstad, Hege O., Elin Borgen, Ragnhild S. Falk, Tonje G. Lien, Marit Aaserud, My Anh T. Sveli, Jon A. Kyte, Vessela N. Kristensen, Gry A. Geitvik, Ellen Schlichting, Erik A. Wist, Therese Sørlie, Hege G. Russnes, and Bjørn Naume. 2017. ‘Prognostic Value of PAM50 and Risk of Recurrence Score in Patients with Early-Stage Breast Cancer with Long-Term Follow-Up’. Breast Cancer Research 19 (1): 120. https://doi.org/10.1186/s13058-017-0911-9.
Paik, Soonmyung, Chungyeul Kim, Frederick L Baehner, Taesung Park, D Lawrence Wickerham, and Norman Wolmark. 2004. ‘A Multigene Assay to Predict Recurrence of Tamoxifen-Treated, Node-Negative Breast Cancer’. The New England Journal of Medicine, 10.
Partl, Christian, Alexander Lex, Marc Streit, Denis Kalkofen, Karl Kashofer, and Dieter Schmalstieg. 2013. ‘EnRoute: Dynamic Path Extraction from Biological Pathway Maps for Exploring Heterogeneous Experimental Datasets’. BMC Bioinformatics 14 (S19): S3. https://doi.org/10.1186/1471-2105-14-S19-S3.
PCAWG Drivers and Functional Interpretation Working Group, PCAWG Consortium, Matthew A. Reyna, David Haan, Marta Paczkowska, Lieven P. C. Verbeke, Miguel Vazquez, Abdullah Kahraman, Sergio Pulido-Tamayo, Jonathan Barenboim, Lina Wadi, Priyanka Dhingra, Raunak Shrestha, Gad Getz, Michael S. Lawrence, Jakob Skou Pedersen, Mark A. Rubin, David A. Wheeler, Søren Brunak, Jose M. G. Izarzugaza, Ekta Khurana, Kathleen Marchal, Christian von Mering, S. Cenk Sahinalp, Alfonso Valencia, Jüri Reimand, Joshua M. Stuart, and Benjamin J. Raphael. 2020. ‘Pathway and Network Analysis of More than 2500 Whole Cancer Genomes’. Nature Communications 11 (1): 729. https://doi.org/10.1038/s41467-020-14367-0.
Pećina-Šlaus, Nives, Anja Kafka, Iva Salamon, and Anja Bukovac. 2020. ‘Mismatch Repair Pathway, Genome Stability and Cancer’. Frontiers in Molecular Biosciences 7 (June): 122. https://doi.org/10.3389/fmolb.2020.00122.
Pu, Minya, Karen Messer, Sherri R. Davies, Tammi L. Vickery, Emily Pittman, Barbara A. Parker, Matthew J. Ellis, Shirley W. Flatt, Catherine R. Marinac, Sandahl H. Nelson, Elaine R. Mardis, John P. Pierce, and Loki Natarajan. 2020. ‘Research-Based PAM50 Signature and Long-Term Breast Cancer Survival’. Breast Cancer Research and Treatment 179 (1): 197–206. https://doi.org/10.1007/s10549-019-05446-y.
Qu, Zhonglin, Chng Wei Lau, Quang Vinh Nguyen, Yi Zhou, and Daniel R Catchpoole. 2019. ‘Visual Analytics of Genomic and Cancer Data: A Systematic Review’. Cancer

Informatics 18 (January): 117693511983554. https://doi.org/10.1177/1176935119835546.
Radovich, Milan, Curtis R. Pickering, Ina Felau, Gavin Ha, Hailei Zhang, Heejoon Jo, Katherine A. Hoadley, Pavana Anur, Jiexin Zhang, Mike McLellan, Reanne Bowlby, Thomas Matthew, Ludmila Danilova, Apurva M. Hegde, Jaegil Kim, Mark D. M. Leiserson, Geetika Sethi, Charles Lu, Michael Ryan, Xiaoping Su, Andrew D. Cherniack, Gordon Robertson, Rehan Akbani, Paul Spellman, John N. Weinstein, D. Neil Hayes, Ben Raphael, Tara Lichtenberg, Kristen Leraas, Jean Claude Zenklusen, Junya Fujimoto, Cristovam Scapulatempo-Neto, Andre L. Moreira, David Hwang, James Huang, Mirella Marino, Robert Korst, Giuseppe Giaccone, Yesim Gokmen-Polar, Sunil Badve, Arun Rajan, Philipp Ströbel, Nicolas Girard, Ming S. Tsao, Alexander Marx, Anne S. Tsao, Patrick J. Loehrer, Adrian Ally, Elizabeth L. Appelbaum, J. Todd Auman, Miruna Balasundaram, Saianand Balu, Madhusmita Behera, Rameen Beroukhim, Mario Berrios, Giovanni Blandino, Tom Bodenheimer, Moiz S. Bootwalla, Jay Bowen, Denise Brooks, Flavio M. Carcano, Rebecca Carlsen, Andre L. Carvalho, Patricia Castro, Lara Chalabreysse, Lynda Chin, Juok Cho, Gina Choe, Eric Chuah, Sudha Chudamani, Carrie Cibulskis, Leslie Cope, Matthew G. Cordes, Daniel Crain, Erin Curley, Timothy Defreitas, John A. Demchok, Frank Detterbeck, Noreen Dhalla, Hendrik Dienemann, W. Jeff Edenfield, Francesco Facciolo, Martin L. Ferguson, Scott Frazer, Catrina C. Fronick, Lucinda A. Fulton, Robert S. Fulton, Stacey B. Gabriel, Johanna Gardner, Julie M. Gastier-Foster, Nils Gehlenborg, Mark Gerken, Gad Getz, David I. Heiman, Shital Hobensack, Andrea Holbrook, Robert A. Holt, Alan P. Hoyle, Carolyn M. Hutter, Michael Ittmann, Stuart R. Jefferys, Corbin D. Jones, Steven J. M. Jones, Katayoon Kasaian, Jaegil Kim, Patrick K. Kimes, Phillip H. Lai, Peter W. Laird, Michael S. Lawrence, Pei Lin, Jia Liu, Laxmi Lolla, Yiling Lu, Yussanne Ma, Dennis T. Maglinte, David Mallery, Elaine R. Mardis, Marco A. Marra, Julie Martin, Michael Mayo, Sam Meier, Michael Meister, Shaowu Meng, Matthew Meyerson, Piotr A. Mieczkowski, Christopher A. Miller, Gordon B. Mills, Richard A. Moore, Scott Morris, Lisle E. Mose, Thomas Muley, Andrew J. Mungall, Karen Mungall, Rashi Naresh, Yulia Newton, Michael S. Noble, Taofeek Owonikoko, Joel S. Parker, Joseph Paulaskis, Robert Penny, Charles M. Perou, Corinne Perrin, Todd Pihl, Amie Radenbaugh, Suresh Ramalingam, Nilsa Ramirez, Ralf Rieker, Jeffrey Roach, Sara Sadeghi, Gordon Saksena, Jacqueline E. Schein, Heather K. Schmidt, Steven E. Schumacher, Candace Shelton, Troy Shelton, Yan Shi, Juliann Shih, Gabriel Sica, Henrique C. S. Silveira, Janae V. Simons, Payal Sipahimalani, Tara Skelly, Heidi J. Sofia, Matthew G. Soloway, Joshua Stuart, Qiang Sun, Angela Tam, Donghui Tan, Roy Tarnuzzer, Nina Thiessen, David J. Van Den Berg, Mohammad A. Vasef, Umadevi Veluvolu, Doug Voet, Vonn Walter, Yunhu Wan, Zhining Wang, Arne Warth, Cleo-Aron Weis, Daniel J. Weisenberger, Matthew D. Wilkerson, Lisa Wise, Tina Wong, Hsin-Ta Wu, Ye Wu, Liming Yang, Jiashan Zhang, and Erik Zmuda. 2018. ‘The Integrated Genomic Landscape of Thymic Epithelial Tumors’. Cancer Cell 33 (2): 244-258.e10. https://doi.org/10.1016/j.ccell.2018.01.003.
Reimand, Jüri, Tambet Arak, Priit Adler, Liis Kolberg, Sulev Reisberg, Hedi Peterson, and Jaak Vilo. 2016. ‘G:Profiler—a Web Server for Functional Interpretation of Gene

Lists (2016 Update)’. Nucleic Acids Research 44 (W1): W83–89. https://doi.org/10.1093/nar/gkw199.
Ren, Lili, Junyi Li, Chuhan Wang, Zheqi Lou, Shuangshu Gao, Lingyu Zhao, Shuoshuo Wang, Anita Chaulagain, Minghui Zhang, Xiaobo Li, and Jing Tang. 2021. ‘Single Cell RNA Sequencing for Breast Cancer: Present and Future’. Cell Death Discovery 7 (1): 104. https://doi.org/10.1038/s41420-021-00485-1.
Renault, Anne-Laure, Fabienne Lesueur, Yan Coulombe, Stéphane Gobeil, Penny Soucy, Yosr Hamdi, Sylvie Desjardins, Florence Le Calvez-Kelm, Maxime Vallée, Catherine Voegele, The Breast Cancer Family Registry, John L. Hopper, Irene L. Andrulis, Melissa C. Southey, Esther M. John, Jean-Yves Masson, Sean V. Tavtigian, and Jacques Simard. 2016. ‘ABRAXAS (FAM175A) and Breast Cancer Susceptibility: No Evidence of Association in the Breast Cancer Family Registry’. Edited by Alvaro Galli. PLOS ONE 11 (6): e0156820. https://doi.org/10.1371/journal.pone.0156820.
Roberti, Annalisa, Adolfo F. Valdes, Ramón Torrecillas, Mario F. Fraga, and Agustin F. Fernandez. 2019. ‘Epigenetics in Cancer Therapy and Nanomedicine’. Clinical Epigenetics 11 (1): 81. https://doi.org/10.1186/s13148-019-0675-4.
Roberts, Maegan E., Sarah A. Jackson, Lisa R. Susswein, Nur Zeinomar, Xinran Ma, Megan L. Marshall, Amy R. Stettner, Becky Milewski, Zhixiong Xu, Benjamin D. Solomon, Mary Beth Terry, Kathleen S. Hruska, Rachel T. Klein, and Wendy K. Chung. 2018. ‘MSH6 and PMS2 Germ-Line Pathogenic Variants Implicated in Lynch Syndrome Are Associated with Breast Cancer’. Genetics in Medicine 20 (10): 1167–74. https://doi.org/10.1038/gim.2017.254.
Rybin, Matthew J., Melina Ramic, Natalie R. Ricciardi, Philipp Kapranov, Claes Wahlestedt, and Zane Zeier. 2021. ‘Emerging Technologies for Genome-Wide Profiling of DNA Breakage’. Frontiers in Genetics 11 (January): 610386. https://doi.org/10.3389/fgene.2020.610386.
Sadeghi, Fatemeh, Marzieh Asgari, Mojdeh Matloubi, Maral Ranjbar, Nahid Karkhaneh Yousefi, Tahereh Azari, and Majid Zaki-Dizaji. 2020. ‘Molecular Contribution of BRCA1 and BRCA2 to Genome Instability in Breast Cancer Patients: Review of Radiosensitivity Assays’. Biological Procedures Online 22 (1): 23. https://doi.org/10.1186/s12575-020-00133-5.
Sahu B., Dash S., Mohanty S. N., Rout S. K., Ensemble comparative study for diagnosis of breast cancer datasets, International Journal of Engineering & Technology, Vol.7(4.15), pp 281-285, 2018.
Santana dos Santos, Elizabeth, François Lallemand, Ambre Petitalot, Sandrine M. Caputo, and Etienne Rouleau. 2020. ‘HRness in Breast and Ovarian Cancers’. International Journal of Molecular Sciences 21 (11): 3850. https://doi.org/10.3390/ijms21113850.
Sarda, Shrutii, and Sridhar Hannenhalli. 2014. ‘Next-Generation Sequencing and Epigenomics Research: A Hammer in Search of Nails’. Genomics & Informatics 12 (1): 2. https://doi.org/10.5808/GI.2014.12.1.2.
Schultz, David J., Abirami Krishna, Stephany L. Vittitow, Negin Alizadeh-Rad, Penn Muluhngwi, Eric C. Rouchka, and Carolyn M. Klinge. 2018. ‘Transcriptomic Response of Breast Cancer Cells to Anacardic Acid’. Scientific Reports 8 (1): 8063. https://doi.org/10.1038/s41598-018-26429-x.

Serratì, Simona, Simona De Summa, Brunella Pilato, Daniela Petriella, Rosanna Lacalamita, Stefania Tommasi, and Rosamaria Pinto. 2016. ‘Next-Generation Sequencing: Advances and Applications in Cancer Diagnosis’. OncoTargets and Therapy Volume 9 (December): 7355–65. https://doi.org/10.2147/OTT.S99807.
Setton, Jeremy, Pier Selenica, Semanti Mukherjee, Rachna Shah, Isabella Pecorari, Biko McMillan, Isaac X. Pei, Yelena Kemel, Ozge Ceyhan-Birsoy, Margaret Sheehan, Kaitlyn Tkachuk, David N. Brown, Liying Zhang, Karen Cadoo, Simon Powell, Britta Weigelt, Mark Robson, Nadeem Riaz, Kenneth Offit, Jorge S. Reis-Filho, and Diana Mandelker. 2021. ‘Germline RAD51B Variants Confer Susceptibility to Breast and Ovarian Cancers Deficient in Homologous Recombination’. NPJ Breast Cancer 7 (1): 135. https://doi.org/10.1038/s41523-021-00339-0.
Shinde, Shivani S., Michele R. Forman, Henry M. Kuerer, Kai Yan, Florentia Peintinger, Kelly K. Hunt, Gabriel N. Hortobagyi, Lajos Pusztai, and W. Fraser Symmans. 2010. ‘Higher Parity and Shorter Breastfeeding Duration: Association with Triple-Negative Phenotype of Breast Cancer’. Cancer 116 (21): 4933–43. https://doi.org/10.1002/cncr.25443.
Shyr, Derek, and Qi Liu. 2013. ‘Next Generation Sequencing in Cancer Research and Clinical Application’, 11.
Singh, Abhishek A., Karianne Schuurman, Ekaterina Nevedomskaya, Suzan Stelloo, Simon Linder, Marjolein Droog, Yongsoo Kim, Joyce Sanders, Henk van der Poel, Andries M. Bergman, Lodewyk F. A. Wessels, and Wilbert Zwart. 2019. ‘Optimized ChIP-Seq Method Facilitates Transcription Factor Profiling in Human Tumors’. Life Science Alliance 2 (1): e201800115. https://doi.org/10.26508/lsa.201800115.
Smith, C. C., L. M. Bixby, K. L. Miller, S. R. Selitsky, D. S. Bortone, K. A. Hoadley, B. G. Vincent, and J. S. Serody. 2020. ‘Using RNA Sequencing to Characterize the Tumor Microenvironment’. In Biomarkers for Immunotherapy of Cancer, edited by Magdalena Thurin, Alessandra Cesano, and Francesco M. Marincola, 2055:245–72. Methods in Molecular Biology. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4939-9773-2_12.
Stolarova, Lenka, Petra Kleiblova, Marketa Janatova, Jana Soukupova, Petra Zemankova, Libor Macurek, and Zdenek Kleibl. 2020. ‘CHEK2 Germline Variants in Cancer Predisposition: Stalemate Rather than Checkmate’. Cells 9 (12): 2675. https://doi.org/10.3390/cells9122675.
Stratton, Michael R., Peter J. Campbell, and P. Andrew Futreal. 2009. ‘The Cancer Genome’. Nature 458 (7239): 719–24. https://doi.org/10.1038/nature07943.
Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. 2005. ‘Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles’. Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.
Sukumar, Jasmine, Mahmoud Kassem, Doreen Agnese, Robert Pilarski, Bhuvaneswari Ramaswamy, Kevin Sweet, and Sagar Sardesai. 2021. ‘Concurrent Germline BRCA1, BRCA2, and CHEK2 Pathogenic Variants in Hereditary Breast Cancer: A Case Series’. Breast Cancer Research and Treatment 186 (2): 569–75. https://doi.org/10.1007/s10549-021-06095-w.


Amel Elbasyouni, Leila Saadi and Abdelkarim Baha

Sweet, Thomas J., and Angela H. Ting. 2016. ‘Women in cancer thematic review: Diverse Functions of DNA Methylation: Implications for Prostate Cancer and Beyond’. Endocrine-Related Cancer 23 (11): T169–78. https://doi.org/10.1530/ERC-16-0306.
Tan, Hua, Jiguang Bao, and Xiaobo Zhou. 2012. ‘A Novel Missense-Mutation-Related Feature Extraction Scheme for “Driver” Mutation Identification’. Bioinformatics 28 (22): 2948–55. https://doi.org/10.1093/bioinformatics/bts558.
Tate, John G., Sally Bamford, Harry C Jubb, Zbyslaw Sondka, David M Beare, Nidhi Bindal, Harry Boutselakis, Charlotte G. Cole, Celestino Creatore, Elisabeth Dawson, Peter Fish, Bhavana Harsha, Charlie Hathaway, Steve C. Jupe, Chai Yin Kok, Kate Noble, Laura Ponting, Christopher C Ramshaw, Claire E. Rye, Helen E. Speedy, Ray Stefancsik, Sam L. Thompson, Shicai Wang, Sari Ward, Peter J. Campbell, and Simon A. Forbes. 2019. ‘COSMIC: The Catalogue of Somatic Mutations in Cancer’. Nucleic Acids Research 47 (D1): D941–47. https://doi.org/10.1093/nar/gky1015.
Taylor, Amy, Angela F Brady, Ian M. Frayling, Helen Hanson, Marc Tischkowitz, Clare Turnbull, and Lucy Side. 2018. ‘Consensus for Genes to Be Included on Cancer Panel Tests Offered by UK Genetics Services: Guidelines of the UK Cancer Genetics Group’. Journal of Medical Genetics 55 (6): 372–77. https://doi.org/10.1136/jmedgenet-2017-105188.
Thankachan, Aswathy, and Mr Bino Thomas. 2018. ‘A Study of Next Generation Sequencing Data, Workflow, Application and Platform Comparison’. IOP Conference Series: Materials Science and Engineering 396 (August): 012031. https://doi.org/10.1088/1757-899X/396/1/012031.
The International Cancer Genome Consortium. 2010. ‘International Network of Cancer Genome Projects’. Nature 464 (7291): 993–98. https://doi.org/10.1038/nature08987.
Thu, Kl, I Soria-Bretones, Tw Mak, and Dw Cescon. 2018. ‘Targeting the Cell Cycle in Breast Cancer: Towards the next Phase’. Cell Cycle 17 (15): 1871–85. https://doi.org/10.1080/15384101.2018.1502567.
Tian, Sun, Paul Roepman, Laura J van’t Veer, Rene Bernards, Femke De Snoo, and Annuska M Glas. 2010. ‘Biological Functions of the Genes in the Mammaprint Breast Cancer Profile Reflect the Hallmarks of Cancer’. Biomarker Insights 5 (January): BMI.S6184. https://doi.org/10.4137/BMI.S6184.
Toh, Ming Ren, Siao Ting Chong, Sock Hoai Chan, Chen Ee Low, Nur Diana Binte Ishak, Jing Quan Lim, Eliza Courtney, and Joanne Ngeow. 2019. ‘Functional Analysis of Clinical BARD1 Germline Variants’. Molecular Case Studies 5 (4): a004093. https://doi.org/10.1101/mcs.a004093.
Torri, Federica, Ivo D. Dinov, Alen Zamanyan, Sam Hobel, Alex Genco, Petros Petrosyan, Andrew P. Clark, Zhizhong Liu, Paul Eggert, Jonathan Pierce, James A. Knowles, Joseph Ames, Carl Kesselman, Arthur W. Toga, Steven G. Potkin, Marquis P. Vawter, and Fabio Macciardi. 2012. ‘Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows’. Genes 3 (3): 545–75. https://doi.org/10.3390/genes3030545.
Turkay, Cagatay, Alexander Lex, Marc Streit, Hanspeter Pfister, and Helwig Hauser. 2014. ‘Characterizing Cancer Subtypes Using Dual Analysis in Caleydo StratomeX’. IEEE Computer Graphics and Applications 34 (2): 38–47. https://doi.org/10.1109/MCG.2014.1.


Vidal, E., S. Sayols, S. Moran, A. Guillaumet-Adkins, M. P. Schroeder, R. Royo, M. Orozco, M. Gut, I. Gut, N. Lopez-Bigas, H. Heyn, and M. Esteller. 2017. ‘A DNA Methylation Map of Human Cancer at Single Base-Pair Resolution’. Oncogene 36 (40): 5648–57. https://doi.org/10.1038/onc.2017.176.
Volkova, Nadezda V., Bettina Meier, Víctor González-Huici, Simone Bertolini, Santiago Gonzalez, Harald Vöhringer, Federico Abascal, Iñigo Martincorena, Peter J. Campbell, Anton Gartner, and Moritz Gerstung. 2020. ‘Mutational Signatures Are Jointly Shaped by DNA Damage and Repair’. Nature Communications 11 (1): 2169. https://doi.org/10.1038/s41467-020-15912-7.
Wang, Mei, Daniel Klevebring, Johan Lindberg, Kamila Czene, Henrik Grönberg, and Mattias Rantalainen. 2016. ‘Determining Breast Cancer Histological Grade from RNA-Sequencing Data’. Breast Cancer Research 18 (1): 48. https://doi.org/10.1186/s13058-016-0710-8.
Wang, Ye, Michael Mashock, Zhuang Tong, Xiaofeng Mu, Hong Chen, Xin Zhou, Hong Zhang, Gexin Zhao, Bin Liu, and Xinmin Li. 2020. ‘Changing Technologies of RNA Sequencing and Their Applications in Clinical Oncology’. Frontiers in Oncology 10 (April): 447. https://doi.org/10.3389/fonc.2020.00447.
Wang, Zhong, Mark Gerstein, and Michael Snyder. 2009. ‘RNA-Seq: A Revolutionary Tool for Transcriptomics’. Nature Reviews Genetics 10 (1): 57–63. https://doi.org/10.1038/nrg2484.
Woerner, Audrey C., Renata C. Gallagher, Jerry Vockley, and Aashish N. Adhikari. 2021. ‘The Use of Whole Genome and Exome Sequencing for Newborn Screening: Challenges and Opportunities for Population Health’. Frontiers in Pediatrics 9 (July): 663752. https://doi.org/10.3389/fped.2021.663752.
Wyrick, John J., and Steven A. Roberts. 2015. ‘Genomic Approaches to DNA Repair and Mutagenesis’. DNA Repair 36 (December): 146–55. https://doi.org/10.1016/j.dnarep.2015.09.018.
Yeo, Syn Kok, Xiaoting Zhu, Takako Okamoto, Mingang Hao, Cailian Wang, Peixin Lu, Long Jason Lu, and Jun-Lin Guan. 2020. ‘Single-Cell RNA-Sequencing Reveals Distinct Patterns of Cell State Heterogeneity in Mouse Models of Breast Cancer’. ELife 9 (August): e58810. https://doi.org/10.7554/eLife.58810.
Yersal, Ozlem. 2014. ‘Biological Subtypes of Breast Cancer: Prognostic and Therapeutic Implications’. World Journal of Clinical Oncology 5 (3): 412. https://doi.org/10.5306/wjco.v5.i3.412.
Yoo, Seong-Keun, Young Shin Song, Eun Kyung Lee, Jinha Hwang, Hwan Hee Kim, Gyeongseo Jung, Young A Kim, Su-jin Kim, Sun Wook Cho, Jae-Kyung Won, Eun-Jae Chung, Jong-Yeon Shin, Kyu Eun Lee, Jong-Il Kim, Young Joo Park, and Jeong-Sun Seo. 2019. ‘Integrative Analysis of Genomic and Transcriptomic Characteristics Associated with Progression of Aggressive Thyroid Cancer’. Nature Communications 10 (1): 2764. https://doi.org/10.1038/s41467-019-10680-5.
Yuan, Jiao, Kevin H. Kensler, Zhongyi Hu, Youyou Zhang, Tianli Zhang, Junjie Jiang, Mu Xu, Yutian Pan, Meixiao Long, Kathleen T. Montone, Janos L. Tanyi, Yi Fan, Rugang Zhang, Xiaowen Hu, Timothy R. Rebbeck, and Lin Zhang. 2020. ‘Integrative Comparison of the Genomic and Transcriptomic Landscape between Prostate Cancer Patients of Predominantly African or European Genetic Ancestry’. Edited by Peter McKinnon. PLOS Genetics 16 (2): e1008641. https://doi.org/10.1371/journal.pgen.1008641.
Zhang, J., J. Liu, J. Sun, C. Chen, G. Foltz, and B. Lin. 2014. ‘Identifying Driver Mutations from Sequencing Data of Heterogeneous Tumors in the Era of Personalized Genome Sequencing’. Briefings in Bioinformatics 15 (2): 244–55. https://doi.org/10.1093/bib/bbt042.
Zheng, Fang, Le Wei, Liang Zhao, and FuChuan Ni. 2018. ‘Pathway Network Analysis of Complex Diseases Based on Multiple Biological Networks’. BioMed Research International 2018 (July): 1–12. https://doi.org/10.1155/2018/5670210.

Chapter 2

Next-Generation Sequencing and Omics Data Analysis Techniques

Charles Oluwaseun Adetunji1,*, Frank Abimbola Ogundolie2, Olugbemi Tope Olaniyan3, Omosigho Omoruyi Pius1, Kehinde Kazeem Kanmodi4 and Lawrence Achilles Nnyanzi4

1Applied Microbiology, Biotechnology and Nanotechnology Laboratory, Department of Microbiology, Edo University Iyamho, Auchi, Edo State, Nigeria
2Department of Biotechnology, Baze University Abuja, Nigeria
3Laboratory for Reproductive Biology and Developmental Programming, Department of Physiology, Edo University Iyamho, Nigeria
4School of Health and Life Sciences, Teesside University, Middlesbrough, United Kingdom

Abstract

Next-generation sequencing and omics techniques cut across different fields of science, such as the pharmaceutical industry, gene therapy applications, therapeutics and diagnostics, disease prevention and pharmacogenomics, developmental biology, comparative genomics, and evolutionary genomics. Progress in next-generation sequencing technologies is revolutionizing research on genomes (single nucleotide polymorphisms, loss-of-heterozygosity variants, copy number variants, genomic rearrangements, rare variants), epigenomes (DNA methylation, chromatin accessibility, histone modifications, transcription factor binding) and transcriptomes (alternative splicing, gene expression, small RNAs, and long non-coding RNAs). Different tools are available for the integration of omics datasets, including computational tools and web-based tools. This chapter therefore provides detailed information on next-generation sequencing and omics data analysis techniques.

* Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9 © 2023 Nova Science Publishers, Inc.

Keywords: next-generation sequencing, omics techniques, gene therapy applications, therapeutics and diagnostics, disease prevention and pharmacogenomics, developmental biology, comparative genomics

Introduction

Omics data analysis is the process of extracting meaning from big data sets. It uses very large volumes of data to predict and prevent diseases and to discover biomarkers from the complete gene profile of an organism. The large data sets obtained from high-throughput sequencing (HTS) require robust approaches to validate and make sense of the sequences. In this era of big data, scientists can use omics data analysis to predict disease and disease recurrence. The analysis draws on genomic datasets and on other data types, such as transcriptomics, proteomics, and metabolomics, for preventive medicine predictions, drug discovery (Fadare et al., 2021) and development, precision medicine, survival analysis (Kaur et al., 2021; Adetunji et al., 2022a-j; Olaniyan et al., 2022a, b; Oyedara et al., 2022) and biomarker discovery.

Another aspect of omics analysis is meta-omics, which has proved useful in several industries, especially agriculture and food. In the food industry, meta-omics has been applied to explore the safety and quality of nutrition (Chaillou et al., 2015; Kergourlay et al., 2015; Tan et al., 2015; Thoendel et al., 2016) and to understand and analyze fermented food products such as yogurt, cheese and other dairy products (Chen et al., 2017). Another application of these techniques is grapevine meta-omics analysis, where the use of several bioinformatics tools to analyze HTS data has given insight into the diseases associated with this economically and industrially valuable plant. Omics studies of grapevine have made it possible to improve plant yield, to produce disease-resistant plants, or even to induce disease in order to understand the gene response to that state (Alaimo et al., 2018).

The use of machine learning (ML) and deep learning has further increased prediction accuracy and made the management of big data more effective. In the medical field, ML is becoming a popular method of analysis because of its capability to detect key features in complex datasets. Picard et al. (2021) and Reel et al. (2021) reviewed the integrated strategies that can be applied when using ML for multi-omics data analysis. Oh et al. (2020) reviewed ML-based analysis of HTS datasets for disease prediction, clinical management, and survival analysis using multi-omics datasets. Other approaches that use big data for predictive analysis include radiomics, which involves omics imaging data (Kaur et al., 2021), toxicogenomics (Verheijen et al., 2020), and metagenomics of microbial communities (Joyce and Palsson, 2006; Segata et al., 2013). Omics datasets have also been analyzed for various brain disorders (Cui et al., 2021; Dong et al., 2021), neurodegenerative diseases (Nomura et al., 2021), and the transcriptomics and epigenomics of brain development (Jourdon et al., 2021).

Although conventional protein purification has led to the discovery of new proteins and enzymes with desired properties for various industrial applications (Ogundolie, 2015; Ayodeji et al., 2017; Ogundolie, 2021; Ogundolie et al., 2022; Adetunji et al., 2022m-p; Adetunji et al., 2023a, b), the application of NGS and omics dataset analysis has given rise to today's enzyme technology, which is yielding improved enzymes through various genetic engineering techniques and gene modification trials for various industrial applications.

Recent Advances in the Application of Next-Generation Sequencing and Omics Data Analysis Techniques

Biswapriya et al. (2019) reported that evaluating biological data with high-throughput omics techniques such as genomics, proteomics, metabolomics and transcriptomics can generate huge files, as in integrated omics approaches (poly-omics, multi-omics, pan-omics and trans-omics). With high-throughput technologies, several million proteins or nucleotides can be analyzed and sequenced, generating large data files at reduced cost and helping to explain the molecular basis of diverse phenotypes and genetic codes.

Balamurugan et al. (2019) showed how progress in the application of next-generation sequencing has helped to improve food security and safety. The authors stated that bioinformatics tools were combined with next-generation sequencing to isolate and compare individual genes or strains using whole genome sequencing, single nucleotide polymorphisms, and multi-locus sequence typing. In addition, different metagenomics techniques are being used for the identification of microbial populations, diagnostics, antimicrobial resistance, outbreak investigations, food authenticity and forensics, so as to improve food safety.

Theodore and Andrew (2015) highlighted the role of next-generation sequencing in the prevention and management of infectious diseases through multiscale data analysis. The authors applied these techniques to the analysis of clinical samples, creating multiscale predictive models. In the clinical microbiology investigation of infectious diseases, routine analysis of microbial genomes using next-generation sequencing will improve clinical outcomes, save cost, enhance accuracy and save time.

Chien-Yueh et al. (2013) demonstrated that the numerous advantages of next-generation sequencing make it the technology of choice for several high-throughput analyses. These advantages include cost-effectiveness, high resolution, accuracy in genomic analyses and unprecedented sequencing speed. The authors revealed that next-generation sequencing employs techniques such as whole genome sequencing, gene expression profiling, targeted sequencing, small RNA sequencing and chromatin immunoprecipitation sequencing to comprehensively analyze biological data for genomic identification, drug discovery, disease diagnosis and genetic testing. Next-generation sequencing has numerous advantages over microarray techniques, such as improved accuracy and higher throughput, and it supports applications such as whole genome sequencing, re-sequencing, de novo assembly, and transcriptome sequencing at the RNA or DNA level, which can be used to predict proteins, genes, signaling molecules, pathways and enzymes.

The rapid explosion of data witnessed recently has been attributed to advances in next-generation sequencing and bioinformatics tools. Emerging tools in molecular biology are creating opportunities for molecular analysis of the genomes of pathogenic organisms for disease prevention, surveillance and therapeutic drug discovery. The recent development of next-generation sequencing and its deployment in the investigation of infectious diseases such as Ebola, Zika, Lassa fever and the novel coronavirus SARS-CoV-2 has received massive support from scientists. Next-generation sequencing can provide broad, high-resolution information on pathogen genomes for diagnostics, surveillance, research and drug formulation.

Jerzy (2015) showed that in next-generation sequencing, RNA, DNA and methylation samples can be used for biological analysis to produce large-scale data at much reduced cost and time. The author noted that, compared with traditional first-generation Sanger sequencing, next-generation sequencing is more efficient in cost and in gigabase-scale high-throughput data production, and provides more understanding in the fields of systeomics, proteomics, genomics and transcriptomics.

Shanrong et al. (2017) revealed that, owing to its low cost, the rapid evolution of next-generation sequencing technologies is reshaping several aspects of genomics, particularly drug design, drug development and biological research. The authors noted that the large amounts of data generated through next-generation sequencing are being incorporated into computational techniques and cloud computing for easy analysis, privacy, storage, security, and transfer. Also, in pharmacogenomics, where drug response, toxicity and efficacy in health and disease are evaluated using clinical biomarkers and SNP technology for personalized medicine, next-generation sequencing provides approaches that can identify genes, variants, and mutations. It can therefore be suggested that next-generation sequencing facilitates precision medicine through prognosis, therapy, diagnosis, and the analysis of massive amounts of data.

Guohua et al. (2015) showed that the plethora of data emerging from biological sources can be a challenge in genomics research. Recently, large-scale data generation by next-generation sequencing has triggered the development of bioinformatics tools for improved visualization, analysis, and interpretation of data.

Xiaohong et al. (2015) conducted research on Anopheles gambiae to explore genetic variation and the molecular basis of traits using next-generation sequencing technologies. In their study, they discovered several large linkage disequilibrium (LD) blocks around the para gene of the mosquito. The authors revealed that the genetic makeup of the mosquito predisposes it to insecticide resistance and to P. falciparum infection.
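As a brief illustration of the LD (linkage disequilibrium) blocks mentioned above, the sketch below computes the standard r² statistic between two biallelic loci from phased haplotypes. It is not the method used by Xiaohong et al. (2015); the function name `ld_r2`, the 0/1 allele coding and the toy haplotypes are assumptions made purely for illustration.

```python
def ld_r2(haplotypes):
    """Return r^2 between two biallelic loci.

    haplotypes: list of (a, b) tuples, one per phased haplotype, where a and b
    are the alleles (coded 0 or 1) at the first and second locus.
    This coding is an illustrative assumption, not a format from the chapter.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


# Toy data (hypothetical): alleles always travel together vs. fully independent.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_r2(linked))    # 1.0
print(ld_r2(unlinked))  # 0.0
```

A run of neighbouring marker pairs with r² close to 1 is what is usually described as an LD block; real analyses would normally rely on dedicated population-genetics software rather than a hand-written function.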


Feng et al. (2015) studied the importance of next-generation sequencing technologies such as RNA-seq in producing large data sets to uncover alternative splicing isoforms and splice sites, using bioinformatics techniques, for different biological functions such as gene expression, epigenetics, protein diversity, and systems biology. The authors demonstrated that splice sites can be used to predict the position of intron/exon structures, site features, and recognition.

Stéphane et al. (2013) reported on the role of functional omics techniques and next-generation sequencing in systems biology and bioinformatics. The authors noted that living organisms are made up of molecules such as RNA, DNA, lipids, proteins, and metabolites, which serve as building blocks and signaling markers for various functions. Information is encoded in these biomolecules and can be analyzed through computational and high-throughput technologies.

Xinkun (2018) reported the approaches involved in the analysis of molecular information in genetic components through DNA sequencing, recombinants, and PCR. Today, whole genome sequencing of bacteria and viruses has been achieved, which is changing the landscape of bioinformatics and cloud computing applications that use next-generation sequencing to generate large volumes of quality data.

Chen et al. (2016) also revealed how data generated through omics technologies can be used to study genetic pathways and disease mechanisms in cells using dimension reduction approaches. The authors further demonstrated that exploratory analysis of next-generation sequencing data sets could provide unprecedented knowledge of biological systems.

Jun et al. (2021) reported that the emergence of next-generation sequencing will transform biomedical research. The authors revealed that large data sets from genomic, proteomic, transcriptomic, metabolomic and epigenomic research derived from humans, cell lines, and animals have resulted in unprecedented opportunities in diverse biomedical applications.

Bao et al. (2014) highlighted the usefulness of next-generation sequencing approaches such as exome sequencing in the study of diseases. The technique is cost-effective and involves several bioinformatics steps, such as raw data quality assessment, alignment, pre-processing, variant analysis, big data management and post-processing.

Po-Yen et al. (2014) discussed the relevance of next-generation sequencing in personalized cardiovascular disease care and clinical diagnosis. Using the risk factors of cardiovascular diseases such as atherosclerosis and coronary artery disease, prevention and prediction can be made using omics data from biomarkers (Behera et al., 2016; Dash et al., 2017; Dash and Abraham, 2018; Rahman et al., 2018; Sahu et al., 2018; Dash et al., 2019; Dash et al., 2020; Dash et al., 2021; Rahman et al., 2021). The authors highlighted that next-generation techniques such as RNA-seq and ChIP-seq are promising tools for enhancing the understanding of the molecular mechanisms of disease, improving the development of personalized care, and reducing mortality rates through the identification of risk factors.

Mrozek et al. (2021) showed that computational tools support and improve the quality of next-generation sequencing and multi-omics data for molecular profiling. Rute et al. (2020) reported that diagnosis through clinical genetics has had a much-needed influence on the health care system. Several thousand genes can be analyzed using next-generation sequencing at unprecedented speed.

Tiziana and Grazia (2020) reported that bioinformatics applications in next-generation sequencing have impacted the agricultural sector, particularly food production. The development of different databases of biological systems and network systems has been crucial in food production, microbiology, and fermentation. Today, omics research has developed the use of nanobiomaterials (Adetunji et al., 2022k, l).

den Besten et al. (2018) also reported on the role of next-generation sequencing in public health risk assessment and food safety. The authors revealed that phenotypic data provided mechanistic cellular information and exposure assessment through omics techniques and next-generation sequencing.

Mira et al. (2019) demonstrated the role of multi-omics analysis, such as proteomics and small RNA and mRNA transcriptomics, in the clinical understanding of the molecular pathogenesis of kidney fibrosis. The lower cost of running next-generation sequencing has resulted in data generation for diagnostic, research, forensic and bioinformatics analysis. This technology is changing the landscape of genomic research in clinical settings. The authors revealed that many improvements are still needed, for example in data accessibility, interpretation and processing. Adequate knowledge of bioinformatics is required to select the appropriate algorithms or platform.
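Several of the studies above refer to normalization and dimension reduction (for example Chen et al., 2016) without showing what these steps look like. The following is a minimal sketch, not taken from any of the cited works: it applies a log2 counts-per-million transform to a toy count matrix and then projects the samples onto the first principal component using power iteration. The function names, the pseudocount of 1 and the toy counts are all hypothetical.

```python
import math


def log_cpm(counts):
    """Convert raw counts (samples x genes) to log2(counts-per-million + 1)."""
    normalized = []
    for sample in counts:
        total = sum(sample)
        if total == 0:
            raise ValueError("sample has no reads")
        normalized.append([math.log2(c / total * 1e6 + 1.0) for c in sample])
    return normalized


def first_pc_scores(matrix, iterations=200):
    """Project each sample onto the first principal component (power iteration)."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in matrix]
    # Arbitrary non-uniform start vector (an assumption); it only needs to be
    # non-orthogonal to the leading component.
    v = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        scores = [sum(r[j] * v[j] for j in range(m)) for r in centered]
        # w = X^T (X v): one multiplication by the unscaled covariance matrix.
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left along v
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(m)) for r in centered]


# Toy matrix (hypothetical): two samples dominated by gene 0, two by gene 2.
toy_counts = [[900, 50, 50], [880, 60, 60], [50, 50, 900], [60, 60, 880]]
scores = first_pc_scores(log_cpm(toy_counts))
print([round(s, 2) for s in scores])
```

In this toy case the first two samples receive scores of one sign and the last two the opposite sign, which is the kind of grouping an exploratory analysis looks for. The sign itself is arbitrary, and in practice an established numerical library would be used instead of hand-written power iteration.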

Conclusion

This chapter has provided detailed information on next-generation sequencing and omics data analysis techniques that could facilitate data cleaning, biomolecule identification, normalization, data dimensionality reduction, statistical validation, biological contextualization, data handling and storage, and data archiving and sharing. These computational and informatics techniques could also play a significant role in reducing processing time and cost, and in generating more data on omics such as glycomics, microbiomics, phenomics and lipidomics.

References Adetunji CO, Nwankwo W, Olayinka AS, Olugbemi OT, Akram M, Laila U, Samuel MO, Oshinjo AM, Adetunji JB, Okotie GE and Esiobu ND. (2022a). Computational Intelligence Techniques for Combating COVID-19. https://www.taylorfrancis.com/ chapters/edit/10.1201/9781003178903-16/computational-intelligence-techniquescombating-covid-19-charles-oluwaseun-adetunji-wilson-nwankwo-akinola-samsonolayinka-olaniyan-tope-olugbemi-muhammad-akram-umme-laila-michaelolugbenga-samuel-ayomide-michael-oshinjo-juliana-bunmi-adetunji-gloria-okotienwadiuto-diuto-esiobu. In book: In Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics, (pp. 251-269). CRC Press. Adetunji CO, Nwankwo W, Olayinka AS, Olugbemi OT, Akram M, Laila U, Olugbenga MS, Oshinjo AM, Adetunji JB, Okotie GE and Esiobu ND. (2022b). Machine Learning and Behaviour Modification for COVID-19. https://accounts.taylorfrancis. com/identityv1/#/login?authorize=trueandclient_id=1e4a7127d79e837214ba643156 e37f599d0c2cd15c69d1b2d31cdcf9ee2279d0andresponse_type=codeandscope=mail andredirect_uri=https:%2F%2Fwww.taylorfrancis.com%2Flogin%2Fcallbackandstat e=eyJjdXJyZW50VXJsIjoiaHR0cHM6Ly93d3cudGF5bG9yZnJhbmNpcy5jb20vYm 9va3MvZWRpdC8xMC4xMjAxLzk3ODEwMDMxNzg5MDMvbWVkaWNhbC1ia W90ZWNobm9sb2d5LWJpb3BoYXJtYWNldXRpY3MtZm9yZW5zaWMtc2NpZ W5jZS1iaW9pbmZvcm1hdGljcy1oYWppeWEtbWFpcm8taW51d2EtaWZlb21hLW 1hdXJlZW4tZXplb251LWNoYXJsZXMtb2x1d2FzZXVuLWFkZXR1bmppLWVtb WFudWVsLW9sdWZlbWktZWt1bmRheW8tYWJ1YmFrYXItZ2lkYWRvLWFiZH VscmF6YWstaWJyYWhpbS1iZW5qYW1pbi1ld2EtdWJpIn0%3Dandflow=andbran d=ubx. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 17. eBook ISBN 9781003178903. Adetunji CO, Olaniyan OT, Adeyomoye O, Dare A, Adeniyi MJ, Alex E, Rebezov M, Garipova L and Shariati MA. (2022c). eHealth, mHealth, and Telemedicine for COVID-19 Pandemic. In: Pani SK, Dash S, dos Santos WP, Chan Bukhari SA, Flammini F. 
(eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://link.springer. com/chapter/10.1007/978-3-030-79753-9_10 Adetunji CO, Inobeme A, Tadso J, Olaniyan OT, Abimbola OF, Shahnawaz M, and Anani O. (2022d). Potential of Plastic Waste in Enhancing the level of Pathogenicity of diverse Pathogens in the Marine Biota. In Impact of Plastic Waste on the Marine Biota, (pp. 301-312). Springer, Singapore. https://doi.org/10.1007/978-981-16-5403-9_16 Adetunji CO, Olaniyan OT, Adeyomoye O, Dare A, Adeniyi MJ, Alex E, Rebezov M, Petukhova E and Shariati MA. (2022e). Machine Learning Approaches for COVID-

Next-Generation Sequencing and Omics Data Analysis Techniques

49

19 Pandemic. In: Pani SK, Dash S, dos Santos WP, Chan Bukhari SA, Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://link.springer.com/chapter/ 10.1007/978-3-030-79753-9_8 Adetunji CO, Olaniyan OT, Adeyomoye O, Dare A, Adeniyi MJ, Alex E, Rebezov M, Isabekova O and Shariati MA. (2022f). Smart Sensing for COVID-19 Pandemic. In: Pani SK, Dash S, dos Santos WP, Chan Bukhari SA, Flammini F. (eds). Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://link.springer.com/chapter/10.1007/978-3-03079753-9_9 Adetunji CO, Olaniyan OT, Adeyomoye O, Dare A, Adeniyi MJ, Alex E, Rebezov M, Petukhova E and Shariati MA. (2022g). Internet of Health Things (IoHT) for COVID19. In: Pani SK, Dash S, dos Santos WP, Chan Bukhari SA, Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://link.springer.com/chapter/ 10.1007/978-3-030-79753-9_5 Adetunji CO, Olaniyan OT, Adeyomoye O, Dare A, Adeniyi MJ, Alex E, Rebezov M, Koriagina N and Shariati MA. (2022h). Diverse Techniques Applied for Effective Diagnosis of COVID-19. In: Pani SK, Dash S, dos Santos WP, Chan Bukhari SA, Flammini F. (eds). Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://link.springer. com/chapter/10.1007/978-3-030-79753-9_3 Adetunji CO, Olugbemi OT, Akram M, Laila U, Samuel MO, Oshinjo AM, Adetunji JB, Okotie GE, Esiobu ND, Oyedara OO and Adeyemi FM. (2022i). Application of Computational and Bioinformatics Techniques in Drug Repurposing for Effective Development of Potential Drug Candidate for the Management of COVID-19. 
https://accounts.taylorfrancis.com/identityv1/#/login?authorize=trueandclient_id=1e 4a7127d79e837214ba643156e37f599d0c2cd15c69d1b2d31cdcf9ee2279d0andrespo nse_type=codeandscope=mailandredirect_uri=https:%2F%2Fwww.taylorfrancis.co m%2Flogin%2Fcallbackandstate=eyJjdXJyZW50VXJsIjoiaHR0cHM6Ly93d3cudG F5bG9yZnJhbmNpcy5jb20vYm9va3MvZWRpdC8xMC4xMjAxLzk3ODEwMDMx Nzg5MDMvbWVkaWNhbC1iaW90ZWNobm9sb2d5LWJpb3BoYXJtYWNldXRp Y3MtZm9yZW5zaWMtc2NpZW5jZS1iaW9pbmZvcm1hdGljcy1oYWppeWEtbWF pcm8taW51d2EtaWZlb21hLW1hdXJlZW4tZXplb251LWNoYXJsZXMtb2x1d2FzZ XVuLWFkZXR1bmppLWVtbWFudWVsLW9sdWZlbWktZWt1bmRheW8tYWJ1 YmFrYXItZ2lkYWRvLWFiZHVscmF6YWstaWJyYWhpbS1iZW5qYW1pbi1ld2Et dWJpIn0%3Dandflow=andbrand=ubx. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition, First Published 2022, Imprint CRC Press. Pages 14. eBook ISBN 9781003178903. Adetunji CO, Samuel MO, Adetunji JB and Oluranti OI. (2022j). Corn Silk and Health Benefits. https://accounts.taylorfrancis.com/identityv1/#/login?authorize=trueand client_id=1e4a7127d79e837214ba643156e37f599d0c2cd15c69d1b2d31cdcf9ee2279 d0andresponse_type=codeandscope=mailandredirect_uri=https:%2F%2Fwww.taylor francis.com%2Flogin%2Fcallbackandstate=eyJjdXJyZW50VXJsIjoiaHR0cHM6Ly9 3d3cudGF5bG9yZnJhbmNpcy5jb20vYm9va3MvZWRpdC8xMC4xMjAxLzk3ODE

50

C. Oluwaseun Adetunji, F. Abimbola Ogundolie, O. Tope Olaniyan et al.

wMDMxNzg5MDMvbWVkaWNhbC1iaW90ZWNobm9sb2d5LWJpb3BoYXJtYW NldXRpY3MtZm9yZW5zaWMtc2NpZW5jZS1iaW9pbmZvcm1hdGljcy1oYWppe WEtbWFpcm8taW51d2EtaWZlb21hLW1hdXJlZW4tZXplb251LWNoYXJsZXMtb 2x1d2FzZXVuLWFkZXR1bmppLWVtbWFudWVsLW9sdWZlbWktZWt1bmRhe W8tYWJ1YmFrYXItZ2lkYWRvLWFiZHVscmF6YWstaWJyYWhpbS1iZW5qYW 1pbi1ld2EtdWJpIn0%3Dandflow=andbrand=ubx. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903. Adetunji CO, Ogundolie FA, Ajiboye MD, Mathew JT, Inobeme A, Dauda WP and Adetunji JB. (2022k). Nano-engineered Sensors for Food Processing. In Bio-and Nano-sensing Technologies for Food Processing and Packaging, (pp. 151-166). Royal Society of Chemistry. DOI:10.1039/9781839167966-00151. Adetunji, CO, Mathew JT, Inobeme A, Olaniyan OT, Singh KRB, Abimbola OF, Nayak V, Singh J and Singh RP. (2022l). Microbial and Plant Cell Biosensors for Environmental Monitoring. In: Singh RP, Ukhurebor KE, Singh J, Adetunji CO, Singh, K.R. (eds) Nanobiosensors for Environmental Monitoring. Springer, Cham. https://link.springer.com/chapter/10.1007/978-3-031-16106-3_9 Adetunji, C. O., Inobeme, A., Singh, K. R., Bodunrinde, R. E., Mathew, J. T., Olaniyan, O. T., Abimbola, O. F., Singh, J., Nayak, V. and Singh, R. P., (2022m) Genomic Analysis of Heavy Metal-Resistant Genes in Wastewater Treatment Plants. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 113-126). CRC Press. Adetunji, C.O., Abimbola, O.F., Singh, K.R., Olaniyan, O.T., Bodunrinde, R.E., Inobeme, A., Mathew, J.T., Singh, J. and Singh, R.P., (2022m). Microbe Performance and Dynamics in Activated Sludge Digestion. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 99-112). CRC Press. https://doi.org/10.1201/ 9781003354147 Adetunji, C.O., Ogundolie, F.A., Ajiboye, M.D., Mathew, J.T., Inobeme, A., Titilayo, O., Olaniyan, O.T., Ijabadeniyi, O.A., Ajayi, O.O., Dauda, W.P. 
and Ghazanfar, S., (2022o). Bio-and Nanosensors in the Food Industry. In Bio-and Nano-sensing Technologies for Food Processing and Packaging (pp. 22-36). Royal Society of Chemistry. Adetunji, C.O., Mathew, J.T., Singh, K.R., Bodunrinde, R.E., Inobeme, A., Olaniyan, O.T., Abimbola, O.F., Singh, J., Nayak, V. and Singh, R.P., (2022p) Molecular Characterization of Multidrug-Resistant Genes in Wastewater Treatment Plants. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 127-141). CRC Press. Adetunji, C.O., Ogundolie, F.A., Mathew, J.T., Inobeme, A., Titilayo, O., Olaniyan, O.T., Ijabadeniyi, O.A., Ajiboye, M.D., Ajayi, O.O., Dauda, W. and Ghazanfar, S., (2023a). Graphene-based nanomaterials for targeted drug delivery and tissue engineering. Novel Platforms for Drug Delivery Applications, pp. 277-288. https://doi.org/ 10.1016/B978-0-323-91376-8.00014-8 Adetunji, C.O., Ogundolie, F.A., Mathew, J.T., Inobeme, A., Titilayo, O., Olaniyan, O.T., Ghazanfar, S., Ijabadeniyi, O.A., Ajiboye, M.D., Ajayi, O.O. and Dauda, W., (2023b). Nanotube platforms for effective drug delivery applications. Novel Platforms for Drug

Delivery Applications, pp.317-332. https://doi.org/10.1016/B978-0-323-91376-8.00 005-7 Alaimo S, Marceca GP, Giugno R, Ferro A and Pulvirenti A. (2018). Current knowledge and computational techniques for grapevine meta-omics analysis. Frontiers in Plant Science, 8, 2241. https://www.frontiersin.org/articles/10.3389/fpls.2017.02241/full Ayodeji AO, Ogundolie FA, Bamidele OS, Kolawole AO and Ajele JO. (2017). Raw starch degrading, acidic-thermostable glucoamylase from Aspergillus fumigatus CFU-01: Purification and characterization for biotechnological application. J Microbiol Biotechnol, 6, 90-100. https://www.semanticscholar.org/paper/Raw-StarchDegrading%2C-Acidic-Thermostable-from-and-Ayodeji-Ogundolie/a890fc2eacb99 9b 4122 acd0a59 eaa855e643b353 Balamurugan Jagadeesan, Gerner-Smidt P, Allard MW, Leuillet S, Winkler A, Xiao Y, Chaffron S, Van Der Vossen J, Tang S, Katase M, McClure P, Kimura B, Ching Chai L, Chapman J, Grant K. (2019) The use of next-generation sequencing for improving food safety: Translation into practice. Food Microbiology, 79 (2019) 96-115. https://www.sciencedirect.com/science/article/pii/S0740002018305306?via%3Dihub Bao R, Huang L, Andrade J, Tan W, Kibbe WA, Jiang H, Feng G. (2014). Review of Current Methods, Applications, and Data Management for the Bioinformatics Analysis of Whole Exome Sequencing. Cancer Informatics, 13(S2) 67-82. https://pubmed.ncbi.nlm.nih.gov/25288881/ Ballereau S, Glaab E, Kolodkin A, Chaiboonchoe A, Biryukov M, Vlassis N, Ahmed H, Pellet J, Baliga N, Hood L, Schneider R, Balling R and Auffray C. (2013). Functional Genomics, Proteomics, Metabolomics and Bioinformatics for Systems Biology. Springer Science. Chapter 1. 3-41. https://core.ac.uk/download/pdf/11858431.pdf Behera RN, Roy M and Dash S. (2016). Ensemble-based hybrid machine learning approach for sentiment classification-a review. International Journal of Computer Applications, 146(6), 31-36. 
https://www.researchgate.net/publication/ 305361957_Ensemble_ based_Hybrid_Machine_Learning_Approach_for_Sentiment_Classification-_A_ Review Chaillou S, Chaulot-Talmon A, Caekebeke H, Cardinal M, Christieans S, Denis C, Hélène Desmonts M, Dousset X, Feurer C, Hamon E, Joffraud JJ, La Carbona S, Leroi F, Leroy S, Lorre S, Macé S, Pilet M-F, Prévost H, Rivollier M, Roux D, Talon R, Zagorec M and Champomier-Vergès M-C. (2015). Origin and ecological selection of core and food-specific bacterial communities associated with meat and seafood spoilage. The ISME Journal, 9(5), 1105-1118. https://www.nature.com/articles/ ismej2014202 Chen G, Chen C and Lei Z. (2017). Meta-omics insights in the microbial community profiling and functional characterization of fermented foods. Trends in Food Science and Technology, 65, 23-31. https://www.researchgate.net/publication/ 316803677_ Meta-omics_insights_in_the_microbial_community_profiling_and_functional_ characterization_of_fermented_foods Cui F, Cheng L and Zou Q. (2021). Briefings in functional genomics special section editorial: Analysis of integrated multiple omics data. Briefings in Functional Genomics, 20(4), 196-197. https://pubmed.ncbi.nlm.nih.gov/34279568/

Dash S and Abraham A. (2018). Kernel-based chaotic firefly algorithm for diagnosing Parkinson’s disease. In International Conference on Hybrid Intelligent Systems, (pp. 176-188). Springer, Cham. Dash S, Abraham A, Luhach AK, Mizera-Pietraszko J and Rodrigues JJ. (2020). Hybrid chaotic firefly decision-making model for Parkinson’s disease diagnosis. International Journal of Distributed Sensor Networks, 16(1), 1550147719895210. https:// journals.sagepub.com/doi/full/10.1177/1550147719895210 Dash S, Ahmad M and Iqbal T. (2021). Mobile cloud computing: A green perspective. In Intelligent Systems. vol. 185, pp: 523-533, Springer, Singapore. http:// doi.org/ 10.1007/978- 981-33-6081-5-46.

Dash S, Thulasiram R and Thulasiraman P. (2017, December). An enhanced chaos-based firefly model for Parkinson's disease diagnosis and classification. In 2017 International Conference on Information Technology (ICIT) (pp. 159-164). IEEE. Dash S, Thulasiram R and Thulasiraman P. (2019). Modified firefly algorithm with chaos theory for feature selection: A predictive model for medical data. International Journal of Swarm Intelligence Research (IJSIR), 10(2), 1-20. https://www.researchgate.net/publication/331995476_Modified_Firefly_Algorithm_With_Chaos_Theory_for_Feature_Selection_A_Predictive_Model_for_Medical_Data den Besten HM, Amézquita A, Bover-Cid S, Dagnas S, Ellouze M, Guillou S, Nychas G, O'Mahony C, Pérez-Rodriguez F and Membré JM. (2018). Next generation of microbiological risk assessment: Potential of omics data for exposure assessment. International Journal of Food Microbiology, 287, 18-27. Dong X, Liu C and Dozmorov M. (2021). Review of multi-omics data resources and integrative analysis for human brain disorders. Briefings in Functional Genomics, 20(4), 223-234. https://pubmed.ncbi.nlm.nih.gov/33969380/ Ellouze M, Guillou S, Nychas G, O'Mahony C, Pérez-Rodriguez F, Membré JM. (2018). Next-generation of microbiological risk assessment: Potential of omics data for exposure assessment. International Journal of Food Microbiology, 287, 18-27. Fadare OA, Omisore NO, Adegbite OB, Awofisayo OA, Ogundolie FA, Adesanwo JK and Obafemi CA. (2021). Structure based design, stability study and synthesis of the dinitrophenylhydrazone derivative of the oxidation product of lanosterol as a potential P. falciparum transketolase inhibitor and in-vivo antimalarial study. In Silico Pharmacology, 9(1), 1-16. https://pubmed.ncbi.nlm.nih.gov/34168948/ Feng Min, Sumei Wang and Li Zhang. (2015). Survey of Programs Used to Detect Alternative Splicing Isoforms from Deep Sequencing Data In Silico.
Hindawi Publishing Corporation BioMed Research International, Volume 2015, Article ID 831352, 9 pages. https://www.hindawi.com/journals/bmri/2015/831352/ Guohua Wang, Yunlong Liu, Dongxiao Zhu, GunnarW. Klau and Weixing Feng (2015). Bioinformatics Methods and Biological Interpretation for Next-Generation Sequencing Data. Hindawi Publishing Corporation BioMed Research International, Volume 2015, Article ID 690873, 2 pages. https://www.hindawi.com/ journals/bmri/ 2015/690873/

Jourdon A, Scuderi S, Capauto D, Abyzov A and Vaccarino FM. (2021). PsychENCODE and beyond: Transcriptomics and epigenomics of brain development and organoids. Neuropsychopharmacology, 46(1), 70-85. https://www.ncbi.nlm.nih.gov/pmc/ articles/PMC7689467/ Joyce AR and Palsson BØ. (2006). The model organism as a system: Integrating'omics' data sets. Nature Reviews Molecular Cell Biology, 7(3), 198-210. https:// pubmed.ncbi.nlm.nih.gov/16496022/ Jun Li, Hu Chen, Yumeng Wang, Mei-Ju May Chen, and Han Liang. (2021). NextGeneration Analytics for Omics Data. Cancer Cell, 39. 1-6. https:// www.cell.com/cancer-cell/fulltext/S1535-6108(20)30433-5 Kaur P, Singh A and Chana I. (2021). Computational techniques and tools for omics data analysis: State-of-the-art, challenges, and future directions. Archives of Computational Methods in Engineering, 28(7), 4595-4631. https://www.semanticscholar.org/paper/ Computational-Techniques-and-Tools-for-Omics-Data-KaurSingh/ce0a5dcee5973c98d48ff144fbcf94f316af5e45 Kergourlay G, Taminiau B, Daube G and Vergès MCC. (2015). Metagenomic insights into the dynamics of microbial communities in food. International Journal of Food Microbiology, 213, 31-39. https://pubmed.ncbi.nlm.nih.gov/26414193/ Kulski JK. (2015). Next-Generation Sequencing — An Overview of the History, Tools, and “Omic” Applications. Intech. Chapter 1. https://www.intechopen.com/chapters/ 49602 Lee C-Y, Chiu Y-C, Wang L-B, Kuo Y-L, Chuang EY, Lai L-C, Tsai M-H. (2013). Common applications of next-generation sequencing technologies in genomic research. Transl Cancer Res, 2013; 2(1):33-45. https://tcr.amegroups.com/article/ view/962/html Misra BM, Langefeld C, Olivier M and Cox LA. (2019). Integrated omics: Tools, advances and future approaches. Journal of Endocrinology. Society for Endocrinology. 62, R21R45. https://jme.bioscientifica.com/view/journals/jme/62/1/JME-18-0055.xml Meng C, Zeleznik OA, Thallinger GG, Kuster B, Gholami AM and Culhane AC. (2016). 
Dimension reduction techniques for the integrative analysis of multi-omics data. Briefings in Bioinformatics, 17(4), 2016, 628–641 https://pubmed.ncbi.nlm.nih.gov/ 26969681/ Mrozek D, Ste˛pien´ K, Grzesik P and Małysiak-Mrozek B. (2021). A Large-Scale and Serverless Computational Approach for Improving Quality of NGS Data Supporting Big Multi-Omics Data Analyses. Front Genet, 12:699280. doi: 10.3389/fgene.2021.699280. https://www.frontiersin.org/articles/10.3389/ fgene. 2021.699280/full Nomura J, Mardo M and Takumi T. (2021). Molecular signatures from multi‐omics of autism spectrum disorders and schizophrenia. Journal of Neurochemistry, 159(4), 647-659. https://onlinelibrary.wiley.com/doi/full/10.1111/jnc.15514 Ogundolie FA. (2015). Characterization Of A Purified β–Amylase From Black Marble Vine (Dioclea reflexa) Seeds (Doctoral dissertation, Federal University of Technology, Akure). http://196.220.128.81:8080/xmlui/handle/123456789/4407 Ogundolie FA. (2021). Cloning of α-Amylase and Pullulanase Genes of Bacillus licheniformis-FAO. CP7 from cocoa (Theobroma cacao L.) Pods and Biochemical

Characterization of the Expressed Enzymes (Doctoral dissertation, Federal University of Technology, Akure). http://196.220.128.81:8080/xmlui/handle/123456789/4548 Ogundolie FA, Ayodeji AO, Olajuyigbe FM, Kolawole AO and Ajele JO. (2022). Biochemical Insights into the functionality of a novel thermostable β-amylase from Dioclea reflexa. Biocatalysis and Agricultural Biotechnology, 42, 102361. https://www.sciencedirect.com/science/article/abs/pii/S1878818122000883?via%3D ihub Oh M, Park S, Kim S and Chae H. (2021). Machine learning-based analysis of multi-omics data on the cloud for investigating gene regulations. Briefings in Bioinformatics, 22(1), 66-76. https://pubmed.ncbi.nlm.nih.gov/32227074/ Olaniyan Olugbemi T, Adetunji Charles O, Adeniyi Mayowa J, Hefft Daniel Ingo. 2022a. Machine Learning Techniques for High-Performance Computing for IoT Applications in Healthcare. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445. https://www.taylorfrancis.com/chapters/edit/ 10.1201/9780367548445-20/machine-learning-techniques-high-performancecomputing-iot-applications-healthcare-olugbemi-olaniyan-charles-adetunji-mayowaadeniyi-daniel-ingo-hefft Olaniyan Olugbemi T, Adetunji Charles O, Adeniyi Mayowa J, Hefft Daniel Ingo. In. Computational Intelligence in IoT Healthcare. 2022 b. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445. https://www.taylorfrancis.com/books/edit/ 10.1201/9780367548445/deep-learningmachine-learning-iot-biomedical-health-informatics-sujata-dash-subhendu-kumarpani-joel-rodrigues-babita-majhi?refId=bb05e6a3-4290-40e6-97bb4ef2d39a0b6c&context=ubx Oyedara OO, Adeyemi FM, Adetunji CO, Elufisan TO. 2022. 
Repositioning Antiviral Drugs as a Rapid and Cost-Effective Approach to Discover Treatment against SARSCoV-2 Infection. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903. https://www.taylorfrancis.com/ chapters/edit/10.1201/9781003178903-10/repositioning-antiviral-drugs-rapid-costeffective-approach-discover-treatment-sars-cov-2-infection-omotayo-opemipooyedara-folasade-muibat-adeyemi-charles-oluwaseun-adetunji-temidayo-oluyomielufisan Pavkovic M, Pantano L, Gerlach CV, Brutus S, Boswell SA, Everley RA, Shah JV, Sui SH and Vaidya VS. (2019). Multi omics analysis of fibrotic kidneys in two mouse models. Scientific Data, 2019. 6:92. 1-9. https://www.nature.com/articles/s41597-019-0095-5 Pak TR and Kasarskis A. (2015). How Next-Generation Sequencing and Multiscale Data Analysis will Transform Infectious Disease Management. VIEWPOINTS. 61. 16951702. https://academic.oup.com/cid/article/61/11/1695/333895 Picard M, Scott-Boyer MP, Bodein A, Périn O and Droit A. (2021). Integration strategies of multi-omics data for machine learning analysis. Computational and Structural Biotechnology Journal, 19, 3735-3746. https://www.sciencedirect.com/science/ article/pii/S2001037021002683

Rahman AU, Dash S and Luhach AK. (2021). Dynamic MODCOD and power allocation in DVB-S2: A hybrid intelligent approach. Telecommunication Systems, 76(1), 49-61. https://dl.acm.org/doi/abs/10.1007/s11235-020-00700-x Rahman A, Sultan K, Dash S and Khan MA. (2018). Management of resource usage in mobile cloud computing. Int J Pure Appl Math, 119(16), 255-261. https://acadpubl.eu/ hub/2018-119-16/1/26.pdf Reel PS, Reel S, Pearson E, Trucco E and Jefferson E. (2021). Using machine learning approaches for multi-omics data analysis: A review. Biotechnology Advances, 49, 107739. https://pubmed.ncbi.nlm.nih.gov/33794304/ Sahu B, Dash S, Mohanty SN and Rout SK. (2018). Ensemble comparative study for the diagnosis of breast cancer datasets. International Journal of Engineering and Technology, 7(4.15), 281-285. https://www.sciencepubco.com/index.php/ijet/ article/ view/23007 Segata N, Boernigen D, Tickle TL, Morgan XC, Garrett WS and Huttenhower C. (2013). Computational meta'omics for microbial community studies. Molecular Systems Biology, 9(1), 666. https://pubmed.ncbi.nlm.nih.gov/23670539/ Sirangelo TM, Calabrò G. (2020) Next-generation Sequencing Approach and Impact on Bioinformatics: Applications in Agri-Food Field. Journal of Bioinformatics and Systems Biology, 3 (2020): 032-044. https://www.fortunejournals.com/articles/nextgeneration-sequencing-approach-and-impact-on-bioinformatics-applications-inagrifood-field.html Tan B, Ng CM, Nshimyimana JP, Loh LL, Gin KYH and Thompson JR. (2015). Nextgeneration sequencing (NGS) for assessment of microbial water quality: Current progress, challenges, and future opportunities. Frontiers in Microbiology, 6, 1027. https://www.frontiersin.org/articles/10.3389/fmicb.2015.01027/full Thoendel M, Jeraldo PR, Greenwood-Quaintance KE, Yao JZ, Chia N, Hanssen AD, Abdel MP and Patel R. (2016). Comparison of microbial DNA enrichment tools for metagenomic whole genome sequencing. Journal of Microbiological Methods, 127, 141-145. 
https://pubmed.ncbi.nlm.nih.gov/27237775/ Verheijen M, Tong W, Shi L, Gant TW, Seligman B and Caiment F. (2020). Towards the development of an omics data analysis framework. Regulatory Toxicology and Pharmacology, 112, 104621. https://pubmed.ncbi.nlm.nih.gov/32087354/ Wu P-Y, Chandramohan R, Phan JH, Mahle WT, Gaynor JW, Maher KO, Wang MD. (2014). Cardiovascular Transcriptomics and Epigenomics Using Next-Generation Sequencing Challenges, Progress, and Opportunities. Circ Cardiovasc Genet, 2014; 7: 701-710. https://pubmed.ncbi.nlm.nih.gov/25518043/ Wang X, Afrane YA, Yan G and Li J. (2015). Constructing a Genome-Wide LD Map of Wild A. gambiae Using Next-Generation Sequencing. Hindawi Publishing Corporation BioMed Research International. Volume 2015, Article ID 238139, 8 pages. https://www.hindawi.com/journals/bmri/2015/238139/ Wang X. (2018). Next-generation sequencing data analysis. Briefings in Bioinformatics, 19(5), 2018, 1082–1083. https://academic.oup.com/bib/article/19/5/1082/3097953 Zhao S, Watrous K, Zhang C and Zhang B. (2017). Cloud Computing for Next-Generation Sequencing Data Analysis. InTech. Chapter 2. 29-50. https:// www.intechopen.com/ chapters/53334

Chapter 3

In silico Approaches to Vaccine Design

Charles Oluwaseun Adetunji1,*, Frank Abimbola Ogundolie2, Olugbemi Tope Olaniyan3, Simeon Kayowa Olatunde4, Omosigho Omoruyi Pius1, Kehinde Kazeem Kanmodi5 and Lawrence Achilles Nnyanzi5

1Applied Microbiology, Biotechnology and Nanotechnology Laboratory, Department of Microbiology, Edo University Iyamho, Auchi, Edo State, Nigeria
2Department of Biotechnology, Baze University Abuja, Nigeria
3Laboratory for Reproductive Biology and Developmental Programming, Department of Physiology, Edo University Iyamho, Nigeria
4Microbiology and Immunology Unit, Department of Life Sciences, All Saints University School of Medicine, Hillsborough Street, Roseau, Commonwealth of Dominica
5School of Health and Life Sciences, Teesside University, Middlesbrough, United Kingdom

*Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics. Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9. © 2023 Nova Science Publishers, Inc.

Abstract

The term ‘in silico’ is a relatively recent one. It refers to tests or experiments performed on a computer, by analogy with the better-known biological terms in vivo and in vitro. Bioinformatics is a branch of biology that uses computer-based approaches to analyze biological systems and to make predictions that can then be confirmed in laboratory investigations and clinical trials. The introduction of computer-based biological procedures has transformed life science research. Bioinformatics tools have relieved medical facilities of much of the strain associated with the cost of laboratory work and animal sacrifice. This chapter therefore provides detailed information on some techniques that can be applied in in silico approaches to vaccine design.

Keywords: in silico approaches, vaccine designing, bioinformatics, computer-based approaches, clinical trials

Introduction

Vaccines are biological preparations given to humans and animals to increase their protection against disease. They stimulate the immune system against pathogens such as bacteria and viruses, or against toxins. These preparations can be made from weakened, killed, or recombinant microbes, or from toxins. Today, vaccination, the introduction of a vaccine to stimulate the immune system against disease, is regarded as the most remarkable human intervention toward improving global health (Nabel, 2013; Greenwood, 2013; Janse et al. 2021). This intervention is delivered through large-scale immunization (Janse et al. 2021).

The journey involved in producing a vaccine starts with the identification of vaccine candidates, a step referred to as the research and discovery stage (RADS). In past decades, RADS was a tedious and time-consuming stage that could take weeks, months, or even years of research and data analysis to arrive at a potential candidate. Traditionally, the process of obtaining a vaccine candidate, also known as vaccine design, starts with growing the pathogen that causes the disease, isolating it from the media, and weakening, killing, or inactivating it before injecting it into the test subject (Gandon et al. 2001; Graham, 2013; Yang et al. 2021). These processes are time-consuming and involve a great deal of tedious laboratory work.

The emergence of vaccinomics (Poland et al. 2009), a field that applies immunogenetics and immunogenomics together with bioinformatics (Oany et al. 2014; Adetunji et al. 2022; Olaniyan et al. 2022a, b; Oyedara et al. 2022), next-generation sequencing, immunoinformatics (Fathollahi et al. 2021), proteomics, and other omics fields, has provided a much-needed revolution in vaccine development. The smallpox, measles, polio, yellow fever, adenovirus, and rubella vaccines each took several years to develop. Today, by contrast, high-throughput sequencing technologies have made it possible to obtain vaccine candidates in days or weeks (Behera et al. 2016; Dash and Abraham, 2018; Dash et al. 2019; Rahman et al. 2021). This chapter reviews the different in silico approaches involved in vaccine design.

In silico Approaches for Vaccine Designing

In silico techniques are useful for categorizing proteins based on their structure and function, and they can be valuable when developing servers that sort these molecules using machine learning (ML) methods (Sunita et al. 2020; Josefsberg and Buckland, 2012; Jarząb et al. 2013). Computational vaccinology is becoming increasingly popular as a way of addressing the difficulty of vaccine design. Analysis of antigen processing, population coverage, antigenicity, and conservancy, together with toxicity prediction, allergenicity evaluation, and prediction of B-cell and T-cell epitopes, are all crucial aspects of designing and manufacturing effective vaccines against many viruses and malignancies. Various bioinformatics tools and online web servers have been created to perform these analyses, and scientists must choose the most appropriate server for each component based on its precision (Kardani et al. 2020).

Several authors have used in silico approaches to design and develop vaccines against viruses such as coronavirus (Oany et al. 2014; Sharmin and Islam, 2014), human immunodeficiency virus-1 (HIV-1) (De Groot et al. 2003; Kardani et al. 2019; Kardani et al. 2020), human papillomavirus (HPV) (Negahdaripour et al. 2017; de Oliveira et al. 2015; Namvar et al. 2020), Dengue virus (Chakraborty et al. 2010; Fahimi et al. 2016; Ali et al. 2017; Subramaniyan et al. 2018), hepatitis C virus (HCV) (Ikram et al. 2018), Kaposi sarcoma (Chauhan et al. 2019), influenza (Eickhoff et al. 2019; Hasan et al. 2019), Ebola (Khan et al. 2015; Ayub et al. 2016; Bazhan et al. 2019), and Zika (Janahi et al. 2017); bacteria such as those causing meningitis (Munikumar et al. 2013), Helicobacter pylori (Moise et al. 2015; Zhou et al. 2009; Meza et al. 2017), Leptospira serovars (Umamaheswari et al. 2012), Mycobacterium tuberculosis (De Groot et al. 2005; Shah et al. 2018), and Coxiella burnetii (Rashidian et al. 2020; Scholzen et al. 2019); microbial cells (Adetunji et al. 2022l); parasites such as Plasmodium falciparum, which causes malaria (Pandey et al. 2018), Taenia solium (Kaur et al. 2020),

Leishmania (Seyed et al. 2014; Khatoon et al. 2017), and Echinococcus granulosus (Pourseif et al. 2019); fungi such as Aspergillus fumigatus (Thakur et al. 2016) and Candida albicans (Tarang et al. 2020); and breast cancer (Mahdavi and Moreau, 2016; Atapour et al. 2020). Authors who have used in silico techniques to predict potential vaccine candidates are listed in Table 1 for bacteria, Table 2 for fungi, Table 3 for viruses, and Table 4 for parasites.

Table 1. Specific authors that have used in silico techniques to predict potential vaccine candidates against bacteria

Authors | Organism | Immunoinformatics tool
Meza et al. 2017 | Helicobacter pylori | RANKPEP server, NetMHC I pan server, MetaMHCII, Propred-II, NetMHC II
Jahangiri et al. 2011 | Listeria monocytogenes | PRALINE, PROTPARAM Online software, TmhcPred, RANKPEP, ProPred-1, MAPPP, HLApred, BIMAS, nHLApred, CTLpred
Farhadi et al. 2015 | Klebsiella pneumoniae | PRED-TMBB server, Propred, I-TASSER, PROCHECK, Discotope Server, MetaMHCII online tool, ProtParam tool
Delfani et al. 2015 | Staphylococcus aureus | Bcepred, Discotope
Fatoba et al. 2021 | Enterococcus faecium | NetCTL v1.2, Protparam server, Java Codon Adaptation Tool (JCAT), naccess 2.1.1 package
Hajizade et al. 2016 | Shigella, Enterotoxigenic or Enterohemorrhagic Escherichia coli | BCPred, Discotope, VaxiJen, SYFPEITHI, ProPred
Zeinalzadeh et al. 2014 | Pseudomonas aeruginosa | MHCPred 2.0 server, JCAT
Zeinalzadeh et al. 2014 | Escherichia coli | BepiPred, ABCpred, DiscoTope, SYFPEITHI, VaxiJen, Algpred
Aminnezhad et al. 2020 | Treponema pallidum | BepiPred, VaxiJen, ABCpred servers, DiscoTope 1.2 server

Table 2. Specific authors that have used in silico techniques to predict potential vaccine candidates against fungi

Authors | Organism | Immunoinformatics tool
Thakur and Shankar (2016) | Aspergillus fumigatus | VaxiJen v2.0 server, IEDB-AR
Tarang et al. (2020) | Candida albicans | IEDB-AR, ANTIGENpro, AllerTOP, VaxiJen v2.0 server

Table 3. Specific authors that have used in silico techniques to predict potential vaccine candidates against viruses

Authors | Organism | Immunoinformatics tool
Kathwate (2022) | SARS-CoV2 | MHC-NP, netCTLpan1.1, RANKPEP, VaxiJen, BepiPred
Samad et al. (2022) | Bovine leukemia virus (BLV) | ANTIGENpro, VaxiJen, NetCTL-1.2, ToxinPred servers, AllergenFP
Yang et al. (2021) | SARS-CoV-2 | DeepVacPred computational framework
Zheng et al. (2017) | Hepatitis B | IEDB MHC Tools, MHC-NP, netCTLpan, and netMHCpan
Dash et al. (2017) | Ebola virus (EBOV) | VaxiJen, NetCTL, IEDB, AllerHunter
Negahdaripour et al. (2017) | Human papillomavirus (HPV)-caused cervical cancer | NetMHC 4.0, IEDB MHC Tools, RANKPEP, MHCPred V.2.0, LBtope, EPMLR, BCPREDS, and BepiPred 1.0b

Table 4. Specific authors that have used in silico techniques to predict potential vaccine candidates against parasites

Authors: Khatoon et al. (2017); Ajibola et al. (2021); Kaur et al. (2020); Nakayasu et al. (2012); Damfo et al. (2017); Gomase et al. (2012)
Organism: Leishmania; Plasmodium falciparum; Plasmodium falciparum; Schistosoma haematobium; Taenia solium; Trypanosoma cruzi
Immunoinformatics tool: IEDB analysis tools, BCPREDS server, MODELLER program, AlgPred; PopGenome R package, I-TASSER, Random forest algorithm; IEDB analysis tools; Bepipred, RankPep; IEDB MHC Tools, ANTIGENpro software, BepiPred-1.0; IEDB MHC Tools
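T-cell epitope predictors of the kind listed in the tables typically score short, fixed-length peptides cut from an antigen sequence, and conservancy analysis asks how many strains carry a given peptide unchanged. The following is a minimal sketch of those two steps in plain Python. It assumes a window of 9 residues (a common choice for MHC class I binders, not a value stated in this chapter), uses made-up toy sequences, and does not call or reproduce any of the servers named above; the function names are hypothetical.

```python
def enumerate_peptides(sequence, length=9):
    """Return every overlapping peptide of the given length (sliding window)."""
    seq = sequence.upper()
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def conservancy(peptide, strains):
    """Fraction of strain sequences that contain the peptide as an exact match."""
    if not strains:
        return 0.0
    hits = sum(1 for strain in strains if peptide in strain.upper())
    return hits / len(strains)


def conserved_peptides(reference, strains, length=9, min_fraction=1.0):
    """Map each reference peptide to its conservancy, keeping those above the cutoff."""
    kept = {}
    for peptide in enumerate_peptides(reference, length):
        fraction = conservancy(peptide, strains)
        if fraction >= min_fraction:
            kept[peptide] = fraction
    return kept


# Toy sequences, invented for illustration only (not real antigens).
reference = "MKTLLVAGAVLSSTAQA"
strains = [
    "MKTLLVAGAVLSSTAQA",  # identical to the reference
    "MKTLLVAGAVLTSTAQA",  # one substitution near the C-terminus
    "MRTLLVAGAVLSSTAQA",  # one substitution near the N-terminus
]

if __name__ == "__main__":
    for peptide, fraction in conserved_peptides(reference, strains).items():
        print(peptide, fraction)
```

Exact string matching is the simplest possible definition of conservancy; dedicated tools also allow a minimum identity threshold below 100%, which this sketch deliberately leaves out.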

Numerous In silico Approaches for Vaccine Designing

Martinelli (2022) reported that immunoinformatics, the application of computational biology to the study of the immune system, has changed the principles of in silico drug design and vaccine development. The author noted that in silico design follows a set of basic steps: epitope identification and target selection, refinement and analysis of the vaccine, and vaccine construction. Reverse vaccinology, rather than the pathological activity of the microbe, is the central point of vaccine development. In silico tools and approaches, such as accurate molecular dynamics simulations and peptide folding prediction, can be used to study both macromolecules and small molecules.

Muhammad et al. (2021) noted that HIV-related infections, which affect the immune system, are linked to tuberculosis, a leading cause of death. The authors showed that an epitope-based vaccine can be designed in silico to counter immune suppression in people living with these infections. They identified potential T-cell and B-cell epitopes from the protein-polysaccharide of M. tuberculosis and from human immunodeficiency virus-1. Vaccine safety, efficiency, and effectiveness can be tested in silico using computational prediction and molecular docking of antigenic peptides.

Drug design is a complicated and expensive area because of the number of criteria a drug must fulfil: it must be bioavailable, non-toxic, safe, and potent. Recently, computational biology, bioinformatics, and 3D structures of molecular drug targets have been used to analyze drug-like molecules and to perform molecular modelling, including ligand-based and structure-based drug design, visualization, protein modelling, molecular docking, molecular dynamics simulation, virtual screening, QSAR, and pharmacophore modelling. These approaches provide an in-depth understanding of the biochemical and physiological interactions and functions of proteins, facilitating applications in medicine and the biological sciences. Muhammad et al. (2021) noted that the binding sites of proteins determine how they function with molecules such as hormones, inhibitors, activators, toxins, and neurotransmitters. The relevant features include geometry, electrostatic charge, and physicochemical properties, which are determined through X-ray crystallography, nuclear magnetic resonance, mass spectrometry, isothermal titration calorimetry, and high-throughput screening. In silico models are used to analyze, classify, and validate ADMET predictions in drug design and development.
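Criteria such as bioavailability are often screened with simple descriptor thresholds before any docking or simulation is run. The sketch below applies Lipinski's rule of five, a widely used drug-likeness heuristic that is not named in this chapter and is brought in here purely as an illustrative assumption. It takes precomputed descriptors as input rather than computing them from a structure; the `Descriptors` container, the function names, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container for this sketch)."""
    mol_weight: float   # molecular weight in daltons
    logp: float         # octanol-water partition coefficient (log P)
    h_donors: int       # number of hydrogen-bond donors
    h_acceptors: int    # number of hydrogen-bond acceptors


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of Lipinski's four thresholds the molecule exceeds."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """The heuristic tolerates at most one violation by default."""
    return rule_of_five_violations(d) <= max_violations


# Illustrative, rounded descriptor values; not taken from the chapter.
small = Descriptors(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4)
large = Descriptors(mol_weight=812.0, logp=6.3, h_donors=7, h_acceptors=13)

if __name__ == "__main__":
    print(passes_rule_of_five(small), passes_rule_of_five(large))
```

A filter like this addresses only oral bioavailability; toxicity and the other ADMET properties mentioned above need their own models.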
Drug design using computational models and chemical datasets is the backbone of drug discovery and of the study of drug interaction and modulation in certain disease conditions. In the post-genomic era, the discovery of therapeutic molecules has been improved by bioinformatics and in silico tools for vaccine design, determination of immunogenic properties, and drug design.

Afrizal et al. (2021) reported that a COVID-19 vaccine can be designed and developed through in silico modelling of a safe, less allergenic, and inexpensive epitope-based vaccine. The authors designed the vaccine through phylogenetic analysis of SARS-CoV-2 sequence data, identification of CD8+ T-cell epitopes, and prediction of protein antigenicity. Immunoinformatics and bioinformatics tools are the modern techniques used in the design and discovery of vaccines. The construction of vaccines for some viruses, such as Ebola, involves potential antigenic proteins such as the envelope glycoprotein, matrix protein VP40, and nucleoprotein. Using immunoinformatics tools and computational techniques, the spike glycoprotein antigen of the novel coronavirus can be used to develop a novel vaccine by identifying 5 MHC II B-cell and 5 MHC I derived T-cell epitopes. These epitopes can be joined with a suitable adjuvant to form a multi-epitope-based vaccine construct, whose antigenicity, solubility, immunogenicity, physicochemical properties, and allergenicity can then be analyzed in silico.

Utpal et al. (2018) reported that epitope-based vaccine candidates and ligand-binding pockets identified through immunoinformatics could be used against the Oropouche virus (OROV). The authors identified B-cell and T-cell epitopes of the antigenic OROV polyprotein that could generate cell-mediated and humoral immunity. Their docking simulations showed that vaccine candidates with binding interactions could be developed. Onyeka et al. (2021) reported that bioinformatics techniques are essential tools for developing vaccine candidates against infectious diseases such as that caused by severe acute respiratory syndrome coronavirus 2. These bioinformatics platforms are used to analyze antigenic epitopes and antibody structures, predict protein-peptide docking, simulate antigen-antibody reactions, and support drug repurposing.

Muhammad et al. (2020) noted that respiratory syncytial virus is responsible for respiratory disorders and that, despite advances in research, no vaccine is available against the virus. The authors developed a multi-epitope-based subunit vaccine from the fusion protein and glycoprotein using a computational approach and in silico cloning. Saad et al.
(2021) described Marburg virus disease as a life-threatening illness with flu-like symptoms and hemorrhagic fever, for which no vaccine is available. Using vaccinology and immunoinformatics techniques, they designed a multi-epitope vaccine based on the structural proteins of the Marburg virus. The authors screened B-cell and T-cell epitopes of the virus and carried out physicochemical characterization, structure validation, human homology assessment, normal-mode analysis, molecular dynamics simulation, and protein-protein docking to assess the immune response against Marburg virus disease pathogenesis; the vaccine construct gave suitable outcomes in these analyses. In constructing the vaccine candidate, the authors also considered the affinity between the construct and receptor molecules such as toll-like receptor-4, and analyzed the simulation trajectories with multiple descriptors, including root mean square deviation, radius of gyration, solvent-accessible surface area, root mean square fluctuation, and hydrogen bonds.

Syed et al. (2020) noted that liver cancer is a global disease that could be managed by developing vaccine candidates through an immunoinformatics framework, with artificial ex vivo proliferation of T cells as an immunotherapeutic strategy. The approach included data mining, functional proteomic analysis, immunogenicity prediction, conservation studies, in vivo validation analysis, and molecular modeling. The authors suggested that ALB and C6 may contain potential epitopes for protective effector molecules in candidate vaccines for liver cancer.

Smrithi and Elizabeth (2021) reported that the Ebola virus is a serious threat to human health and that vaccines and drugs are urgently needed. They described a reverse vaccinology approach with computational algorithms for selecting epitopes against HLA alleles, using the 3D structures of the alleles for docking and validating the results by molecular dynamics.

Mandana et al. (2021) highlighted the importance of vaccine design for the influenza virus and proposed multi-epitope vaccine candidates based on the neuraminidase proteins through an in silico approach. In constructing the vaccine, they investigated its molecular interactions with immune receptors and characterized its physicochemical properties, allergenicity, and toxicity.

Somayeh et al. (2016) reported that Staphylococcus aureus can cause serious infections and is resistant to multiple antibiotics, so alternative methods of controlling S. aureus infections are needed. In silico tools can be used to study, evaluate, and design vaccine candidates against these infections.

Victor et al.
(2019) noted that Helicobacter pylori, which infects the gastric mucosa, affects people across the globe, causing infection, gastritis, peptic ulcers, and cancer. The authors reported that, owing to resistance to several antibiotics and the lack of novel drugs against the bacterium, novel vaccine candidates are needed as anti-H. pylori therapy, developed through bioinformatics tools and an immunoinformatics approach. They designed a novel multi-epitope oral mucosal adjuvant vaccine and screened it for non-allergenicity, antigenicity, molecular weight, isoelectric point, and solubility.

Soltan et al. (2022) reported that Escherichia coli in the digestive system can cause serious complications that may result in life-threatening conditions such as diarrhea. Vaccine development for this type of bacteria is essential
due to the development of antibiotic resistance. The authors used immunoinformatics and reverse vaccinology to detect and screen potential antigens. Lipopolysaccharide assembly protein was identified as a vaccine candidate through BLASTp, and cytotoxic T lymphocyte, B-cell, and helper T lymphocyte epitopes were used to construct the vaccine, linked to an adjuvant. The analysis used computational biology and all-atom molecular dynamics simulation, with docking carried out to estimate binding free energy and predicted affinity.

Bishajit et al. (2021) noted that human coronaviruses are pathogenic and were responsible for a global pandemic, owing to the lack of effective drugs or vaccines to curb their spread. The authors proposed an immunoinformatics approach to design epitope-based polyvalent vaccines from B-cell and T-cell epitopes that are highly antigenic, non-toxic, non-allergenic, non-homologous, and conserved, to provide more robust protection across strains. They also carried out protein-protein docking, immune simulation, MD simulation, and in silico cloning for mass production.

Enzymes with significant industrial value for biotechnology applications, such as glucoamylases (James et al. 2012; Ayodeji et al. 2017; Guan et al. 2022), transketolase (Fadare et al. 2021), alpha and beta amylases, endochitinase (Ogundolie, 2015; Ahmadi et al. 2020; Ogundolie, 2022; Marana et al. 2017), and pullulanases (Ogundolie, 2021), are being used as inhibitors for in silico design of various epitope-based vaccines.

Aftab et al. (2016) reported that Zika virus is fast becoming a threat to humanity. It is transmitted by mosquitoes and is related to pathogenic vector-borne flaviviruses such as dengue, Japanese encephalitis, and West Nile viruses.
The authors reported that the lack of treatment options and of an effective vaccine has stalled eradication of the virus, hence the need to develop a drug or vaccine using a computational approach. Investigating the Zika virus genome allows its core features, properties, and evolutionary relationships to be identified. The authors used the Zika virus envelope glycoprotein and an in silico docking approach to design an epitope-based peptide vaccine candidate.

Sandeep et al. (2017) noted that the conventional method of vaccine development involves stimulating the immune system against the disease-causing pathogen. The authors identified computational means and in silico tools for epitope-based immunotherapy discovery, including prediction of conformational and linear B-cell epitopes, identification of transporter-associated protein binders, prediction of major histocompatibility complex binding, design of epitope-based immunotherapies or vaccines,
and screening for therapeutic characteristics of peptides such as half-life, immune toxicity, and cytotoxicity.

According to Ribas-Aparicio et al. (2017), vaccines have the best cost-benefit ratio of any pharmaceutical product in terms of disease prevention and treatment. Vaccine production involves the selection of antigens or antigenic structures, adjuvants, and carriers through bioinformatics techniques such as reverse vaccinology, structural vaccinology, and immunoinformatics.

Elaheh et al. (2014) reported that human cytomegalovirus is a deadly pathogenic virus that infects the fetus congenitally, resulting in immunosuppression, congenital infection, neurological damage, and death. Epitope-based vaccine design using the virus surface phosphoprotein 65 and glycoprotein B has been proposed for the clinical treatment of this condition. The authors applied bioinformatics tools to map and discover cytomegalovirus immunodominant epitopes in silico, to design a chimeric gene construct, and to assess each epitope's antigenicity, together with in vitro and in vivo verification and validation.

Chukwudozie et al. (2021) noted that designing an effective vaccine, particularly for coronavirus, involves a bioinformatics approach, immune simulations, and in silico techniques to build an epitope peptide-based vaccine against the viral spike protein.

Khan et al. (2020) described visceral leishmaniasis as an epidemic parasitic disease with a high socioeconomic burden, for which there is currently no effective vaccine. The authors carried out immunoinformatics-based epitope prediction and selection, docking, and molecular dynamics simulations. The results suggested that further screening may help to select candidate antigenic proteins for vaccine production.

Vaccination is crucial for preventing infection by pathogens without aggravating resistance.
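
Several of the studies surveyed here validate a vaccine construct by molecular dynamics and report descriptors such as root mean square deviation (RMSD) and radius of gyration. The Python sketch below shows, under simplifying assumptions, how these two descriptors can be computed from atomic coordinates. The function names and toy coordinates are illustrative only and are not taken from any cited study; real analyses normally superimpose each frame on a reference structure before computing RMSD and rely on a dedicated trajectory-analysis package, neither of which is attempted here.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) tuples.

    Assumption: the two structures are already superimposed; no
    alignment step is performed in this sketch.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must be non-empty and of equal size")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of a set of (x, y, z) tuples.

    If no masses are given, every atom is given unit mass.
    """
    if not coords:
        raise ValueError("coordinate set must be non-empty")
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    # Centre of mass, one component per axis.
    center = [
        sum(m * c[axis] for m, c in zip(masses, coords)) / total
        for axis in range(3)
    ]
    squared = sum(
        m * sum((c[axis] - center[axis]) ** 2 for axis in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(squared / total)


if __name__ == "__main__":
    # Toy two-atom "frames" (hypothetical coordinates in arbitrary units).
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    frame = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print("RMSD:", rmsd(reference, frame))
    print("Rg:", radius_of_gyration(frame))
```

In a trajectory analysis these functions would be applied frame by frame; a roughly flat RMSD curve and a steady radius of gyration are commonly read as signs of a stable, compact construct.
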
Increased use of antibiotics has also been shown to exacerbate antimicrobial resistance.

Hammadul et al. (2021) reported that the Chikungunya and Mayaro viruses have re-emerged across the globe. There is presently no drug or vaccine against these viruses, so novel and potent therapies are needed to prevent the spread of infection. The authors identified T-cell and B-cell epitopes of the viruses using computational, molecular docking simulation, and immunoinformatics techniques to design peptide-based vaccine candidates.

Alexander et al. (2021) reported that Bartonella bacilliformis causes Carrion's disease, a highly lethal disease found in the South American Andes. The spread is rapid, and vaccines coupled with sero-diagnostic approaches seem to be the most urgent approach to
curbing the infection. The authors identified immunodominant proteins by heterologous genomic expression of B. bacilliformis; immunoblotting and ELISA then yielded immunodominant antigens that could be used in vaccine development against Carrion's disease.

Patronov and Doytchinova (2013) noted that vaccination is considered the cheapest and most effective way of preventing disease. An immune response is generated by introducing a foreign antigen to the immune system. Most peptide vaccine production relies on chemical synthesis and on the identification of T-cell and B-cell epitopes. The authors noted that bioinformatics, including immunoinformatics and in silico techniques, has accelerated vaccine development.

Matin et al. (2021) highlighted that antibiotic resistance is becoming a global threat, hence the need for alternative approaches such as vaccines to fight infectious diseases. The authors noted that New Delhi metallo-beta-lactamase confers resistance to several beta-lactam antibiotics. Using bioinformatics techniques, they developed a vaccine against New Delhi metallo-beta-lactamase variants.

Zikun et al. (2021) noted that COVID-19 is a devastating infectious disease that has spread across the globe, with millions of deaths recorded. The lack of therapy has driven the development of vaccines to curb the spread. In their study, the authors noted that an in silico deep learning approach provides a predictive computational framework for constructing a multi-epitope vaccine against SARS-CoV-2. Population coverage, allergenicity, antigenicity, toxicity, secondary structure, and physicochemical properties were analyzed, and the construct was validated and optimized, through immunoinformatics and artificial intelligence approaches.

Fahad et al.
(2022) reported that, owing to the evolution of different strains of SARS-CoV-2, there is a need to design a multi-epitope subunit vaccine through structural vaccinology, in silico cloning, and immune simulation that can trigger an immunological response against these strains.

Harish et al. (2021) used immunoinformatics techniques to develop a vaccine against Staphylococcus aureus, which is associated with food poisoning. The pyrogenic, secreted superantigen TSST-1 of Staphylococcus aureus is the main target for the construction of the candidate vaccine.

Table 5. In silico approaches for vaccine designing

S. No.  Disease                           References
1       All                               Martinelli, 2022; Sandeep et al. 2017
2       HIV                               Muhammad et al. 2021
3       SARS-CoV-2 virus                  Afrizal et al. 2021; Onyeka et al. 2021; Bishajit et al. 2021; Chukwudozie et al. 2021; Zikun et al. 2021; Fahad et al. 2022; Muhammad et al. 2020
4       Oropouche virus                   Utpal et al. 2018
5       Marburg virus                     Saad et al. 2021
6       Liver cancer                      Syed et al. 2020
7       Ebola virus                       Smrithi and Elizabeth, 2021
8       Influenza virus                   Mandana et al. 2021
9       Staphylococcus aureus             Somayeh et al. 2016; Harish et al. 2021
10      Helicobacter pylori               Victor et al. 2019
11      Escherichia coli                  Soltan et al. 2022
12      Zika virus                        Aftab et al. 2016
13      Visceral leishmaniasis            Khan et al. 2020
14      Chikungunya and Mayaro viruses    Hammadul et al. 2021
15      Carrion's disease                 Alexander et al. 2021
16      New Delhi metallo-beta-lactamase  Matin et al. 2021

Techniques listed in the table: immunoinformatics, bioinformatics, computational biology, in silico modeling, and MD simulation.

The authors used a computational approach and immune simulation to design a non-toxic, non-allergenic vaccine. A multi-epitope subunit vaccine polypeptide has a number of advantages, such as safety, lower production cost, and the ability to be designed with in silico tools and computational approaches. Table 5 shows in silico approaches for vaccine design.
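
As a rough illustration of how a multi-epitope construct of the kind surveyed above might be assembled and given a first physicochemical screen, the following Python sketch joins placeholder epitopes with linkers and computes a mean hydropathy (GRAVY) score. Everything specific in it is an assumption made for illustration: the adjuvant and epitope sequences are invented placeholders, the linker choices (EAAAK, AAY, GPGPG, KK) are conventions often seen in the multi-epitope literature rather than details reported by the studies cited here, and the hydropathy values follow the Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def assemble_construct(adjuvant, ctl_epitopes, htl_epitopes, bcell_epitopes):
    """Join an adjuvant and three epitope groups into one polypeptide.

    Assumed linker scheme (illustrative, not from the cited studies):
    EAAAK after the adjuvant, AAY between CTL epitopes, GPGPG before
    each HTL epitope, and KK before each B-cell epitope.
    """
    segments = []
    if adjuvant:
        segments.append(adjuvant + "EAAAK")
    if ctl_epitopes:
        segments.append("AAY".join(ctl_epitopes))
    if htl_epitopes:
        segments.append("GPGPG" + "GPGPG".join(htl_epitopes))
    if bcell_epitopes:
        segments.append("KK" + "KK".join(bcell_epitopes))
    return "".join(segments)


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


if __name__ == "__main__":
    # All sequences below are hypothetical placeholders, not real epitopes.
    construct = assemble_construct(
        adjuvant="MKTLLA",
        ctl_epitopes=["ACDEFGHIK", "LMNPQRSTV"],
        htl_epitopes=["WYACDEFGHIKLMNP"],
        bcell_epitopes=["QRSTVWYACD"],
    )
    print(construct)
    print("length:", len(construct), "GRAVY:", round(gravy(construct), 3))
```

A negative GRAVY value is usually read as a sign of a hydrophilic polypeptide. A real screen would go on to predict antigenicity, allergenicity, and toxicity with dedicated tools, which this sketch does not attempt.
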

Conclusion

Immunoinformatics is a fast-growing discipline of bioinformatics that employs cutting-edge technology to predict possible peptides for vaccine development against a variety of diseases. Recent breakthroughs in immunoinformatics have made it possible to use available data to predict the most effective epitopic regions in antigenic proteins, leading to the development of epitope-based vaccines. This chapter has provided insights and detailed information on the authors who have used in silico approaches for vaccine design, the specific organisms involved (bacteria, fungi, viruses, and parasites), and the immunoinformatics tools used.

References

Adetunji, C. O., Nwankwo, W., Olayinka, A. S., Olugbemi, O. T., Akram, M., Laila, U., Samuel, M. O., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E. and Esiobu, N. D. (2022a). Computational Intelligence Techniques for Combating COVID-19. In: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics (pp. 251-269). CRC Press. doi: 10.1201/9781003178903-16.
Adetunji, C. O., Nwankwo, W., Olayinka, A. S., Olugbemi, O. T., Akram, M., Laila, U., Olugbenga, M. S., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E. and Esiobu, N. D. (2022b). Machine Learning and Behaviour Modification for COVID-19. In: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics, 1st edition. CRC Press. 17 pages. eBook ISBN 9781003178903. doi: 10.1201/9781003178903-17.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Garipova, L. and Shariati, M. A. (2022c). eHealth, mHealth, and Telemedicine for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_10.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Petukhova, E. and Shariati, M. A. (2022d). Machine Learning Approaches for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and
Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_8.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Isabekova, O. and Shariati, M. A. (2022e). Smart Sensing for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_9.
Adetunji, C. O., Inobeme, A., Tadso, J., Olaniyan, O. T., Abimbola, O. F., Shahnawaz, M., & Anani, O. (2022f). Potential of Plastic Waste in Enhancing the level of Pathogenicity of diverse Pathogens in the Marine Biota. In Impact of Plastic Waste on the Marine Biota (pp. 301-312). Springer, Singapore.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Petukhova, E. and Shariati, M. A. (2022g). Internet of Health Things (IoHT) for COVID-19. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_5.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Koriagina, N. and Shariati, M. A. (2022h). Diverse Techniques Applied for Effective Diagnosis of COVID-19. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_3.
Adetunji, C. O., Olugbemi, O. T., Akram, M., Laila, U., Samuel, M. O., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E., Esiobu, N. D., Oyedara, O. O. and Adeyemi, F. M. (2022i).
Application of Computational and Bioinformatics Techniques in Drug Repurposing for Effective Development of Potential Drug Candidate for the Management of COVID-19. In: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics, 1st edition. CRC Press. 14 pages. eBook ISBN 9781003178903. doi: 10.1201/9781003178903-15.
Adetunji, C. O., Samuel, M. O., Adetunji, J. B. and Oluranti, O. I. (2022j). Corn Silk and Health Benefits. In: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics, 1st edition. CRC Press. 12 pages. eBook ISBN 9781003178903. doi: 10.1201/9781003178903-11.
Adetunji, C. O., Ogundolie, F. A., Ajiboye, M. D., Mathew, J. T., Inobeme, A., Dauda, W. P., & Adetunji, J. B. (2022k). Nano-engineered Sensors for Food Processing. In Bio- and Nano-sensing Technologies for Food Processing and Packaging (pp. 151-166). Royal Society of Chemistry. doi:10.1039/9781839167966-00151.
Adetunji, Oluwaseun C., John Tsado Mathew, Abel Inobeme, Olugbemi T. Olaniyan, Kshitij RB Singh, Ogundolie Frank Abimbola, Vanya Nayak, Jay Singh & Ravindra Pratap Singh (2022l). Microbial and Plant Cell Biosensors for Environmental Monitoring. In: Singh, R. P., Ukhurebor, K. E., Singh, J., Adetunji, C. O., Singh, K.
R. (eds) Nanobiosensors for Environmental Monitoring. Springer, Cham. https://doi.org/10.1007/978-3-031-16106-3_9.
Adetunji, C. O., Bodunrinde, R. E., Inobeme, A., Singh, K. R., Mathew, J. T., Olaniyan, O. T., Abimbola, O. F., Singh, J., Nayak, V. and Singh, R. P. (2022m). Microbial Community Analysis of Contaminated Soils. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 83-97). CRC Press.
Adetunji, C. O., Inobeme, A., Singh, K. R., Bodunrinde, R. E., Mathew, J. T., Olaniyan, O. T., Abimbola, O. F., Singh, J., Nayak, V. and Singh, R. P. (2022n). Genomic Analysis of Heavy Metal-Resistant Genes in Wastewater Treatment Plants. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 113-126). CRC Press.
Afrizal Firmansyah M., Aris Susilo, Septina D. Haryanti, Rina Herowati (2021). Epitope-Based Vaccine Design with Bioinformatics Approach to Suppress Spike Glycoprotein of SARS-CoV-2. Jurnal Farmasi Indonesia Vol. 18 No. 2, 82-96.
Aftab Alam, Shahnawaz Ali, Shahzaib Ahamad, Md. Zubbair Malik and Romana Ishrat (2016). From ZikV genome to vaccine: in silico approach for the epitope-based peptide vaccine against Zika virus envelope glycoprotein. Immunology, 149, 386–399. doi:10.1111/imm.12656.
Ahmadi, F., Dorosti, H., Ghasemi, Y., & Nezafat, N. (2020). In silico design of epitope-based allergy vaccine against bellatella germanica cockroach allergens. International Journal of Peptide Research and Therapeutics, 26(4), 1739-1749.
Ajibola, O., M. F. Diop, A. Ghansah, L. Amenga-Etego, L. Golassa, T. Apinjoh, M. Randrianarivelojosia, O. Maiga-Ascofare, W. Yavo, M. Bouyou-Akotet, K. M. Oyebola, B. Andagalu, U. D’Alessandro, D. Ishengoma, A. A. Djimde, E. Kamau, & A. Amambua-Ngwa, “In silico characterisation of putative Plasmodium falciparum vaccine candidates in African malaria populations.” Scientific reports 11, no. 1 (2021): 1-13.
Alexander A. Dichter, Tilman G.
Schultze, Anne Wenigmann, Wibke Ballhorn, Andreas Latz, Elif Schlüfter, Palmira Ventosilla, Humberto Guerra Allison, Cesar Ugarte-Gil, Pablo Tsukayama, Volkhard A. J. Kempf (2021). Identification of immunodominant Bartonella bacilliformis proteins: a combined in-silico and serology approach. Lancet Microbe 2021; 2: e685–94. https://doi.org/10.1016/S2666-5247(21)00184-1.
Ali, M., Pandey, R. K., Khatoon, N., Narula, A., Mishra, A., & Prajapati, V. K. (2017). Exploring dengue genome to construct a multi-epitope based subunit vaccine by utilizing immunoinformatics approach to battle against dengue infection. Scientific reports, 7(1), 1-13.
Aminnezhad, S., Abdi-Ali, A., Ghazanfari, T., Bandehpour, M., & Zarrabi, M. (2020). Immunoinformatics design of multivalent chimeric vaccine for modulation of the immune system in Pseudomonas aeruginosa infection. Infection, Genetics and Evolution, 104462. doi:10.1016/j.meegid.2020.104462.
Atapour, A., Negahdaripour, M., Ghasemi, Y., Razmjuee, D., Savardashtaki, A., Mousavi, S. M., & Nezafat, N. (2020). In silico designing a candidate vaccine against breast cancer. International Journal of Peptide Research and Therapeutics, 26(1), 369-380.
Ayodeji, A. O., Ogundolie, F. A., Bamidele, O. S., Kolawole, A. O., & Ajele, J. O. (2017). Raw starch degrading, acidic-thermostable glucoamylase from Aspergillus fumigatus
CFU-01: purification and characterization for biotechnological application. J Microbiol Biotechnol, 6, 90-100.
Ayub, G., Waheed, Y., & Najmi, M. H. (2016). Prediction and conservancy analysis of promiscuous T-cell binding epitopes of Ebola virus L protein: An in silico approach. Asian Pacific Journal of Tropical Disease, 6(3), 169-173.
Bazhan, S. I., Antonets, D. V., Karpenko, L. I., Oreshkova, S. F., Kaplina, O. N., Starostina, E. V., & Ilyichev, A. A. (2019). In silico designed ebola virus T-cell multi-epitope DNA vaccine constructions are immunogenic in mice. Vaccines, 7(2), 34.
Behera, R. N., Roy, M., & Dash, S. (2016). Ensemble-based hybrid machine learning approach for sentiment classification-a review. International Journal of Computer Applications, 146(6), 31-36.
Bishajit Sarkar, Md. Asad Ullah, Yusha Araf, Nafisa Nawal Islam and Umme Salma Zohora (2021). Immunoinformatics-guided designing and in silico analysis of epitope-based polyvalent vaccines against multiple strains of human coronavirus (HCoV). Expert Review of Vaccines. 1-21. https://doi.org/10.1080/14760584.2021.1874925.
Chakraborty, S., Chakravorty, R., Ahmed, M., Rahman, A., Waise, T. M., Hassan, F., Rahman, M., & Shamsuzzaman, S. (2010). A computational approach for identification of epitopes in dengue virus envelope protein: a step towards designing a universal dengue vaccine targeting endemic regions. In silico biology, 10(5-6), 235-246.
Chauhan, V., Rungta, T., Goyal, K., & Singh, M. P. (2019). Designing a multi-epitope based vaccine to combat Kaposi Sarcoma utilizing immunoinformatics approach. Scientific reports, 9(1), 1-15.
Chukwudozie O. S., Gray C. M., Fagbayi T. A., Chukwuanukwu R. C., Oyebanji V. O., Bankole T. T., Richard A. Adewole, Eze M. Daniel (2021). Immuno-informatics design of a multimeric epitope peptide based vaccine targeting SARS-CoV-2 spike glycoprotein. PLoS ONE 16(3): e0248061. https://doi.org/10.1371/journal.pone.0248061.
Damfo, S.
A., Reche, P., Gatherer, D., & Flower, D. R. (2017). In silico design of knowledge-based Plasmodium falciparum epitope ensemble vaccines. Journal of Molecular Graphics and Modelling, 78, 195–205.
Dash, R., Das, R., Junaid, M., Akash, M. F. C., Islam, A., & Hosen, S. Z. (2017). In silico-based vaccine design against Ebola virus glycoprotein. Advances and applications in bioinformatics and chemistry: AABC, 10, 11.
Dash, S., & Abraham, A. (2018, December). Kernel-based chaotic firefly algorithm for diagnosing Parkinson’s disease. In International Conference on Hybrid Intelligent Systems (pp. 176-188). Springer, Cham.
Dash, S., Abraham, A., Luhach, A. K., Mizera-Pietraszko, J., & Rodrigues, J. J. (2020). Hybrid chaotic firefly decision-making model for Parkinson’s disease diagnosis. International Journal of Distributed Sensor Networks, 16(1), 1550147719895210.
Dash, S., Ahmad, M., & Iqbal, T. (2021). Mobile cloud computing: a green perspective. In Intelligent Systems, vol. 185, pp. 523-533. Springer, Singapore. http://doi.org/10.1007/978-981-33-6081-5-46.
Dash, S., Thulasiram, R., & Thulasiraman, P. (2017, December). An enhanced chaos-based firefly model for Parkinson’s disease diagnosis and classification. In 2017 International Conference on Information Technology (ICIT) (pp. 159-164). IEEE.
Dash, S., Thulasiram, R., & Thulasiraman, P. (2019). Modified firefly algorithm with chaos theory for feature selection: A predictive model for medical data. International Journal of Swarm Intelligence Research (IJSIR), 10(2), 1-20.
De Groot, A. S., Jesdale, B., Martin, W., Saint Aubin, C., Sbai, H., Bosma, A., Lieberman, J., Skowron, G., Mansourati, F., & Mayer, K. H. (2003). Mapping cross-clade HIV-1 vaccine epitopes using a bioinformatics approach. Vaccine, 21(27-30), 4486-4504.
De Groot, A. S., McMurry, J., Marcon, L., Franco, J., Rivera, D., Kutzler, M., Weiner, D., & Martin, B. (2005). Developing an epitope-driven tuberculosis (TB) vaccine. Vaccine, 23(17-18), 2121-2131.
de Oliveira, L. M. F., Morale, M. G., Chaves, A. A. M., Cavalher, A. M., Lopes, A. S., Diniz, M. D. O., Schanoski, A. S., Melo, R. L. De., Ferreira, L. C. de S., Oliveira, M. L. S. de, Demasi, M., & Ho, P. L. (2015). Design, immune responses and anti-tumor potential of an HPV16 E6E7 multi-epitope vaccine. PloS one, 10(9), e0138686.
Delfani, S., Fooladi, A. A. I., Mobarez, A. M., Emaneini, M., Amani, J., & Sedighian, H. (2015). In silico analysis for identifying potential vaccine candidates against Staphylococcus aureus. Clinical and experimental vaccine research, 4(1), 99-106.
Eickhoff, C. S., Terry, F. E., Peng, L., Meza, K. A., Sakala, I. G., Van Aartsen, D., Moise, L., Martin, W. D., Schriewer, J., Buller, R. M., Groot, A. S. De., & Hoft, D. F. (2019). Highly conserved influenza T cell epitopes induce broadly protective immunity. Vaccine, 37(36), 5371-5381.
Elaheh Sabbaghian, Fatemeh Roodbari, Alireza Rafiei, Jafar Amani (2014). In Silico Design of a Multimeric Polytope as a Highly Immunogenic DNA Vaccine Against Human Cytomegalovirus.
Journal of Applied Biotechnology Reports. Volume 1, Issue 4, Autumn 2014; 143-153.
Fadare, O. A., Omisore, N. O., Adegbite, O. B., Awofisayo, O. A., Ogundolie, F. A., Adesanwo, J. K., & Obafemi, C. A. (2021). Structure based design, stability study and synthesis of the dinitrophenylhydrazone derivative of the oxidation product of lanosterol as a potential P. falciparum transketolase inhibitor and in-vivo antimalarial study. In Silico Pharmacology, 9(1), 1-16.
Fahad Humayun, Yutong Cai, Abbas Khan, Syed Ali Farhan, Fatima Khan, Usman Ishrat Rana, Anum binte Qamar, Nasim Fawad, Shazia Shamas, Dongqing-Wei (2022). Structure-guided design of multi-epitopes vaccine against variants of concern (VOCs) of SARS-CoV-2 and validation through In silico cloning and immune simulations. Computers in Biology and Medicine 140 (2022) 105122.
Fahimi, H., Sadeghizadeh, M., & Mohammadipour, M. (2016). In silico analysis of an envelope domain III-based multivalent fusion protein as a potential dengue vaccine candidate. Clinical and Experimental Vaccine Research, 5(1), 41-49.
Farhadi, T., Nezafat, N., Ghasemi, Y., Karimi, Z., Hemmati, S., & Erfani, N. (2015). Designing of complex multi-epitope peptide vaccine based on omps of Klebsiella pneumoniae: an in silico approach. International Journal of Peptide Research and Therapeutics, 21(3), 325-341.
Fathollahi, M., Fathollahi, A., Motamedi, H., Moradi, J., Alvandi, A., and Abiri, R. (2021). In silico vaccine design and epitope mapping of New Delhi metallo-beta-lactamase (NDM): an immunoinformatics approach. BMC bioinformatics, 22(1), 1-24.
Fatoba, A. J., Adeleke, V. T., Maharaj, L., Okpeku, M., Adeniyi, A. A., & Adeleke, M. A. (2021). Immunoinformatics Design of Multiepitope Vaccine Against Enterococcus faecium Infection. International Journal of Peptide Research and Therapeutics, 27(4), 2183-2198.
Gandon, S., Mackinnon, M., Nee, S. and Read, F. (2001). Imperfect vaccines and the evolution of pathogen virulence. Nature 414, 751–756. https://doi.org/10.1038/414751a.
Gomase, V. S., Chitlange, N. R., Changbhale, S. S., Sherkhane, A. S., & Kale, K. V. (2012). In-Silico Approach for Prediction of Vaccine Potential Antigenic Peptides from 23kDa Transmembrane Antigen Protein of Schistosoma haematobium. International Journal of Bioinformatics Research, 4(3), 276.
Graham, B. (2013). Advances in antiviral vaccine development. Immunol. Rev. 255, 230–242. https://doi.org/10.1111/imr.12098.
Greenwood, B. (2014). The contribution of vaccination to global health: past, present, and future. Philos Trans R Soc Lond B Biol Sci. 369:20130433. doi: 10.1098/rstb.2013.0433.
Guan, S., Han, X., Li, Z., Xu, X., Cui, Y., Chen, Z., Zhang, S., Chen, S., Shan, Y., Wang, S. and Li, H. (2022). Exploration of the Interactions between Maltase–Glucoamylase and Its Potential Peptide Inhibitors by Molecular Dynamics Simulation. Catalysts, 12(5), p. 522.
Hajizade, Abbas, Ebrahimi, Firouz, Amani, Jafar, Arpanaei, Ayoob, & Salmanian, Ali, Hatef, “Design and in silico analysis of pentavalent chimeric antigen against three enteropathogenic bacteria: enterotoxigenic E. coli, enterohemorragic E. coli and Shigella.” Bioscience Biotechnology Research Communications 9.2 (2016): 229-243.
Hammadul Hoque, Rahatul Islam, Srijon Ghosh, Md. Mashiur Rahaman, Nurnabi Azad Jewel, Md.
Abunasar Miah (2021). Implementation of in silico methods to predict common epitopes for vaccine development against Chikungunya and Mayaro viruses. Heliyon 7 (2021) e06396.
Harish Babu Kolla, Chakradhar Tirumalasetty, Krupanidhi Sreerama and Vijaya Sai Ayyagari (2021). An immunoinformatics approach for the design of a multi-epitope vaccine targeting super antigen TSST-1 of Staphylococcus aureus. Journal of Genetic Engineering and Biotechnology 19:69. https://doi.org/10.1186/s43141-021-00160-z.
Hasan, M., Ghosh, P. P., Azim, K. F., Mukta, S., Abir, R. A., Nahar, J., Khan, M. M. (2019). Reverse vaccinology approach to design a novel multi-epitope subunit vaccine against avian influenza A (H7N9) virus. Microbial Pathogenesis. 130: 19-37. https://doi.org/10.1016/j.micpath.2019.02.023.
Ikram, A., Zaheer, T., Awan, F. M., Obaid, A., Naz, A., Hanif, R., Paracha, R. Z., Ali, A., Naveed, A. K., & Janjua, H. A. (2018). Exploring NS3/4A, NS5A and NS5B proteins to design conserved subunit multi-epitope vaccine against HCV utilizing immunoinformatics approaches. Scientific reports, 8(1), 1-14.
Jahangiri, A., Rasooli, I., Gargari, S. L. M., Owlia, P., Rahbar, M. R., Amani, J., & Khalili, S. (2011). An in silico DNA vaccine against Listeria monocytogenes. Vaccine, 29(40), 6948–6958. doi:10.1016/j.vaccine.2011.07.04.
In silico Approaches to Vaccine Design

75

James, E. R., Van Zyl, W. H., Van Zyl, P. J., & Görgens, J. F. (2012). Recombinant hepatitis B surface antigen production in Aspergillus niger: evaluating the strategy of gene fusion to native glucoamylase. Applied microbiology and biotechnology, 96(2), 385394. Janahi, E. M., Dhasmana, A., Srivastava, V., Sarangi, A. N., Raza, S., Arif, J. M., Bramha Bhatt, M. L., Lohani, M., Areeshi, M. Y., Saxena, A. M., & Haque, S. (2017). In silico CD4+, CD8+ T-cell and B-cell immunity associated immunogenic epitope prediction and HLA distribution analysis of Zika virus. EXCLI journal, 16, 63. Janse, M., Brouwers, T., Claassen, E., Hermans, P., & van de Burgwal, L. (2021). Barriers Influencing Vaccine Development Timelines, Identification, Causal Analysis, and Prioritization of Key Barriers by KOLs in General and Covid-19 Vaccine R&D. Frontiers in public health, 9, 612541. https://doi.org/10.3389/fpubh.2021.612541. Jarząb, A., Skowicki, M., & Witkowska, D. (2013). Szczepionki podjednostkowe– antygeny, nośniki, metody koniugacji i rola adiuwantów [Subunit vaccines - antigens, carriers, conjugation methods and the role of adjuvants]. Advances in Hygiene & Experimental Medicine/Postepy Higieny i Medycyny Doswiadczalnej, 67. Josefsberg, J. O., & Buckland, B. (2012). Vaccine process technology. Biotechnology and bioengineering, 109(6), 1443-1460. Kardani, K., Bolhassani, A., & Namvar, A. (2020). An overview of in silico vaccine design against different pathogens and cancer. Expert Review of Vaccines, 19(8), 699-726. Kardani, K., Hashemi, A., & Bolhassani, A. (2019). Comparison of HIV-1 Vif and Vpu accessory proteins for delivery of polyepitope constructs harboring Nef, Gp160 and P24 using various cell penetrating peptides. Plos one, 14(10), e0223844. Kardani, K., Hashemi, A., & Bolhassani, A. (2020). Comparative analysis of two HIV-1 multiepitope polypeptides for stimulation of immune responses in BALB/c mice. Molecular Immunology, 119, 106-122. Kathwate, G. H. (2022). 
In silico Design and Characterization of Multi-epitopes Vaccine for SARS-CoV2 from Its Spike Protein. International Journal of Peptide Research and Therapeutics, 28(1), 1-15. Kaur, R., Arora, N., Jamakhani, M. A., Malik, S., Kumar, P., Anjum, F., Tripathy, S., Mishra, A., & Prasad, A. (2020). Development of multi-epitope chimeric vaccine against Taenia solium by exploring its proteome: an in silico approach. Expert review of vaccines, 19(1), 105-114. Khan Md Anik Ashfaq, Jenifar Quaiyum Ami, Khaledul Faisal, Rajashree Chowdhury, Prakash Ghosh, Faria Hossain, Ahmed Abd El Wahed and Dinesh Mondal (2020) An immunoinformatic approach driven by experimental proteomics: in silico design of a subunit candidate vaccine targeting secretory proteins of Leishmania donovani amastigotes. Parasites Vectors (2020) 13:196. https://doi.org/10.1186/s13071-02004064-8. Khan, M. A., Hossain, M. U., Rakib‐Uz‐Zaman, S. M., & Morshed, M. N. (2015). Epitope‐ based peptide vaccine design and target site depiction against Ebola viruses: an immunoinformatics study. Scandinavian journal of immunology, 82(1), 25-34. Khatoon, N., Pandey, R. K., & Prajapati, V. K. (2017). Exploring Leishmania secretory proteins to design B and T cell multi-epitope subunit vaccine using immunoinformatics approach. Scientific reports, 7(1), 1-12.

76

C. Oluwaseun Adetunji, F. Abimbola Ogundolie, O. Tope Olaniyan et al.

Mahdavi, M., & Moreau, V. (2016). In silico designing breast cancer peptide vaccine for binding to MHC class I and II: a molecular docking study. Computational Biology and Chemistry, 65, 110-116. Mandana Behbahani, Mohammad Moradi, Hassan Mohabatkar (2021) In silico design of a multi‑epitope peptide construct as a potential vaccine candidate for Influenza A based on neuraminidase protein. In Silico Pharmacology (2021) 9:36. https://doi.org/ 10.1007/s40203-021-00095-w. Marana, M. H., Jørgensen, L. V. G., Skov, J., Chettri, J. K., Holm Mattsson, A., Dalsgaard, I., Kania, P. W. and Buchmann, K., (2017). Subunit vaccine candidates against Aeromonas salmonicida in rainbow trout Oncorhynchus mykiss. PLoS One, 12(2), e0171944. Martinelli Dominic D. (2022), In silico vaccine design: A tutorial in immunoinformatics, Healthcare Analytics (2022), doi: https://doi.org/10.1016/j.health.2022.100044. Matin Fathollahi, Anwar Fathollahi, Hamid Motamedi, Jale Moradi, Amirhooshang Alvandi, and Ramin Abiri (2021) In silico vaccine design and epitope mapping of New Delhi metallo‑beta‑lactamase (NDM): an immunoinformatics approach. BMC Bioinformatics (2021) 22:458. https://doi.org/10.1186/s12859-021-04378-z. Meza, B., Ascencio, F., Sierra-Beltrán, A. P., Torres, J., & Angulo, C. (2017). A novel design of a multi-antigenic, multistage and multi-epitope vaccine against Helicobacter pylori: an in silico approach. Infection, Genetics and Evolution, 49, 309-317. Meza, B., Ascencio, F., Sierra-Beltrán, A. P., Torres, J., & Angulo, C. (2017). A novel design of a multi-antigenic, multistage and multi-epitope vaccine against Helicobacter pylori: An in silico approach. Infection, Genetics and Evolution, 49, 309–317. doi:10.1016/j.meegid.2017.02.007. Moise, L., Gutierrez, A., Kibria, F., Martin, R., Tassone, R., Liu, R., Terry, F., Martin, B., & De Groot, A. S. (2015). iVAX: An integrated toolkit for the selection and optimization of antigens and the design of epitope-driven vaccines. 
Human vaccines & immunotherapeutics, 11(9), 2312-2321. Muhammad Ihsan Muttaqin, Filia Stephanie, Mutiara Saragih and Usman Sumo Friend Tambunan (2021) Epitope-Based Vaccine Design for Tuberculosis HIV Infection Through in silico Approach. Pakistan Journal of Biological Sciences. 24. 765-772. Muhammad Tahir ul Qamar, Zeeshan Shokat, Iqra Muneer, Usman Ali Ashfaq, Hamna Javed, Farooq Anwar, Amna Bari, Barira Zahid and Nazamid Saari (2020) Multiepitope-Based Subunit Vaccine Design and Evaluation against Respiratory Syncytial Virus Using Reverse Vaccinology Approach. Vaccines 2020, 8, 288; 1-27. doi:10.3390/vaccines8020288. Munikumar, M., Priyadarshini, I. V., Pradhan, D., Umamaheswari, A., & Vengamma, B. (2013). Computational approaches to identify common subunit vaccine candidates against bacterial meningitis. Interdisciplinary Sciences: Computational Life Sciences, 5(2), 155-164. Nabel G. J. Designing tomorrow’s vaccines. N Engl J Med. (2013) 368:551–60. 10.1056/NEJMra1204186. Nakayasu, E. S., Sobreira, T. J., Torres, R., Ganiko, L., Oliveira, P. S., Marques, A. F., & Almeida, I. C. (2012). Improved proteomic approach for the discovery of potential vaccine targets in Trypanosoma cruzi. Journal of proteome research, 11(1), 237-246.

In silico Approaches to Vaccine Design

77

Namvar, A., Panahi, H. A., Agi, E., & Bolhassani, A. (2020). Development of HPV16, 18, 31, 45 E5 and E7 peptides-based vaccines predicted by immunoinformatics tools. Biotechnology Letters, 42(3), 403-418. Negahdaripour, M., Eslami, M., Nezafat, N., Hajighahramani, N., Ghoshoon, M. B., Shoolian, E., & Ghasemi, Y. (2017). A novel HPV prophylactic peptide vaccine, designed by immunoinformatics and structural vaccinology approaches. Infection, Genetics and Evolution, 54, 402-416. Negahdaripour, M., Eslami, M., Nezafat, N., Hajighahramani, N., Ghoshoon, M. B., Shoolian, E., Dehshahri, A., Erfani, N., Morowvat, M. H., & Ghasemi, Y. (2017). A novel HPV prophylactic peptide vaccine, designed by immunoinformatics and structural vaccinology approaches. Infection, Genetics and Evolution, 54, 402-416. Oany, A. R., Emran, A. A., & Jyoti, T. P. (2014). Design of an epitope-based peptide vaccine against spike protein of human coronavirus: an in silico approach. Drug design, development and therapy, 8, 1139. Oany, A. R., Emran, A. A., & Jyoti, T. P. (2014). Design of an epitope-based peptide vaccine against spike protein of human coronavirus: an in silico approach. Drug design, development and therapy, 8, 1139–1149. https://doi.org/10.2147/DDDT. S67861 Ogundolie, F. A. (2015). Characterization of a Purified β–Amylase from Black Marble Vine (Dioclea reflexa) Seeds (Doctoral dissertation, Federal University of Technology, Akure). Ogundolie, F. A. (2021). Cloning of α-AMYLASE and Pullulanase Genes of Bacillus licheniformis-FAO. CP7 from cocoa (Theobroma cacao L.) Pods and Biochemical Characterization of the Expressed Enzymes (Doctoral dissertation, Federal University of Technology, Akure). Ogundolie, F. A., Ayodeji, A. O., Olajuyigbe, F. M., Kolawole, A. O., & Ajele, J. O. (2022). Biochemical Insights into the functionality of a novel thermostable β-amylase from Dioclea reflexa. Biocatalysis and Agricultural Biotechnology, 42, 102361. 
Olaniyan Olugbemi T., Adetunji Charles O., Adeniyi Mayowa J., Hefft Daniel Ingo. 2022a. Machine Learning Techniques for High-Performance Computing for IoT Applications in Healthcare. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics doi: 10.1201/9780367548445-20 Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445. Olaniyan Olugbemi T., Adetunji Charles O., Adeniyi Mayowa J., Hefft Daniel Ingo. In. Computational Intelligence in IoT Healthcare. 2022 b. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics. doi: 10.1201/9780367548445-19. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445. Omotayo Opemipo Oyedara, Folasade Muibat Adeyemi, Charles Oluwaseun Adetunji, Temidayo Oluyomi Elufisan. 2022. Repositioning Antiviral Drugs as a Rapid and Cost-Effective Approach to Discover Treatment against SARS-CoV-2 Infection. doi: 10.1201/9781003178903-10. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903.

78

C. Oluwaseun Adetunji, F. Abimbola Ogundolie, O. Tope Olaniyan et al.

Onyeka S. Chukwudozie, Vincent C. Duru, Charlotte C. Ndiribe, Abdullahi T. Aborode, Victor O. Oyebanji and Benjamin O. Emikpe (2021) The Relevance of Bioinformatics Applications in the Discovery of Vaccine Candidates and Potential Drugs for COVID19 Treatment. Bioinformatics and Biology Insights Volume 15: 1–8. DOI: 10.1177/11779322211002168. Pandey, R. K., Ali, M., Ojha, R., Bhatt, T. K., & Prajapati, V. K. (2018). Development of multi-epitope driven subunit vaccine in secretory and membrane protein of Plasmodium falciparum to convey protection against malaria infection. Vaccine, 36(30), 4555-4565. Patronov A., and Doytchinova I. (2013) T-cell epitope vaccine design by immunoinformatics. Open Biol 3: 120139. http://dx.doi.org/10.1098/rsob.120139. Poland GA, Ovsyannikova IG, Jacobson RM. 2009Application of pharmacogenomics to vaccines. Pharmacogenomics.;10(5):837–852. Pourseif, M. M., Yousefpour, M., Aminianfar, M., Moghaddam, G., & Nematollahi, A. (2019). A multi-method and structure-based in silico vaccine designing against Echinococcus granulosus through investigating enolase protein. BioImpacts: BI, 9(3), 131. Rashidian, E., Gandabeh, Z. S., Forouharmehr, A., Nazifi, N., Shams, N., & Jaydari, A. (2020). Immunoinformatics approach to engineer a potent poly-epitope fusion protein vaccine against Coxiella burnetii. International Journal of Peptide Research and Therapeutics, 26(4), 2191-2201. Ribas‐Aparicio Rosa María, Castelán‐Vega Juan Arturo, Jiménez‐Alberto Alicia, Monterrubio‐López Gloria Paulina and Aparicio‐Ozores Gerardo (2017). The Impact of Bioinformatics on Vaccine Design and Development. INTECH. Chapter 7. 124-146. http://dx.doi.org/10.5772/intechopen.69273. Saad Ahmed Sami, Kay Kay Shain Marma, Shafi Mahmud, Md. Asif Nadim Khan, Sarah Albogami, Ahmed M. El-Shehawi, Ahmed Rakib, Agnila Chakraborty, Mostafah Mohiuddin, Kuldeep Dhama, Mir Muhammad Nasir Uddin, Mohammed Kamrul Hossain, Trina Ekawati Tallei, and Talha Bin Emran (2021). 
Designing of a MultiEpitope Vaccine against the Structural Proteins of Marburg Virus Exploiting the Immunoinformatics Approach. ACS Omega 2021, 6, 32043−32071. Samad, A., Meghla, N. S., Nain, Z., Karpiński, T. M., & Rahman, M. (2022). Immune epitopes identification and designing of a multi-epitope vaccine against bovine leukemia virus: a molecular dynamics and immune simulation approaches. Cancer Immunology, Immunotherapy, 1-14. Sandeep Kumar Dhanda, Salman Sadullah Usmani, Piyush Agrawal, Gandharva Nagpal, Ankur Gautam and Gajendra P. S. Raghava (2017) Novel in silico tools for designing peptide-based subunit vaccines and immunotherapeutics. Briefings in Bioinformatics, 18(3), 2017, 467–478. doi: 10.1093/bib/bbw025. Scholzen, A., Richard, G., Moise, L., Baeten, L. A., Reeves, P. M., Martin, W. D., Brauns, T. A., Boyle, C. M., Paul, S. R., Bucala, R., Bowen, R. A., Garritsen, A., Groot, A. S. De., Sluder, E. A., & Poznansky, M. C. (2019). Promiscuous Coxiella burnetii CD4 epitope clusters associated with human recall responses are candidates for a novel Tcell targeted multi-epitope Q fever vaccine. Frontiers in immunology, 10, 207.

In silico Approaches to Vaccine Design

79

Seyed, N., Taheri, T., Vauchy, C., Dosset, M., Godet, Y., Eslamifar, A., I., Sharifi, I., Adotevi, O., Borg, C., Rohrlich, P. S., & Rafati, S. (2014). Immunogenicity evaluation of a rationally designed polytope construct encoding HLA-A* 0201 restricted epitopes derived from Leishmania major related proteins in HLA-A2/DR1 transgenic mice: steps toward polytope vaccine. PLoS One, 9(10), e108848. Shah, P., Mistry, J., Reche, P. A., Gatherer, D., & Flower, D. R. (2018). In silico design of Mycobacterium tuberculosis epitope ensemble vaccines. Molecular Immunology, 97, 56-62. Sharmin, R., & Islam, A. B. M. M. K. (2014). A highly conserved WDYPKCDRA epitope in the RNA directed RNA polymerase of human coronaviruses can be used as epitopebased universal vaccine design. BMC bioinformatics, 15(1), 1-10. Smrithi Radhakrishnan, M. Elizabeth Sobhia (2021). An in silico Approach for Epitope Based Vaccine Design Against Ebola virus. Arch Microbiol Immunology 2021; 5 (1): 182-206 10.26502/ami.93650057. Soltan M. A., Behairy M. Y., Abdelkader M. S., Albogami S., Fayad E., Eid R. A., Darwish K. M., Elhady S. S., Lotfy A. M. and Alaa Eldeen M. (2022) In silico Designing of an Epitope-Based. Somayeh Delfani, Abbas Ali Imani Fooladi, Ashraf Mohabati Mobarez, Mohammad Emaneini, Jafar Amani, Hamid Sedighian (2016) In silico analysis for identifying potential vaccine candidates against Staphylococcus aureus. Clinical and Experimental Vaccine Research. 4. 99-106. http://dx.doi.org/10.7774/cevr.2015. 4.1.99. Subramaniyan, V., Venkatachalam, R., Srinivasan, P., & Palani, M. (2018). In silico prediction of monovalent and chimeric tetravalent vaccines for prevention and treatment of dengue fever. Journal of Biomedical Research, 32(3), 222. Sunita, Sajid, A., Singh, Y., & Shukla, P. (2020). Computational tools for modern vaccine development. Human Vaccines & Immunotherapeutics, 16(3), 723-735. 
Syed Aun Muhammad, Sidra Zafar, Samana Zahra Rizvi, Imran Imran, Fahad Munir, Muhammad Babar Jamshed, Amjad Ali, Xiaogang Wu, Numan Shahid, Muhammad Zaeem, and Qiyu Zhang (2020). Experimental analysis of T cell epitopes for designing liver cancer vaccine predicted by system-level immunoinformatics approach. Am J Physiol Gastrointest Liver Physiol 318: G1055–G1069, 2020. doi:10.1152/ajpgi. 00068.2020. Tarang, S., Kesherwani, V., LaTendresse, B., Lindgren, L., Rocha-Sanchez, S. M., & Weston, M. D. (2020). In silico design of a multivalent vaccine against Candida albicans. Scientific Reports, 10(1), 1-7. Thakur, R., & Shankar, J. (2016). In silico identification of potential peptides or allergen shot candidates against Aspergillus fumigatus. BioResearch open access, 5(1), 330341. Thakur, R., & Shankar, J. (2016). In silico identification of potential peptides or allergen shot candidates against Aspergillus fumigatus. BioResearch open access, 5(1), 330341. Umamaheswari, A., Pradhan, D., and Hemanthkumar, M. (2012). Computer aided subunit vaccine design against pathogenic Leptospira serovars. Interdisciplinary Sciences: Computational Life Sciences, 4(1), 38-45.

80

C. Oluwaseun Adetunji, F. Abimbola Ogundolie, O. Tope Olaniyan et al.

Utpal Kumar Adhikari, Mourad Tayebi, and M. Mizanur Rahman (2018) Immunoinformatics Approach for Epitope-Based Peptide Vaccine Design and Active Site Prediction against Polyprotein of Emerging Oropouche Virus. Hindawi Journal of Immunology Research Volume 2018, Article ID 6718083, 22 pages. https://doi.org/10.1155/2018/6718083. Vaccine Against Common E. coli Pathotypes. Front. Med. 9:829467. doi: 10.3389/ fmed.2022.829467. Victor Hugo Urrutia-Baca, Ricardo Gomez-Flores, Myriam Ange´Lica De La GarzaRamos, Patricia Tamez-Guerra, Daniela Guadalupe Lucio-Sauceda, and Mari´A Cristina Rodri´Guez-Padilla (2019) Immunoinformatics Approach to Design a Novel Epitope-Based Oral Vaccine Against Helicobacter pylori. Journal of Computational Biology. Volume 26, Number 10, Pp. 1177–1190. doi: 10.1089/cmb.2019.0062. Yang, Z., Bogdan, P. & Nazarian, S. An in silico deep learning approach to multi-epitope vaccine design: a SARS-CoV-2 case study. Sci Rep 11, 3238 (2021). https://doi.org/ 10.1038/s41598-021-81749-9 Yang, Z., Bogdan, P., & Nazarian, S. (2021). An in silico deep learning approach to multiepitope vaccine design: a SARS-CoV-2 case study. Scientific reports, 11(1), 1-21. Zeinalzadeh, N., Salmanian, A. H., Ahangari, G., Sadeghi, M., Amani, J., Bathaie, S. Z., & Jafari, M. (2014). Design and characterization of a chimeric multiepitope construct containing C fa B, heat‐stable toxoid, C ss A, C ss B, and heat‐labile toxin subunit B of enterotoxigenic E scherichia coli: a bioinformatic approach. Biotechnology and applied biochemistry, 61(5), 517-527. Zheng, J., Lin, X., Wang, X., Zheng, L., Lan, S., Jin, S., & Wu, J. (2017). In silico analysis of epitope-based vaccine candidates against hepatitis B virus polymerase protein. Viruses, 9(5), 112. Zhou, W. Y., Shi, Y., Wu, C., Zhang, W. J., Mao, X. H., Guo, G., & Zou, Q. M. (2009). Therapeutic efficacy of a multi-epitope vaccine against Helicobacter pylori infection in BALB/c mice model. Vaccine, 27(36), 5013-5019. 
Zikun Yang, Paul Bogdan and Shahin Nazarian (2021). In silico deep learning approach to multi‑epitope vaccine design: a SARS‑CoV‑2 case study. Scientific Reports. (2021) 11:3238 | https://doi.org/10.1038/s41598-021-81749-9.

Chapter 4

Evolution of Genomic Medicine

Sujata Mohanty*, PhD and Kopal Singhal, PhD
Department of Biotechnology, Jaypee Institute of Information Technology, Uttar Pradesh, India

Abstract

Prior to the advent of sequencing technology, understanding genetic variation at a macro level would have taken a century of research, but the completion of the Human Genome Project in 2003, together with revolutionary changes in technology, shortened the time required to achieve this dream. As the saying goes, prevention is better than cure: the field of genomic medicine was born to identify the variations, polymorphisms, and phenotypes associated with the majority of diagnosed and undiagnosed diseases. The idea was to understand the disease process and the associated signalling pathways and molecules that could be used as biomarkers for detecting disease. Such early detection could enable preventive therapies and could improve patients' quality of life even where the disease cannot be completely cured. However, to implement this therapy, the events leading to the creation of the field need to be understood in order to gain insight into the evolution of genomic medicine. In the present chapter, we identify the significant technological advances in genomics and bioinformatics that led to the development of genomic medicine and discuss the probable challenges and opportunities of this emerging field.

Keywords: human genome projects, bioinformatics, evolution, genomic medicine 

Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9 © 2023 Nova Science Publishers, Inc.


Introduction

The concept of genomic medicine took a century of research and methodological advancement to materialise. The early days of medicine in India mostly followed the traditional methods of Ayurveda and the surgical procedures laid down in the Sushruta Samhita by Sushruta, the “Father of Surgery” (Tewari and Shukla, 2005; Singh, 2017). Other forms of ancient medicine, such as Unani, Traditional Chinese Medicine (TCM), Kampo, and Traditional Korean Medicine (TKM), were practised throughout the world and were based on natural medicinal herbs (Wachtel-Galor and Benzie, 2011; Yuan et al., 2016). As early humans evolved into modern humans and civilisations developed technology, medicine also evolved, from traditional to pharmaceutical to genomic. The first drug with morphine as its active compound was produced by a German pharmacist in 1805 (Yuan et al., 2016). Traditional medicine based on natural products was practised for a long time, and owing to this long experience, parameters such as dosage, efficacy, the active compound of the plant, and its side effects were already established and formed the basis of modern medicine based on synthetic chemicals (Patwardhan and Mashelkar, 2009). The active compounds of many modern drugs used to treat deadly diseases were discovered in the era of traditional medicine; for example, the well-known anti-malarial drug artemisinin traces back to ancient Chinese medicine, where its source plant was used to treat malaria (Yuan et al., 2016). However, the lack of regulatory approval and the longer treatment times made it difficult for the evolving Homo sapiens to settle for traditional methods of healthcare. Later, as technological advances surfaced, chemically synthesised drugs came to dominate the market and various pharmaceutical companies were set up.
These chemically synthesised drugs were made using the active compound from traditional medicine along with excipients to enhance efficacy, resulting in faster recovery (Pan et al., 2013). The pharmaceutical industry boomed during the late 19th and early 20th centuries. Several regulatory bodies were also formed to keep a check on the quality, active compound, and effects of drugs on the human body (Ishiguro et al., 2013). These regulations meant years of research involving various model organisms, followed by clinical trials on human patients to check the efficacy and effects of the molecule. Only after fulfilling all the criteria could a drug be marketed. These years of research consumed a great deal of resources, including manpower and patenting costs, thereby raising the cost of the drug and putting it out of reach of the poor. Another drawback of such medication was the development of resistance as a result of frequent intake of drugs such as antibiotics, which makes the microorganism resistant to the effect of the drug (Zaman et al., 2017; Naylor et al., 2018; Singer et al., 2016).
The advent of sequencing technology and the completion of the Human Genome Project (HGP) in the early 21st century paved the way for the era of genomic and personalised medicine (Venter et al., 2001; Chan and Ginsburg, 2011; Mathur and Sutton, 2017). With the “book of life” in hand, scientists began to wonder whether they could link this genomic information with medicine, making personalised medicine for every individual and bypassing the problems of allergic reactions or conditional inefficacy of drugs. Various sequencing projects were completed to gather the genomic information of living organisms across the three domains of life (Sasaki and Burr, 2000; Venter et al., 2001; Stark et al., 2007). Existing experimental knowledge was used to identify the causative agents behind different diseases and, with the genome information in hand, the culprit gene could be pinpointed. This world is full of diversity based on the culture, geography, or ethnicity of human beings. One of the significant outcomes of the HGP was that human genomes share 99.9% similarity with each other and only 0.1% accounts for the variation we see around us. Yet this 0.1% holds much importance in studying why two individuals respond differently to the same compound. Population genetic studies also helped in identifying these differences. The problem with traditional or modern medicine lies in this difference. These medications were broad-ranged and did not consider the genetic differences between individuals, and were thus rendered ineffective in certain cases (Halliwell, 2004; Mathur and Sutton, 2017). Over time, as genomic information grew, scientists began wondering whether medication could be provided based on the genetic identity of the individual, and thus the field of personalised or genomic medicine was born.
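To give a sense of scale for the 0.1% figure, the short calculation below estimates how many positions differ between two human genomes. It is only a rough sketch: the genome size of about 3.1 billion base pairs is an assumption brought in for illustration and is not stated in this chapter, and the function name is hypothetical.

```python
# Rough arithmetic behind the "0.1% variation" statement.
# Assumption (not from the chapter): a haploid human genome of ~3.1 billion base pairs.
GENOME_SIZE_BP = 3_100_000_000
SIMILARITY = 0.999  # 99.9% similarity between two individuals, as quoted in the text


def expected_differences(genome_size_bp: int, similarity: float) -> int:
    """Approximate number of positions at which two genomes differ."""
    return round(genome_size_bp * (1.0 - similarity))


diff = expected_differences(GENOME_SIZE_BP, SIMILARITY)
print(f"Roughly {diff:,} positions differ between two genomes")
```

Under this assumption, a 0.1% difference still corresponds to a few million positions, which is why such a small fraction can account for different responses to the same compound.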
Environmental pressures create stress for living organisms, breaking down the inertia of the genome and driving its evolution. To adapt to changing surroundings, the genome undergoes changes that may or may not be beneficial for the organism; moreover, different individuals respond differently to the same stimulus, giving rise to variable phenotypes and genotypes. Some mutations result in a disease condition in one genome while they lie silent in others. The new approach of personalised medicine involved understanding the genomics behind diseases, i.e., linking mutations, polymorphisms, variations, haplotypes, etc. with diseases (Chan and Ginsburg, 2011). Various disease-specific databases were created to store disease-linked information and identify the genetic changes associated with disease conditions. A major condition of all these databases was that the information be freely available for public use. This resulted in the development of curated databases with up-to-date information. Further, the development of computational tools helped in modelling protein structures and carrying out simulation studies to understand the molecular mechanisms behind disease conditions. The field of bioinformatics developed, and various tools were created to carry out such simulations. The rapidly decreasing cost of sequencing projects, accompanied by new computational tools, helped in understanding disease etiology and identifying disease-related genes. Over time, not only genomic but also transcriptomic and proteomic information was generated to identify disease-linked mutations or genes (Figure 1). Various sequencing methods such as ChIP-seq and RNA-seq were developed to observe the expression of genes under diseased conditions and identify their roles (Schmidt et al., 2009; Wang et al., 2009).

Figure 1. Timeline of the major breakthroughs in the field of medicine and genomics.

The completion of the Human Genome Project (HGP) in 2003 opened the possibility of carrying out genome- and exome-scale studies to identify mutations and polymorphisms linked to disease conditions. This integration of genomics into pharma and diagnostics has identified genomic biomarkers and enabled their use in diagnosis, promoting early detection of disease. The field of personalised medicine is expected to substantially reduce treatment costs and the wastage of medical resources and can improve quality of life, with notable examples seen in malaria and Covid-19 (Visvikis-Siest et al., 2020). The mere thought of having our whole genome information, and the risk factors associated with it, is quite fascinating. However, like any big revolution in medicine, this field is not free of challenges, including the social, legal, and ethical issues raised by the amount of genomic information generated. With big dreams come bigger responsibilities to make them a success; for genomic medicine to reach the masses, society and administration need to go hand in hand to create awareness of the latest techniques among medical researchers. A cyber security cell focussed mainly on genomic data needs to be created and managed. The present chapter is a compilation of the history and growth of the era of genomic medicine and of how the fields of biotechnology and bioinformatics have proven to be a boon for medical researchers.

Methodological Advancement

DNA, the Genetic Material

The mid-20th century witnessed remarkable discoveries in biology, first the identification of DNA as the genetic material (Hershey and Chase, 1952) and later the decoding of its structure (Watson and Crick, 1953). Alfred Hershey and Martha Chase settled the long-running debate over whether DNA or protein is the genetic material. However, their discovery built upon a long series of experiments that established the existence of genetic material in living organisms. Mendel’s laws of inheritance were recognised in the early 20th century and laid the foundation of genetics by describing the units of heredity as “factors” (von Tschermak-Seysenegg, 1951). Later, Walter Sutton and Theodor Boveri independently proposed that these heritable factors are located on chromosomes, which was subsequently confirmed by Thomas Hunt Morgan using the fruit fly model (Sutton, 1903; Morgan and Cattell, 1912; Holland and Cleveland, 2009). The findings of these three scientists gave rise to the “chromosomal theory of inheritance.” Once chromosomes were known to be the basis of inheritance, the scientific community divided into two schools of thought: one believed proteins to be the genetic material, while the other favoured DNA. In 1928, the British bacteriologist Frederick Griffith described the process of transformation, and Avery, McCarty, and MacLeod later identified DNA as the transforming material (Griffith, 1928; Avery et al., 1944). However, it was Hershey and Chase’s experiment with bacteriophage that confirmed that “DNA is the genetic material.” With this new-found knowledge, scientists started exploring the structure of this biological entity. Erwin Chargaff, James Watson, Francis Crick, and Rosalind Franklin solved this mystery (Watson and Crick, 1953; Pollock et al., 1970). Biochemists such as Phoebus Levene had reported that DNA is made up of nucleotide subunits. Further, Erwin Chargaff provided the base complementarity rules known as “Chargaff’s rules.” James Watson, Francis Crick, and Rosalind Franklin later proposed the 3-D double-helical structure of DNA, which was followed by the cracking of the genetic code by the eminent scientists Marshall Nirenberg and Har Gobind Khorana (Nirenberg, 1963; Pollock et al., 1970). All these developments laid the foundations of molecular biology and marked the beginning of the era of genomics.
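The base complementarity described above can be illustrated with a few lines of Python. This is a minimal sketch rather than anything taken from the cited works: the example strand is made up, and the helper names `reverse_complement` and `duplex_base_counts` are hypothetical. It builds the complementary strand under Watson-Crick pairing (A with T, G with C) and checks that, in the resulting double-stranded molecule, the amount of A equals that of T and the amount of G equals that of C, which is the pattern Chargaff observed.

```python
from collections import Counter

# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def duplex_base_counts(strand: str) -> Counter:
    """Count the bases over both strands of the double helix."""
    return Counter(strand + reverse_complement(strand))


strand = "ATGCGGCTAAGC"  # made-up example sequence
counts = duplex_base_counts(strand)
print(reverse_complement(strand))
print(dict(counts))  # A matches T and G matches C across the duplex
```

The equalities hold for any input strand here, because every base on one strand is matched by its partner on the other.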

High Throughput Sequencing Technology

Pre-Genomic Era

With the knowledge of DNA and its basic structure in hand, researchers began to ask what the sequence of this DNA is in different organisms. Knowledge of sequence features could be linked to a particular disease haplotype and would thus be revolutionary in the field of medical biotechnology. However, sequencing nucleic acids posed structural and biochemical difficulties compared with sequencing proteins. Thus, the first biomolecule to be sequenced was a protein, insulin, around 1955 by Frederick Sanger, followed by alanine tRNA by Holley and collaborators (Stretton, 2002; Holley et al., 1965).

First Generation Sequencing

The first DNA sequencing techniques were developed by two independent groups of researchers. The first method was developed by Allan Maxam and Walter Gilbert in 1976; it came to be known as Maxam-Gilbert sequencing and was based on the chemical properties of the nucleic acid bases. The second method, and the most widely used to date, is Sanger sequencing, which is based on the dideoxy chain termination method. Sanger sequencing became more popular than Maxam-Gilbert sequencing because the latter involved radioactive labelling together with hazardous chemicals such as the neurotoxin hydrazine (Heather and Chain, 2016). Sanger sequencing, on the other hand, underwent several changes on its way to becoming an automated method, including the use of fluorometric detection and capillary-based electrophoresis. The first automated sequencer, the ABI 370A, was introduced by Applied Biosystems Inc. in 1986, building on the work of Leroy Hood and colleagues (Heather and Chain, 2016). However, the read lengths of first-generation sequencers were limited, and sequencing long stretches of DNA took a long time. This gave rise to the method of shotgun sequencing, in which the DNA is broken into many short overlapping fragments that are sequenced separately and then assembled computationally. It was used to sequence the first whole genome of a free-living organism, Haemophilus influenzae (1.83 Mb), an effort that took about a year to complete (Fleischmann et al., 1995). J. Craig Venter founded The Institute for Genomic Research (TIGR), where the TIGR Assembler software was developed to assemble the whole genome of H. influenzae (Sutton et al., 1995).
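The assembly step of shotgun sequencing can be illustrated with a toy example. The Python sketch below is not the TIGR Assembler; it is a minimal greedy overlap-merge that assumes error-free reads from a single strand and a genome without repeats. The function names (overlap, greedy_assemble), the min_len parameter and the example reads are all chosen here for illustration only.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more; keep separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three hypothetical overlapping reads from the toy genome AGCTTAGGCATCCGA
reads = ["CATCCGA", "AGCTTAGG", "TAGGCATC"]
print(greedy_assemble(reads))  # ['AGCTTAGGCATCCGA']
```

Real assemblers must additionally cope with sequencing errors, reads from both strands and repeated regions, which is why a simple greedy merge of this kind is only a teaching aid.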

Second Generation Sequencing

Next generation sequencing (NGS) technology marked the beginning of a second generation of sequencing that overcame the limitations of first-generation sequencing (Schuster, 2007). The development of the PCR technique by Kary Mullis in 1983 and the exploitation of a polymerase enzyme formed the basis of these next generation sequencing technologies (Mullis, 1994). The second-generation technologies were powerful, offering massively parallel, high-throughput and cheap sequencing. They were mainly based on pyrosequencing, Illumina, or SOLiD technology (Voelkerding et al., 2009; Vigliar et al., 2015). These methods are discussed briefly below:

a) Pyrosequencing: While Sanger sequencing is based on dideoxy chain termination, pyrosequencing uses “sequencing by synthesis”, i.e., instead of terminating synthesis and reading the resulting fragments as in Sanger sequencing, pyrosequencing detects each nucleotide as it is incorporated into the strand complementary to the template (Nyren, 2007). Each nucleotide addition is accompanied by the release of a pyrophosphate, which generates a chemiluminescent signal; the signals are recorded as a pyrogram, from which the sequence of the strand can be read. The first automated pyrosequencer was developed by the company Pyrosequencing AB (now Biotage AB), and the technique was commercialised in the Roche 454 pyrosequencing instrument, which used the technique of
emulsion PCR (Ronaghi, 2001). The workflow of Roche 454 is given in Figure 2.

Figure 2. Flowchart depicting the process of pyrosequencing.
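How a pyrogram translates into a sequence can be sketched in a few lines of Python. The example assumes that nucleotides are dispensed one type at a time in a known order and that each light signal has already been normalised so that a value near 1.0 means one incorporation, near 2.0 a homopolymer of two, and so on. The function name, the dispensation order and the signal values are hypothetical and are not taken from any instrument software.

```python
def decode_pyrogram(dispensation_order: str, signals: list[float]) -> str:
    """Translate normalised flow signals into the newly synthesised strand.

    Each signal is rounded to the nearest whole number of incorporations;
    a value near 0 means the dispensed nucleotide was not incorporated.
    """
    if len(dispensation_order) != len(signals):
        raise ValueError("one signal is required per dispensed nucleotide")
    bases = []
    for base, signal in zip(dispensation_order, signals):
        count = max(0, round(signal))
        bases.append(base * count)
    return "".join(bases)


# Hypothetical run: nucleotides dispensed in the order T, A, C, G, repeated twice
order = "TACGTACG"
signals = [1.02, 0.05, 1.90, 0.00, 0.10, 1.10, 0.00, 2.95]
print(decode_pyrogram(order, signals))  # TCCAGGG
```

The decoded string is the strand being synthesised; the template strand is its reverse complement. Rounding becomes unreliable for long homopolymers, because the signal difference between, say, seven and eight identical bases is small.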

b) Illumina: Illumina sequencing technology applies the principle of Sanger sequencing, modified to use reversible dye terminators and bridge amplification PCR. The technology was developed by Solexa in 2006, and Solexa was acquired by Illumina in 2007 (Heather and Chain, 2016). It consists of four sequential steps, as described in Illumina 2015:

1. Library Preparation: In this step, the DNA is fragmented and specific sequences such as adapters, linkers and complementary oligos are ligated to the ends of the DNA fragments, after which the fragments are amplified by PCR.

2. Cluster Generation: This step takes place on a special glass slide known as a flow cell. The surface of the flow cell carries a lawn of two different types of oligos. The sequences complementary to these oligos were already attached to the DNA fragments as adapters in the previous step. The adapter-ligated fragments bind to the complementary oligos, and DNA polymerase synthesises the second strand by adding complementary nucleotides to the single-stranded DNA. This double-stranded DNA is denatured, and the original template is washed away, followed by
the clonal bridge amplification process. After clonal amplification, the reverse strands of the DNA fragments are washed away, leaving a lawn of forward strands on the flow cell.

3. Sequencing: Reversible terminator nucleotides carrying fluorescent dyes are used in this method; a fluorescent signal is emitted when a nucleotide complementary to the template strand is added. All four types of dNTPs are labelled with such a dye, sequencing proceeds with base-by-base calling for each cluster, and the emission wavelength and intensity are used to identify the nucleotide at each position. The read length is determined by the number of sequencing cycles. Sequencing can be done in two ways, i.e., as single-end or paired-end reads.

4. Data analysis: The millions of reads generated in the previous steps now need to be analysed to give the final sequences. This step involves the removal of adapter, linker or primer sequences from the reads and the alignment of forward and reverse reads to give contiguous sequences (contigs), which are then ordered into scaffolds. These scaffolds can be assembled to give the whole sequence, and numerous analyses can be performed on these sequences.

Illumina has designed a series of machines, each more advanced than the last, to carry out this process, e.g., the MiSeq, NextSeq or HiSeq series (Heather and Chain, 2016). Each of these machines has its specific utility in terms of whole genome, exome or metagenome sequencing, and it would not be wrong to say that the sequencing field is currently dominated by Illumina technology.

c) SOLiD technology: In 2005, Applied Biosystems came up with a new approach to NGS based on ligation. This technology, SOLiD (Supported Oligonucleotide Ligation and Detection), is based on “sequencing by ligation” (Voelkerding et al., 2009).
Libraries can be prepared from fragmented or mate-paired sequences: adapters are ligated onto the single-stranded DNA fragments, which are then attached to microbeads on which emulsion PCR is carried out to produce clones of the fragments. The microbeads carrying these clonally amplified fragments are attached to a glass slide similar to a flow cell, and primers complementary to the adapter
sequences are annealed. A unique set of 8-mer oligo probes is designed, and these probes are joined to the primer with the help of a ligase. The probes carry a fluorescent tag, and sequencing results from repeated rounds of primer hybridization and probe ligation, followed by imaging and cleavage of the probe to start another round. A single run of this process gives information on every 5th base, so the whole reaction is offset by one base and repeated four more times, and the complete sequence is decoded using two-base colour coding.

d) Ion Torrent: In 2010, Ion Torrent came up with a new semiconductor-based sequencing technology using the PGM (Ion Personal Genome Machine). This technology exploits the fact that the addition of a nucleotide by the polymerase is accompanied by the release of an H+ ion, which changes the pH of the solution (Rothberg et al., 2011). This change in pH is detected by a CMOS (complementary metal-oxide semiconductor) sensor, and the measured value is related to the number and type of nucleotides incorporated. Sample preparation is similar to the other methods, with fragmentation and adapter ligation followed by clonal amplification on beads. Each sensor well holds one bead carrying the template DNA, together with primers and DNA polymerase. This technology greatly reduced the cost of whole genome sequencing.
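The two-base colour coding used by SOLiD, mentioned under c) above, can be made concrete with a short Python sketch. It assumes the commonly described four-colour scheme, in which every pair of adjacent bases maps to one of four colours and the same colour is shared by four different base pairs. Representing the bases as the numbers 0 to 3 and taking the colour as their XOR is one convenient way to reproduce that scheme; the numeric colour labels and the function names are assumptions made here for illustration.

```python
# Assumed 2-bit labels for the bases; the colour of a dinucleotide is the
# XOR of the two labels, so AA, CC, GG and TT all give colour 0.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def encode_colours(sequence: str) -> list[int]:
    """Colour call for every pair of adjacent bases in the sequence."""
    return [BASE_TO_CODE[a] ^ BASE_TO_CODE[b]
            for a, b in zip(sequence, sequence[1:])]


def decode_colours(first_base: str, colours: list[int]) -> str:
    """Rebuild the bases from a known first base and the colour calls."""
    bases = [first_base]
    for colour in colours:
        previous = BASE_TO_CODE[bases[-1]]
        bases.append(CODE_TO_BASE[previous ^ colour])
    return "".join(bases)


colours = encode_colours("ATGGCA")
print(colours)                       # [3, 1, 0, 3, 1]
print(decode_colours("A", colours))  # ATGGCA
```

Because each base contributes to two neighbouring colour calls, decoding needs one known starting base (in practice the last base of the adapter), and a single wrong colour call would alter every base after it. This is why colour-space reads are usually compared with a reference in colour space rather than being decoded first.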

Third Generation Sequencing

While second generation sequencing revolutionised whole genome sequencing by reducing the time required, the short read lengths (less than 500 bp) of all these technologies limited the reliability of these methods for complete genome coverage. Shorter read lengths led to some functionally important genes being missed and to inaccurate gene counts, and they limited the study of structural variations (Ozsolak, 2012; Bleidorn, 2016). To overcome the shorter read lengths and high error rates, the third generation of sequencing came into being, with read lengths of up to 10,000 bp and lower error rates. These third-generation technologies do not need amplification steps and are able to sequence single molecules (Ozsolak, 2012; Bleidorn, 2016). To date, three commercial platforms are available for third generation sequencing.

a) PacBio SMRT: Pacific Biosciences introduced the Single Molecule Real Time (SMRT) sequencing method in 2010. This method is also based on “sequencing by synthesis.” Each SMRT cell has
thousands of ZMWs (zero-mode waveguides), and a DNA polymerase is attached at the base of each ZMW. The ZMWs are small wells that are nanometres in diameter. The bottom of each ZMW is illuminated, and the DNA template is loaded onto the polymerase (Ozsolak, 2012). Nucleotides labelled with different fluorophores are incorporated opposite the template DNA based on complementarity, and each incorporation releases a light signal that is monitored by real-time imaging.

b) Oxford Nanopore Technology (ONT): The first commercial nanopore sequencer, MinION, was released in 2014. The roots of nanopore sequencing date back to the late 20th century, when two independent research groups, Church et al. and Deamer and Akeson, proposed that each nucleotide base in a DNA sequence will produce a different change in ionic current while passing through a nanopore (Deamer and Akeson, 2000; Jain et al., 2018). The nanopore has electrolyte solution on both sides; when DNA passes through the nanopore, it partially blocks the current flow, and this blockage can be measured to determine the nucleotide sequence.

Computational Tools Development

The large volume of sequencing data, together with rapidly falling sequencing prices, created a need to manage these data so that they can be used by the research community (Kitano, 2002; Chain et al., 2003). This gave rise to the development of various computational tools for storing, analysing, and understanding the data. The 20th century became the era of sequencing, and the merging of the biological and computer sciences gave rise to the field of bioinformatics. These computational tools can be broadly classified into four types, a brief overview of which is provided below.

Whole Genome Analysis Tools

Various computational tools were developed in the wake of the sequencing era for the analysis of microbial, plant, animal and human genomes. Once the whole genome sequence (WGS) of an organism has been obtained, computational tools are needed for alignment, assembly, and annotation (Table 1). The whole genome sequences of
organism(s) added to a database are assessed for quality and genome characteristics using computational tools during pre- and post-submission processing. As sequencing instruments improve, more advanced bioinformatics tools are also being developed to keep pace with the sequencing output.

Disease and Drug Target Tools

The large volume of sequencing data also paved the way for the field of medical genomics. The possibility of identifying disease-related haplotypes, mutations and polymorphisms provided an opportunity to create disease-specific databases that store all genomic information pertaining to disease etiology. Genomic information also helped in the identification of potential drug targets through simulation and docking studies. Various docking tools for studying drug-target binding and interaction have been developed. A list of different drug-target identification tools is given in Table 1.

Evolutionary Studies

“Nothing in biology makes sense except in the light of evolution,” as the geneticist Theodosius Dobzhansky said. The field of evolutionary genomics benefitted greatly from advances in computational techniques (Sherbakov et al., 2013). Prior to the development of these tools, population geneticists calculated genetic distances manually in order to construct phylogenies and carry out other sequence analyses. Computational tools increased the validity of evolutionary studies by improving efficiency and, through data simulation, increasing the amount of genomic data that could be analysed. They made it possible to compare several phyla belonging to the three domains of life at once, thereby helping to understand the evolutionary events that led to the classification of these phyla. These tools are mainly used for multiple sequence alignment, phylogeny, SNP analysis, etc.
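The kind of genetic distance calculation that was once done by hand can be sketched briefly in Python. The example computes the proportion of differing sites (the p-distance) between two aligned sequences and then applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), one standard model among several. It is a minimal illustration, not the method of any particular tool in Table 1, and the sequences and function names are invented for the example.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap ('-') are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance: expected substitutions per site."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Two hypothetical aligned sequences that differ at one of ten sites
p = p_distance("ACGTACGTAC", "ACGTACGTTC")
print(p)                          # 0.1
print(round(jukes_cantor(p), 4))  # 0.1073
```

The corrected distance is slightly larger than the observed proportion because it allows for multiple substitutions at the same site. A matrix of such pairwise distances is the usual input to distance-based tree-building methods.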

Structural Modelling

The fusion of biological information with computational tools helped in visualising the structures of complex biomolecules through structural
modelling. The tools used for structural modelling derive information from genomic databases and use computational chemistry to calculate the energy of all possible models, thereby helping scientists to understand the proteins and nucleic acids found inside the human body (Genheden et al., 2017). Data from different protein structure determination techniques such as crystallography, XRD, etc. formed the basis of this structural modelling. Using the available information about known proteins, structures can be predicted by homology modelling as well as ab initio modelling. The foundation of these modelling tools was the Ramachandran plot (Hooft et al., 1997).

Table 1. List of computational tools used in genomic studies

Tool | Use | Reference
ACT | Artemis Comparison Tool: visualisation and analysis of whole genomes | Carver et al., 2011
Mauve | Multiple genome alignment | Darling et al., 2004
BRIG | BLAST Ring Image Generator: multiple genome comparisons for prokaryotes | Alikhan et al., 2011
SPAdes | Genome assembler | Bankevich et al., 2012
Velvet | Assembler | Zerbino, 2010
RAST | Rapid annotation based on subsystem technology: prokaryotic genome annotation and comparison tool | Aziz et al., 2008
Prokka | Prokaryotic genome annotation tool | Seemann, 2014
VISTA | Visualization Tool for Alignment: genomic sequence comparison | Frazer et al., 2004
PipMaker | Percent Identity Plot Maker: genomic sequence comparison | Schwartz et al., 2000
EDGAR | Whole genome comparison | Blom et al., 2016
Bowtie/Bowtie 2 | Alignment of sequencing reads | Langmead, 2009
BWA/BWA-SW | Burrows-Wheeler Aligner: aligning sequence reads against reference genomes | Li and Durbin, 2009
RMAP | Sequence mapping tool | Smith et al., 2009
mrFAST/mrsFAST | Micro Read Fast Alignment Search Tool: read mapping tool | Alkan et al., 2009
PASS | Aligns short sequences; mainly used for reads generated by Solexa, SOLiD or 454 technologies | Campagna et al., 2009
SOAP/SOAPv2/SOAPv3 | Aligns short sequences generated by Illumina/Solexa onto reference sequences | Li et al., 2008
MOM | Maximum oligonucleotide mapping: short read mapping programme more sensitive than the existing programmes | Eaves and Gao, 2009
SHRiMP2 | Short Read Mapping Programme: sensitive yet practical short read mapping | David et al., 2011
GASSST | Global alignment tool for short sequences | Rizk and Lavenier, 2010
BFAST | Mapping tool for large-scale genome resequencing | Homer et al., 2009
BindingDB | Publicly accessible database containing data on interactions between proteins and small molecules | Gilson et al., 2015
BioGRID | Biological General Repository for Interaction Datasets: open access database focusing mainly on model organisms and human; contains annotations and archives of protein, genetic and chemical interactions | Chatr-Aryamontri et al., 2017
canSAR | Public integrative database focussed on cancer translational research and drug discovery | Bulusu et al., 2013
CARLSBAD | A confederated database of chemical bioactivities that can help in determining potential drug targets | Mathias et al., 2013
ChEMBL | An open access database for information on potential drug targets | Gaulton et al., 2011
CSNAP | Chemical Similarity Network Analysis Pulldown: a computational target identification method for drug target analysis | Lo et al., 2015
DGIdb 3.0 | Consolidated database containing information on drug-gene interactions from papers, websites or databases | Cotto et al., 2017
DINIES | A webserver for predicting drug-target interactions based on existing biological information | Yamanishi et al., 2014
DisGeNET | A repository of information on disease-related genes and their variants | Pinero et al., 2016
DnaSP | Tool for analysis of DNA polymorphism data | Librado and Rozas, 2009
MEGA | Desktop application for evolutionary analysis of DNA and protein sequence data | Kumar et al., 1994
PHYLIP | A tool for constructing phylogenetic trees based on both distance and character-based methods | Felsenstein, 1986
PAUP | Constructs phylogeny based on maximum parsimony | Swofford, 1990
EMBOSS | Open source software that contains different tools for sequence analysis | Rice et al., 2000
ESyPred3D | Prediction of 3D structures of proteins | Lambert et al., 2002
YASARA | Structural prediction tool for biocatalytic reactions | Land and Humble, 2018
SWISS MODEL | Most widely used tool for protein tertiary and quaternary structure prediction | Schwede et al., 2003
PHYRE2 | Protein modelling tool | Kelley et al., 2015
Arlequin 3 | Integrated software package for population genetics studies | Excoffier et al., 2005
Simcoal2 | Population genetics tool using coalescent simulations | Antao et al., 2007
CoaSim | Simulation tool for genetic data | Mailund et al., 2005
Other structural modelling tools | APSSP2, PROCLASS, PSA, RPFOLD, BTeval, GammaPred, AlphaPred, BetaTPred, Modeller, Procheck, Psipred, Prof, MEmstat |

Advent of Human Genome Projects

A Brief History

The rediscovery of Mendel’s laws of genetics and the knowledge that genes are the units of inheritance carried on chromosomes were revolutionary (von Tschermak-Seysenegg, 1951). Chromosomes carry genetic information in the form of genes, which regulate various metabolic processes; any disruption of their function can lead to disease. Understanding the etiology of various genetic and molecular disorders therefore became important for researchers. This created a pressing need to understand the genomic organisation of an individual, and the long effort to unravel the human genome began. Until the late 20th century, decoding the sequence of the human genome remained a distant dream, although the prospect of sequencing it was already widely debated. The idea of a human genome sequencing project was first put forward by Robert Sinsheimer, Chancellor of the University of California, Santa Cruz, and was discussed at a meeting in 1985 (Chial, 2008; International Human Genome Sequencing Consortium, 2001). Later, Charles DeLisi, then assistant director of health and environmental research at the Department of Energy (DOE), funded an early genome project in 1987. However, the involvement of a physics-driven research institution in breaking the code of life was questioned by the biological fraternity. Finally, in 1990 the US Congress funded both the NIH (National Institutes of Health) and the DOE for a 15-year, 3-billion-dollar Human Genome Project. In 1998, J. Craig Venter, in collaboration with Perkin-Elmer Instruments (Boston, MA), formed a company, Celera Genomics, with the sole aim of sequencing the human genome (Collins, 1999; Collins, 2003). Twenty different groups from different parts of the
world joined together to make the HGP a success within the stipulated timeline (Venter et al., 2001). Sequencing the 3 billion base pairs (bp) of the human genome was not an easy task considering the sequencing technologies available at that time. Hence, the major focus of the initial 5-year period of the project was to develop physical and genetic maps of the human genome. Developments in molecular biology techniques, in the form of recombinant DNA technology, the discovery of restriction enzymes, and cloning, proved to be the major driving force behind the methodology selected for whole genome sequencing. The methodologies adopted by the publicly funded genome project and Celera Genomics were quite different (Venter et al., 2001). The publicly funded project mainly relied on BACs (bacterial artificial chromosomes) and followed hierarchical shotgun sequencing. The large DNA fragments (roughly 150 kb) inserted into the BACs were first mapped to find the position of each fragment in the genome. Each of these large fragments was then shotgunned into smaller stretches of DNA, and the BAC fragments were sequenced repeatedly to obtain overlapping sequences that could be assembled into the whole genome. Celera Genomics, on the other hand, used the whole genome shotgun method, sequencing DNA from five individuals with different geographical backgrounds. The sequencing process was based on the dideoxy chain termination method, with sequence determination on the ABI Prism 3700 DNA Analyzer. Celera also used the BAC clone sequence data generated by the publicly funded HGP and employed two assembly strategies, i.e., whole genome assembly and regional chromosome assembly, to obtain whole genome sequences. The beginning of the new millennium was marked by the release of the first sequence of a human chromosome, chromosome 22, in December 1999 (Dunham et al., 1999).
Finally, in 2001 the two groups independently released initial drafts of the human genome sequence, and in 2003 a final version of the human genome sequence was published (Venter et al., 2001).

Significant Outcomes

The HGP was a one-of-a-kind project and had a major effect on the fields of medical biology. The following are the significant outcomes of the HGP (International Human Genome Sequencing Consortium, 2001):

• The HGP generated both a physical and a genetic map of the human genome, and a comparison of the two showed that the average rate of recombination is inversely proportional to the length of the chromosome and is higher in the distal regions of the chromosome than near the centromeres.
• The whole genome sequences of several model organisms, including the fruit fly, Arabidopsis and several bacterial strains, were also completed and made publicly available.
• The genomes of different individuals are 99.9% identical, with only 0.1% accounting for the variation seen in the human population.
• Less than 2% of the genome encodes proteins; the genome contains around 21,000 protein-coding genes, whereas about half of it consists of repetitive elements.
• More than 1.4 million SNPs were found in the genome; however, the functional significance of synonymous SNPs opened another area of experimental research.
• Mutation rates are sex dependent, with males having a mutation rate twice that of females.
• The availability of genome sequences in the public domain proved to be a boon for medical biotechnologists, allowing easier identification of disease-related genes and their paralogs and the study of various diseases linked to chromosomal aberrations.
• Potential drug targets for various metabolic disorders can be identified more easily by cloning disease-related genes or through simulation studies.
• The “book of life” will become a reference point for all future sequencing projects.

These outcomes gave rise to two further projects as by-products of the HGP.

ENCODE

After completion of the initial HGP, the National Human Genome Research Institute (NHGRI) began a project in 2003 to identify all functional elements in the human genome, initially focussing on 1% of the genome. The project was named ENCODE (Encyclopaedia of DNA Elements). The pilot phase of the project took 4 years to complete, and the project was then funded for a second and a third phase to cover whole genome analysis of the human and mouse genomes. The major goals and outcomes of the project are:

• Identification of promoters, enhancers, and transcribed regions of the human genome
• Making resources and information freely available
• Testing and comparing the existing methods for identification of functional elements
• Identifying gaps in the genome and building a strategy to fill them
• The project successfully assigned a functional role to 80% of the human genome.

GENCODE

As part of the ENCODE project, the GENCODE collaboration was established to annotate all evidence-based gene features in the human genome, with an initial focus on 1% of the genome. The consortium combines computational techniques, manual annotation, and experimental validation of genes. Through this consortium, a reference annotation database for the human genome was established. This international sequencing project has opened up several areas of research. Knowledge of the genome and its variations will enable a better understanding of the evolutionary processes of speciation and divergence. Furthermore, it will be crucial for the treatment of metabolic and genetic disorders.

Development of Genomic Database

One of the crucial clauses of the HGP and other sequencing projects was that the genomic information generated be shared with the scientific community. Since the first molecule to be sequenced was a protein, insulin, the first biological database to be created was the Protein Data Bank. Later, the Bermuda Principles for the Human Genome Project required the rapid public release of sequence data. Thus, after the HGP, the field of biology was flooded with a large pool of genomic information that needed to be organised properly to yield meaningful outcomes, which created a need for genomic databases. These databases can be divided into three major categories based on the information they contain (Table 2).


Table 2. Genomic databases and their uses

Name | Use | Reference
ARED-Plus | Nucleotide sequence database for AU-rich elements | Bakheet et al., 2017
lncRNASNP2 | Nucleotide sequence database focusing on long non-coding RNAs | Miao et al., 2017
miRTarBase | Nucleotide sequence database for microRNAs | Chou et al., 2017
The Sequence Read Archive (SRA) | Repository for experimental raw sequencing data | Leinonen et al., 2010
EBI patent sequences | Non-redundant databases of patent DNA and protein sequences | Li et al., 2013
NCBI BioSample/BioProject | Stores sequence data identified based on sequencing projects | Barrett et al., 2011
European Genome-phenome Archive (EGA) | Repository for genetic and phenotype data from research projects | Lappalainen et al., 2015
European Nucleotide Archive | Nucleotide sequence database | Silvester et al., 2017
SMART (Simple Modular Architecture Research Tool) | Protein sequence database | Schultz et al., 1998
PIR (Protein Information Resource) | Integrated proteomic and genomic sequence database | Wu et al., 2003
The Transporter Classification Database (TCDB) | Curated database containing sequence, functional and structural information on transporter proteins | Saier et al., 2006
PFAM | Curated database of protein families | Finn et al., 2013
OMIM (Online Mendelian Inheritance in Man) | Database of human genes and genetic disorders | Hamosh et al., 2005
Cancer Resource | Comprehensive database that integrates all cancer-related information, including cancer-relevant target genes and proteins | Ahmed et al., 2011
HAGR (Human Ageing Genomic Resources) | Contains two main branches, i.e., GenAge, which contains information on genes related to human ageing, and AnAge, which describes the ageing process in several organisms | De Magalhaes et al., 2005
HERVd (Human Endogenous Retrovirus database) | Database providing information on retroviral elements in the human genome | Paces et al., 2002
HCAD (Human Chromosome Aberration Database) | Database storing information on genes involved in chromosomal aberrations for identification of breakpoints | Hoffmann et al., 2005
Exome Aggregation Consortium | A freely available open source browser that provides information on genes and gene transcripts with clinical implications | Karczewski et al., 2016


Primary Databases

Primary databases are repositories of first-hand experimental data. Sequences or structures are submitted directly to the databanks and are assigned unique accession IDs for universal use. There are four major primary databases:

a) GenBank: Maintained by the NCBI (National Center for Biotechnology Information) of the NLM (National Library of Medicine), GenBank contains nucleotide sequences deposited by individual researchers or whole genome sequencing projects. It contains sequences from around 300,000 organisms, and all the sequences are freely available to the public (Benson et al., 2008).

b) EMBL (European Molecular Biology Laboratory): Established in 1980 and maintained by the European Bioinformatics Institute (EBI), the EMBL database, like GenBank, contains primary nucleotide sequences submitted by individuals or collaborative projects. All submitted data are released immediately to the public (Stoesser et al., 2002). EBI has a search engine named Sequence Retrieval System (SRS) for searching the sequence data.

c) DDBJ (DNA Data Bank of Japan): The third publicly available nucleotide database is maintained by the National Institute of Genetics in Japan. The DDBJ database was established in 1987 and is updated regularly (Mashima et al., 2016).

These three major databases have collaborated to form the International Nucleotide Sequence Database Collaboration (INSDC) to share sequence information with one another and make all the information available to the public (Cochrane et al., 2015). The INSDC also holds raw sequence data in the form of the SRA (Sequence Read Archive).

d) PDB (Protein Data Bank): In 1971, the first macromolecular structure database was established by Brookhaven National Laboratory (BNL). It started with seven structures, and improved NMR and crystallographic techniques increased the number of structures deposited in this data bank. Later, in 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) was given the responsibility of maintaining and managing the PDB. The RCSB developed ADIT (AutoDep Input Tool) for the submission of data, which are then annotated and validated (Berman et al., 2000).


Secondary Databases

Secondary databases take the raw data from the primary databases and add specific sequence features to them. They provide information such as conserved motifs, signature sequences, or active sites of molecules, and they mainly store the structural and functional annotation of raw DNA or protein sequences in a curated manner. The major secondary databases are:

a) Ensembl: Established in 1999 as a collaboration between the Wellcome Trust Sanger Institute and the EMBL European Bioinformatics Institute, ahead of the completion of the HGP, the Ensembl project initially focussed on human genome annotation but now includes more than 80 vertebrate species. Gene prediction is carried out using both ab initio and homology-based methods to identify genes with known functions as well as novel genes. This tool provides both functional and structural information on gene sequences (Hubbard et al., 2002).

b) Swiss-Prot: This is an annotated protein sequence database that differs from other databases in its improved annotation, minimal redundancy and integration with other databases. It is now maintained in equal partnership by EMBL and SIB (Swiss Institute of Bioinformatics) (Bairoch and Apweiler, 2000).

c) InterPro, the integrative protein signature database: InterPro is an integrative protein database that combines several other databases, which provide information on protein families, domains, functions, etc. Because it is a collaborative project between different databases, redundancy is reduced (Apweiler et al., 2001).

d) UniProt, the Universal Protein knowledgebase: Formed in 2002 through the collaboration of the Swiss-Prot, TrEMBL and PIR protein databases, this database also provides accurate and updated information on protein sequences (Apweiler et al., 2004). The major goal of this consortium was to provide fully classified, rich and accurately curated protein information. This database consists of three layers:

i) The UniProt Archive (UniParc) stores all publicly available protein sequence data.
ii) The UniProt Knowledgebase (UniProt) provides sequence and functional annotation for the proteins.
iii) The UniProt NREF databases (UniRef) use UniProt to provide non-redundant data collections.

Sujata Mohanty and Kopal Singhal

Composite Databases

Composite databases are an amalgamation of different primary databases. They are specialised resources in which, instead of searching multiple primary databases, the user can gather information from a single composite database. These databases focus either on a disease or on a specific taxonomic order; for example, OMIM (Online Mendelian Inheritance in Man) contains information on the genes and proteins involved in genetic disorders, and DrugBank provides information on drugs and their targets.

Genome Annotation and Comparison

The genomic information generated by the various sequencing projects remains of little value until biological information is added to the sequenced genes or genomes through a process known as annotation (Stein, 2001). Annotation is the step that follows whole genome or metagenome sequencing (Brent, 2008; Petty, 2010). It bypasses the need for tedious experimental studies to determine the function of every protein sequence, and it has led to the development of various annotation and gene prediction tools. These tools help to identify the structure as well as the functional roles of genes on the basis of different prediction models (ab initio or homology based). All this information is stored in the genomic databases discussed earlier and made available to the public. Analysis of gene structure, function and roles in different metabolic pathways helps in understanding the genome of any organism and in performing comparative studies of several genomes. Comparative genomics enables scientists to understand gene distribution, structural variation, and similarities across different genera (Reed et al., 2006). It also plays a key role in evolutionary biology, through phylogenetic analysis of conserved genes and through the study of the pattern of divergence of these genera.

Paradigm Shift: Traditional to Genome Medicine

Common diseases such as flu, typhoid, tuberculosis, jaundice and malaria are treated by physicians on the basis of traditional clinical diagnosis and management, focusing mainly on signs, symptoms and clinical sample test results, and sometimes on family history (Liao and Tsai, 2013). The concept
of personalised medicine is not new to the healthcare sector, as it has existed since ancient times in "Ayurveda". Ayurveda is based not only on the symptoms of a disease or an unhealthy body; it also characterises individuals by their physiological and metabolic constitutions in terms of three doshas, namely vata, pitta and kapha, which correspond to movement, metabolic transformation, and growth and support, respectively. The same concept has been explained convincingly in the post-genomic era under the term "systems biology" (Jayasundar et al., 2018). 'Omics' technologies have increased our understanding of the cell as a system and fuelled the integration of systems biology into health care and clinical medicine (Jayasundar et al., 2018). These advances continue to broaden our understanding of human complexity and point towards a patient-centric systems approach. Thus, it is time to take a fresh look at the Indian medical system of Ayurveda, which is known for its integrated approach to health and disease. Individuals are born with unique biological characteristics. Genome medicine is a relatively new paradigm of evidence-based medicine that takes into account an individual's unique biological and genetic characteristics. It brings a revolutionary change to the biomedical sciences, made possible by technical advances in genomics and bioinformatics methods. Genomics, essentially the science of genes and their interactions, helps us to understand the causes of many deadly or congenital diseases, e.g., cancer, neurological disorders and various birth defects. Applying genomic approaches to disease diagnosis and treatment has proven to be one of the most effective ways of treating various diseases. Large-scale genetic cohort studies help to identify disease-causing genetic factors and can provide possible drug targets to pharmaceutical companies for developing new drugs (Liao and Tsai, 2013).
Thus, genome medicine now offers new hope for the effective treatment of many diseases.

Need of Genome Medicine: Broader Vision

Despite technical advances in diagnostic tools and the adoption of various computerised and image-based test analysis methods, timely diagnosis remains difficult because several diseases show similar symptoms in their initial stages, which can lead the physician to prescribe incorrect or multiple drugs to the patient at a time. The undifferentiated febrile illnesses are a prime example (Shrestha et al., 2018). The completion of the first Human Genome Project in 2003 attracted researchers to
understand the molecular mechanisms behind ill health and disease conditions. Building on the successful functional and structural annotation of the genome, further genomic features such as cis-regulatory regions and exon-intron structural variations were explored, and disease-oriented human genome projects were initiated, for example for cancer and rare genetic disorders. Over the last decade, the generation of ever larger genomic data sets, enabled by technological improvements, has led to a vast expansion in our understanding of the genomic architecture of human disease (Raghavan and Vassy, 2014). However, the implementation of genomic information in medical science, as genome medicine or in routine clinical care, has grown at a comparatively slow pace. Practitioners and physicians must play a crucial role by changing their perspective on bringing genomic technology into routine clinical care (Raghavan and Vassy, 2014). To achieve this goal, they need to develop plausible mechanisms for enhancing the usefulness and overcoming the limitations of genomic medicine for the betterment of society.

Genetic Diversity

Each individual is unique with respect to genotype and phenotype, and the reason lies in their genetic makeup. By comparing the genomes of different individuals, genetic variations in the form of insertions, deletions, duplications, point mutations, copy number variations, etc., can be estimated. Genetic diversity can be measured from the sum of all these genetic variations existing in a population. Genetic variability thus provides the raw material on which natural selection and other evolutionary forces act and which decides the ultimate fate of gene adaptation. The integration of evolutionary science into healthcare and disease research should be acknowledged as a high-priority area for estimating the efficacy and long-term usefulness of any drug. Mutations in functional genes mostly cause the genome to encode improper proteins, resulting in various abnormalities and disease conditions. In the decade and a half since the Human Genome Project was completed, genome-wide analysis has made it possible to identify human genetic variations more accurately and efficiently. Large-scale genomic studies have accelerated the identification and establishment of new associations between DNA variants and human disease, and the credit goes to the tremendous technological advances of the last two decades (Iafrate et al., 2004; Sebat et al., 2004). At present, technological progress makes it possible to examine the entire genome of an individual for disease-associated variation, rather than focusing on one specific gene or several genes.
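The text does not name a specific diversity measure. As a minimal sketch, assuming aligned sequences of equal length, one widely used summary statistic is nucleotide diversity: the average number of pairwise differences per site among the sampled sequences. The function names and the four toy sequences below are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def pairwise_differences(seq_a, seq_b):
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def nucleotide_diversity(sequences):
    """Average number of pairwise differences per site over all sequence pairs."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return total / (len(pairs) * length)


# Hypothetical toy sample: four aligned sequences of ten sites each.
sample = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC", "ACGTACGTAC"]
pi = nucleotide_diversity(sample)
print(f"nucleotide diversity = {pi:.3f}")
```

Point mutations are the only variation type this sketch counts; insertions, deletions and copy number variations would need an alignment-aware or read-depth-based treatment.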

It has been observed that the response to a drug varies among patients, which may be due to various genetic and non-genetic factors. In many cases, drug responses are associated with variations in genes that encode drug-metabolizing enzymes, drug transporters, or drug targets (Choi et al., 2015). A broad range of genetic polymorphisms is observed in most of the proteins involved in drug metabolism and transport, and the resulting differences in metabolic activity may cause inter-individual variability in the response to a given concentration or dose of a drug (Choi et al., 2015). The first draft of the human genome opened a new era in understanding the evolution of gene and genome architecture in humans and in associated diseases (Venter, 2001). The release of a huge data set and the development of new tools and technologies provided an excellent platform for international researchers to establish new collaborations, share data, and design other disease-specific human genome projects to characterize new sets of genomic variation in diverse populations belonging to specific regions, cultures and ethnic groups. As a result, the first International HapMap Project (http://www.hapmap.org) was carried out. Its outcome was a catalogue of common genetic variation in eleven global populations, and further analyses provided hints of different ancestries. This haplotype map, or "HapMap", is essentially a tool that helps researchers to find disease-related genes and associated genetic variations. Any two individuals are known to differ by only 0.5 percent in their DNA sequences, but these small variations may greatly affect their lifestyle and susceptibility to disease. Single nucleotide polymorphisms (SNPs) are known to be one of the most widespread types of DNA sequence variation present in genomes (Fareed and Afzal, 2013).
SNPs are used as genetic markers in evolutionary and population genomics studies and have become very popular for explaining the heritable risk factors of common genetic diseases. SNPs that lie close to each other tend to be inherited together, owing to hitchhiking, and are thus inherited as blocks. Such a block of SNPs and its pattern forms a haplotype. The mapping of these SNP blocks across the entire genome constitutes the HapMap, and a set of specific SNPs (tag SNPs) is used to identify a particular haplotype. The HapMap is very helpful in genome scanning because it allows researchers to focus on the tag SNPs instead of all the millions of SNPs, making it an efficient and comprehensive way of finding regions that contain disease-causing genes. In comparative genomics studies, these tag SNPs are of great help in finding chromosome regions that have different haplotype distributions among groups of people with or without a disease, or on the basis of response to
different drugs. Thus, it becomes easier to characterize a new set of SNPs or a gene variant that contributes to those diseases or responses, which allows pharmaceutical companies to develop effective drugs or vaccines specific to an individual, group or population. The follow-up 1000 Genomes Project, under which the genomes of 2500 individuals from 26 global populations were sequenced, greatly expanded our knowledge of human genetic variation (Via et al., 2010). The data available in various genomic databases offer biomedical researchers broad scope for the analysis of human genetic variation and have led to new research projects and technologies, such as whole genome sequencing (WGS), whole exome sequencing and genome-wide association studies (GWAS), for identifying more disease-gene loci. Thus, next-generation sequencing (NGS) projects are revolutionizing our understanding of genetic variation and its disease-causing potential (Van Dijk et al., 2014). The ever-mounting volume of genomic information and its meaningful interpretation help us to understand the molecular mechanisms of disease conditions and lead us to effective drug targets; the dawn of the genomic medicine era is therefore not far off (Willard et al., 2005; Muenke, 2013). Similarly, studies of variation in gene expression have become highly informative in finding determinants of human disease susceptibility, because mutations in non-coding regions actively influence the gene expression pattern. Genome-wide comparative studies help in two ways: in finding genetic variation associated with human disease, and in predicting cis-regulatory regions of DNA with the potential to alter disease gene expression, ultimately leading to a disease condition.
Such large-scale genome-wide association studies have reported common variants in non-coding regions that are directly or indirectly associated with an increased risk of several human diseases, including asthma, some cancers, diabetes and autism. Knowledge of the regulatory landscape surrounding disease gene loci may guide its clinical translation and the design of effective genome medicine. In addition, genomic testing is used to characterize insertions and deletions (indels), duplications and copy-number variations, which are used to diagnose various diseases, including rare genetic disorders. As a follow-up, more human genome projects should be launched to define genetic variation worldwide, so that the gaps can be filled by exploring other populations as well, with the promise of more tailor-made medicines. The growth of genomics research over the coming years will introduce us to more
candidate genes associated with mental and neurological disorders, which are becoming increasingly common.
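The haplotype-block and tag-SNP idea described above can be made concrete with a small sketch. It assumes phased haplotypes coded as 0/1 alleles, uses the squared correlation r² as the linkage disequilibrium measure, and selects tag SNPs with a simple greedy rule. The 0.8 threshold, the function names and the toy haplotypes are illustrative assumptions, not the procedure used by the HapMap project.

```python
def allele_freq(haplotypes, i):
    """Frequency of allele 1 at SNP i across all haplotypes."""
    return sum(h[i] for h in haplotypes) / len(haplotypes)


def r_squared(haplotypes, i, j):
    """Linkage disequilibrium (r^2) between SNP i and SNP j."""
    n = len(haplotypes)
    p_i = allele_freq(haplotypes, i)
    p_j = allele_freq(haplotypes, j)
    p_ij = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = p_ij - p_i * p_j
    return d * d / denom


def pick_tag_snps(haplotypes, threshold=0.8):
    """Greedily pick SNPs until every SNP is tagged at r^2 >= threshold."""
    untagged = set(range(len(haplotypes[0])))
    tags = []

    def captured_by(snp):
        return {t for t in untagged
                if t == snp or r_squared(haplotypes, snp, t) >= threshold}

    while untagged:
        # Choose the SNP that captures the most still-untagged SNPs.
        best = max(sorted(untagged), key=lambda s: len(captured_by(s)))
        tags.append(best)
        untagged -= captured_by(best)
    return tags


# Hypothetical phased haplotypes: SNPs 0-2 form one block, SNP 3 is independent.
haps = [
    [0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 0], [1, 1, 1, 1],
    [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0],
]
tags = pick_tag_snps(haps)
print("tag SNPs:", tags)
```

In this toy data a single SNP represents the whole three-SNP block, so two tag SNPs stand in for four typed positions, which is the saving the text attributes to the HapMap.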

Shortcomings of Traditional Medicine

Traditional medicine, although consistently used on a priority basis, carries many risks along with its benefits. Each country has its own body of medical knowledge based on its culture and past experience. Therefore, health problems and the traditional medicines used against them are very specific and vary from one country to another. The lack of reliable information and the poor quality of certain traditional medicines are problematic, and consumers who are unaware of the potential risks keep using them on a regular basis. Reliable public information on these tailor-made traditional medicines, their benefits and their potential risk factors needs to be made available to avoid unnecessary harm. A few of the shortcomings of traditional medicine are listed below:

• Lack of reliable sources and ineffective methods for the dissemination of information
• Lack of awareness, leading to misuse of imported traditional medicines
• Lack of meaningful communication among different stakeholders (international organizations, manufacturers, regulatory bodies, regional authorities, suppliers, etc.), which prevents collaboration and knowledge sharing
• Inadequate knowledge of dose/concentration and toxicity/side effects
• Lack of strict control over the availability of drugs to the public without a doctor's prescription
• Differential response of some people to certain drugs

Thus, the integration of genomics into Ayurveda and modern medicine will increase our understanding of health problems at the molecular level and help us to formulate appropriate drugs against them.

GWAS and QTL Mapping

Phenotypic diversity is a direct consequence of both genotypic and environmental variation in a population (Rahim et al., 2008). Over the past decades, several studies have been conducted to analyse the genetic basis of human
phenotypic variation at both the classical and the molecular level. Exploring the molecular basis of phenotypic variation is a prime goal of human genetics, encompassing disease susceptibility, variable response to drugs and, ultimately, treatment and public health (Stranger et al., 2007). Previous studies have also investigated the effects of nucleotide variation in specific genes or genomic regions on complex and monogenic diseases and their phenotypic manifestations in humans. Technological advances have now made genome-wide association studies (GWAS) a reasonable and affordable approach to the study of complex phenotypes (Bush and Moore, 2012). SNPs can pose a significant risk to human health, which compels biologists to study their associations, as discussed previously. Recently, there has been an explosion of genome-wide studies examining the genetic basis of complex diseases by exploring the effects of genetic variation such as SNPs and copy number variants (CNVs), which are observed in both coding and non-coding regions of the genome. GWAS of SNPs are conducted to identify candidate genes and favourable alleles for controlling various complex diseases. The concept of quantifying traits by their range of phenotypic variation existed even before DNA was known to be the hereditary material; blending inheritance, in which a trait does not exactly match either parent but appears as a mixture of the two parental characters, is a good example. Both earlier and at present, linkage mapping of candidate genes is performed on the basis of the recombination frequencies between them, in order to gain insight into their inheritance relationship, e.g., the possibility of any kind of epistatic interaction among them. Linkage mapping is also called genetic mapping, as it provides a logical view of the inheritance pattern of genes.
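The link between recombination frequency and genetic map distance mentioned above can be illustrated with a short calculation. This is a minimal sketch: the offspring counts are hypothetical, and Haldane's mapping function (which assumes no crossover interference) is one common choice among several, not a method prescribed by the text.

```python
import math


def recombination_fraction(recombinants, total):
    """Fraction of offspring that are recombinant for two loci."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("counts must satisfy 0 <= recombinants <= total, total > 0")
    return recombinants / total


def haldane_cm(r):
    """Haldane map distance in centimorgans for a recombination fraction r < 0.5."""
    if not 0 <= r < 0.5:
        raise ValueError("Haldane's function is defined for 0 <= r < 0.5")
    return -50.0 * math.log(1.0 - 2.0 * r)


# Hypothetical cross: 120 recombinant offspring out of 1000 scored.
r = recombination_fraction(120, 1000)
d = haldane_cm(r)
print(f"r = {r:.2f}, map distance = {d:.2f} cM")
```

For tightly linked loci the map distance is close to 100 × r; it grows faster than r as the loci move apart, because double crossovers go unobserved in the recombinant count.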
Genetic mapping of quantitative trait loci (QTL) of biomedical interest thus offers a powerful approach for efficiently localizing functional disease-associated genes, and their variants that regulate various cellular and biological processes, on the chromosomes. In addition, the genetic variations studied help to discover putative regulatory regions of traits and to define novel functional implications of genetic variants (Liu, 2012). Gene-gene and gene-environment interactions are common, which makes these loci difficult to analyse (Complex Trait Consortium, 2003). Several efforts have been made in QTL mapping and GWAS. Identifying QTLs and mapping them accurately requires a large sample size and is constrained by the limited number of polymorphic loci between parents. In the case of pleiotropic genes, quantifying their individual contribution to the phenotypic effect becomes critical. Disease-based QTL
mapping helps us to understand the biology of complex traits and diseases and greatly enhances the power of genetic association studies. Identifying the genetic loci associated with various human diseases would aid in understanding the hereditary mechanisms underlying them. Previous research shows that environmental factors can also influence disease gene expression, as observed in many disease conditions, e.g., cardiac and respiratory problems (Fave et al., 2018). Thus, multi-environment testing is necessary to determine whether the effects of QTLs are due to different genes or to different environments. The development of statistical methods and computational tools has made it somewhat easier to analyse QTLs and their linkage associations (Almasy and Blangero, 2009). Human QTLs are being confirmed by several independent studies despite their complex associations, such as gene-gene or gene-environment interactions, and a few potentially functional QTLs present within genes have also been confirmed by in vitro functional assays. Thus, QTL mapping is an increasingly pragmatic approach to identifying and manipulating complex traits that are important in evolution and medicine (Figure 3) (Complex Trait Consortium, 2003). Many complex disease-associated traits have been mapped in humans (obesity, rheumatoid arthritis, malaria, etc.) and mice (Stylianou et al., 2004; Casellas et al., 2009).
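As a rough illustration of how a QTL scan relates markers to a quantitative trait, the sketch below performs single-marker regression: the trait value is regressed on the genotype dosage (0, 1 or 2 copies of an allele) at each marker, and the marker explaining the most variance is reported. The marker names, genotypes and trait values are hypothetical, and the sketch leaves out interval mapping, covariates, environmental effects and multiple-testing correction, all of which a real analysis would need.

```python
def single_marker_regression(genotypes, phenotypes):
    """Least-squares slope and R^2 of phenotype regressed on genotype dosage."""
    n = len(genotypes)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need matching genotype/phenotype lists of length >= 3")
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((p - mean_p) ** 2 for p in phenotypes)
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    if sxx == 0 or syy == 0:
        raise ValueError("genotypes and phenotypes must both vary")
    slope = sxy / sxx            # estimated additive effect per allele copy
    r2 = sxy * sxy / (sxx * syy)  # fraction of trait variance explained
    return slope, r2


def scan_markers(marker_genotypes, phenotypes):
    """Run the regression at every marker and return the best one by R^2."""
    results = {name: single_marker_regression(g, phenotypes)
               for name, g in marker_genotypes.items()}
    best = max(results, key=lambda name: results[name][1])
    return best, results


# Hypothetical data: six individuals, two markers, one quantitative trait.
trait = [10.0, 12.5, 13.5, 9.5, 11.5, 15.0]
markers = {
    "marker_A": [0, 1, 2, 0, 1, 2],
    "marker_B": [0, 2, 1, 2, 0, 1],
}
best, results = scan_markers(markers, trait)
print("strongest marker:", best, "slope, R^2 =", results[best])
```

Applied to disease status rather than a continuous trait, the same marker-by-marker scan underlies a basic GWAS, typically with a contingency-table or logistic-regression test in place of the linear fit.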

Figure 3. Factors influencing the evolution of genomic medicine.

Implementation of Genome Medicine

Success, Challenges and Opportunities

The implementation of genome medicine in society has produced several success stories, some of which are discussed below:

Cancer

The etiology of cancer remained a mystery throughout the pre-genomic era. Researchers were trying to understand the causes and the molecular mechanisms behind the onset of various cancers; however, the limited molecular biology technologies of the time impeded the discovery of genes and associated mutations. Despite these limitations, as many as 291 genes associated with cancer were identified before the completion of the Human Genome Project (Wheeler and Wang, 2013). After the completion of the HGP, in 2007, a whole exome sequencing approach based on PCR and dye-terminator sequencing was used to sequence the coding exons of 18,000 human genes in 11 breast and 11 colorectal tumours. This project revealed that mutations in APC, TP53 and KRAS were related to colon cancer, whereas TP53 mutations were linked to breast cancer. Although this was not the first time these genes had been implicated, the study validated the existing literature on their role in specific cancers, thereby widening the scope of genomic sequencing projects for a better understanding of cancer mechanisms (Wheeler and Wang, 2013). One of the major cancer genomics projects, The Cancer Genome Atlas (TCGA), was initiated in 2006 through the joint efforts of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) (Ow et al., 2013). The major goal of this project was to study more than 20 different types of human cancer genomes using the latest available technology. The project also made its data available for public use. The National Cancer Institute played a crucial role in characterising the cancer genomes and identifying disease-related mutations. The genomic knowledge gained from these ongoing cancer projects will be crucial in bringing down the number of deaths associated with this disease.

Sickle Cell Anaemia

Sickle cell anaemia is an autosomal recessive disorder that occurs due to a mutation in the haemoglobin gene, resulting in sickle-shaped RBCs and chronic
organ damage (Makis, 2006; Puliyel, 2017). Although this disease is caused by a mutation in a single gene, its phenotypic effects vary widely, from completely asymptomatic and healthy patients to severely affected ones. To understand the genomics behind this disease, a comprehensive study was carried out in West African children (Quinlan et al., 2014), using cohorts of patients showing clinical variation together with their unaffected siblings. The study used whole genome genotyping and transcriptomics data to find the genetic factors linked with the variation observed in the pathogenesis of the disease. Through GWAS, the authors identified several linked SNPs and put together a map of the genes involved in the disease. These marker genes will be useful in identifying affected individuals in other parts of the world.

Lactose Intolerance

Lactose intolerance results from a deficiency of the metabolising enzyme lactase due to a mutation in the LCT gene, which is regulated by MCM6 (Swallow, 2003; Obermayer‐Pietsch et al., 2004). The initial diagnostic methods were mainly biochemical procedures, but after the HGP, genomic sequencing techniques also came into use (Mattar et al., 2012). The genetic tests are non-invasive and easy, focusing on the identification of genetic markers linked with the condition. These tests give a direct representation of the factors involved in the condition. Genetic studies will also help in understanding other associated disease conditions, such as osteoporosis (Obermayer‐Pietsch et al., 2004). Several clinical trials are ongoing to obtain a complete view of the mechanism behind this condition.

All these disease studies highlight the evolution of genomic medicine in the diagnosis and treatment of these conditions and mark the advent of the era of genomic medicine. However, the field of genomic medicine is also accompanied by certain challenges and opportunities, which are listed below:

1. One of the major difficulties in implementing genomic medicine is the epigenetic changes that accumulate in the body of an organism under changing environmental conditions; these processes need to be better understood.
2. Reproductive fitness is not always associated with improved health conditions (Rodriguez et al., 2014). Scientists have postulated that adaptation favours mutations that enhance reproductive fitness at the cost of health. A genotype that is now associated with a major disease etiology may have been beneficial in the past; e.g., increased levels of testosterone are linked to higher reproductive fitness while compromising the immune system of the organism.
3. Evolutionary processes are poorly understood, and therefore their significance in the molecular mechanisms behind diseases is still unexplored (Rodriguez et al., 2014).
4. Quality filtration and data management are another issue, given the rapidly decreasing cost of sequencing and the increasing volume of sequence data (MacArthur, 2012; Solomon, 2014). Various filtration tools can be developed to manage and check the quality of the data to be used for future reference.
5. The absence of a comprehensive reference database for all kinds of medical research makes it difficult to assign a mutation correctly to a disease condition without validation from several sources (MacArthur, 2012). The creation of a curated database storing all the genomic information, both genotypic and phenotypic, for all diseases can solve this issue.
6. Another challenge is the use of genomic markers and of drugs developed on the basis of them. Knowledge of the frequency of occurrence of these genomic markers is still incomplete, and therefore the mere presence of a marker cannot indicate the probability of occurrence of a disease (Hulot, 2010).
7. Since the field of genomic medicine is still in its infancy, the lack of funding and research limits the applicability of the techniques to developed nations and wealthy people (Hulot, 2010).
8. The existing healthcare facilities need to be modernised to incorporate advanced genomic medicine diagnostic tools and equipment, and databases storing the corresponding records need to be developed (Pokorska-Bocci et al., 2014).
9. Medical researchers' lack of knowledge of annotation tools and computational methods also stands as a hurdle to bridging the gap between genomics and medicine. The significance of bioinformatics in understanding diseases is not yet fully realised (Solomon, 2014; Steward et al., 2017).
10. Rare genetic disorders, which are also a major focus of personalised medicine, face a scarcity of sample data for conducting randomised clinical trials. The number of test subjects needs to be increased to obtain a complete view of these disorders (McCarthy et al., 2013).

Genomic Medicine and Its Financial Impact

To understand the clinical implications of mutations in the genome, several analyses need to be carried out. Although the cost of sequencing a person's genome has fallen to about 1000 US dollars, the cost of drawing meaningful inferences from the genome through its analysis is much higher. The technology of personalised medicine will only be a boon if it reaches the oppressed and underprivileged sections of society. This requires the intervention of various governmental and non-governmental funding agencies. The framework of any genome medicine project should include a financial component so that as much of the population as possible can benefit from this healthcare service. At the same time, the development of this technology will generate revenue, just as the Human Genome Project did. This field of medicine will employ thousands of researchers and create opportunities in several sectors. Once established, genomic medicine will not only improve our healthcare sector but also reduce the cost of current medical practice by averting disease onset. Genomic medicine will also lower medical expenses by reducing the harmful effects of treatments, avoiding surgical procedures in some cases, and increasing patient adherence to a single line of treatment (Lu and Cohen, 2015).

Ethical and Legal Issues

The major ethical issue in genomic medicine concerns the sharing or use of the genomic information or material generated and the obtaining of consent for personal genome sequencing, as in the case of the HeLa cell lines. Other issues include the patentability of DNA and the termination of pregnancy following prenatal disease diagnosis. Concern over misuse of the facilities and the overruling of existing principles of medicine has become a major issue in the implementation of personalised genomic medicine. Another social as well as legal issue arises with the disclosure of secondary data, i.e., should all variations or polymorphisms in genes that are not yet associated with any disease be reported (McCarthy, 2013)? If yes, then further consent is required from the individual for reproducing the data. If no, then are we not
ignoring the possibility of a probable disease marker? These concerns are worth debating and require judicial guidance. In addition, the security and encryption of data associated with personal genome sequencing need to be addressed to prevent data breaches and cybercrime; otherwise, this technology for the betterment of society will be lost in these loopholes.

Genome Medicine: Hope or Hype

In the genomic era, traditional clinical tests have undergone revolutionary changes and now include genomic information as part of clinical care, making diagnosis and therapy more effective. Thus, symptom-based therapeutic decisions are shifting towards decisions based on molecular mechanisms, and in many cases, e.g., cancer and neurological disorders, the molecular approach has become the only way of diagnosing the disease and devising a target-specific treatment. The various human genome projects of the past decade and a half, together with the phenomenal efforts of researchers, have succeeded in uncovering the molecular causes of certain diseases and have yielded better diagnostic tools and treatment methods. Although the search for drug targets and the clinical trials for various monogenic, polygenic and multifactorial disorders continue, the lack of proper population studies is becoming a barrier to their efficacy. Owing to variation in drug targets, specific groups do not respond to certain drugs, which can in turn prevent successful control of the disease. Thus, before implementation, drug targets need to be verified for specific ethnic or geographically adapted populations (Figure 3). Advanced sequencing methods and high-quality analytical tools may bring genome medicine to its fullest success.

Conclusion

Genomic medicine in today's world is like a tree sapling that needs to be nurtured through various advancements and challenges to become a full-fledged, fruitful tree. Years of research and hard work have made it possible to begin personalised medicine therapy in healthcare. Every day, new genomic information is being created and added to the pool of data, expanding the horizons of genomics, transcriptomics, and metabolomics
research. Genomic medicine techniques will reduce the burden on medical biotechnologists and help to negate the adverse effects of traditional or modern medicine.

Chapter 5

Bio-Inspired Computing

Raghav Mishra1, Tanishq Mandloi1 and Anjali Priyadarshini1,2,*

1Department of Biomedical Engineering, SRM University, Haryana, India
2Department of Biotechnology, SRM University, Haryana, India
Delhi-NCR, Rajiv Gandhi Education City, Sonepat, Haryana, India

Abstract

Bio-inspired computing is an approach to solving complex problems that draws on solutions and ideas from nature. Nature has a strong tendency towards self-optimization through evolution. Bio-inspired computing applies the same principles, inspired by the problem-solving skills of humans and other organisms, and aims to exhibit cognitive and inferential learning abilities. Neuromorphic engineering is one such approach: it uses very-large-scale integration (VLSI) systems containing electronic analog circuits to mimic the neuro-biological architectures of the nervous system. Just as our brains can predict the view of an object seen from another direction, bio-inspired computers can do the same using eye-like sets of sensors and a special processor that mimics the visual rendering of the human brain. One such approach to visual rendering is already represented by IBM's TrueNorth processors. Computing at this level requires contributions from all disciplines of science. This chapter presents various bio-inspired optimization algorithms, such as the Genetic Bee Colony algorithm, the Fish Swarm Optimization algorithm, Cat Swarm Optimization, the Whale Optimization algorithm, the Artificial Algae algorithm, the Elephant Search algorithm, the Chicken Swarm Optimization algorithm, Moth Flame Optimization, and the Grey Wolf Optimization algorithm. These algorithms are generally based on metaheuristic approaches. As technology advances, it continues to produce innovative ideas learned from nature, and the rapidly growing technical world is borrowing ideas from the different natural algorithms discussed in this chapter. Combining computing with natural and biological phenomena gives a better understanding of how natural phenomena are applied in bio-inspired computing. The world is full of biology; combined with technology, it holds great promise for the future of computing. Whether the inspiration comes from artificial intelligence, genetic algorithms, the immune system, or human life itself, this wide range of concepts is likely to change how people live. In this chapter, we discuss why computing needs to be built around nature and the applications of doing so.

* Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics
Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al.
ISBN: 979-8-88697-693-9
© 2023 Nova Science Publishers, Inc.

Keywords: P-Systems, optimization, algorithms, bio-inspired, Genetic Bee Colony (GBC) Algorithm, Cat Swarm Optimization (CSO), Artificial Algae Algorithm (AAA), Elephant Search Algorithm (ESA), Chicken Swarm Optimization (CSOA), Grey Wolf Optimization (GWO) Algorithm, neuromorphic engineering, metaheuristic approaches

Introduction

Nature is one of the most powerful forces in this world, and it inspires fascinating algorithms. Everything in the universe is connected, from macro to micro: from the gravitational forces that hold the universe together to the small electrostatic forces that hold atoms together. Another fascinating aspect is the intelligence that nature has given to living things so that they can survive, or be consumed by other animals to maintain the food chain. All the senses, including vision, audition, olfaction, gustation, and taction (touch), are gifts that nature has given to living beings. The data collected through these senses are processed by the brain to recognize activity in the surrounding environment. Here the complex neural circuit comes into play, both delivering and processing the information. Different organisms react differently to the same intensities of these sensations, which leads them to behave differently. Their behavior is well optimized by nature for their survival. Such well-optimized behavior can be mimicked and used in real-life calculations where more than one solution is required for a particular task.

Disease diagnosis and treatment are possible applications of biocomputing. For example, bacteria can be modified to become biocomputers capable of detecting and treating certain inflammatory diseases, including gut conditions such as inflammatory bowel disease (IBD). A portion of the bacterial DNA is used to determine whether the cell has met a specific chemical (in this case, one that indicates the presence of IBD). Remarkably, this technique requires only an IF/THEN test, simple logic gates, and fewer than three bits of memory. An AND gate requires both inputs to be true for the logic statement to return true. In this example, when the chemical is present, two sensors for IBD simultaneously activate the AND gate control region, turning on a gene that instructs the cell to produce an enzyme called luciferase. Luciferase produces light and leaves the body in fecal matter, where the glow can be detected with a microscope, indicating the presence or absence of IBD in an individual.

Bio-inspired computing is a research method aimed at solving problems using computer models based on the principles of biology and the natural world. Commonly seen as a philosophical approach, it is used in several related fields of study within computing rather than being a field of study in itself. Bio-inspired computing puts less focus on optimized, high-speed algorithms and more focus on tractability and dependability. Generally, the approach is ground-up: rather than taking a large foundation of knowledge and adding artificial intelligence to it, bio-inspired computing often starts from a small set of rules and builds upon them in training. Examples can often be found in artificial intelligence (AI), especially in machine learning, where the learning processes of biological organisms can be emulated.

Bio-inspired algorithms are stochastic search techniques developed to achieve near-optimal solutions to large-scale optimization problems. Conventional mathematical optimization methods often fail because their solutions become trapped in local optima, which gave rise to alternative derivative-free metaheuristic global optimization techniques. In recent times, bio-inspired algorithms have been explored for various applications, and their performance is as good as or better than that of conventional techniques. The focus of this work is to introduce the important bio-inspired techniques available in the literature. The list is not exhaustive, and new optimization algorithms inspired by natural processes are continually being developed. In this chapter, we will learn about some bio-inspired algorithms and their applications.


P-System

A P-system is a computational model based on the principles of biology. P-systems work by mimicking the biological cell, abstracting from the way in which chemicals interact and cross membranes. The concept was first introduced by Gheorghe Păun (1998), and the model takes its name from the first letter of his last name, “P.” The branch of research that develops modified P-systems is known as ‘membrane computing’.

Informational Description

A P-system is defined as a set of membranes containing chemicals, catalysts, and rules, where the rules regulate the chemical reactivity and the order in which the chemicals react to form different products. Rules may cause chemicals to pass through membranes or even cause membranes to dissolve. Just as in a biological cell, where a reaction occurs only when the chemicals come into contact (often with a catalyst), a reaction in a P-system takes place only when the objects meet each other and the rules permit it. This leads to a non-deterministic computation method that can yield different products when the same computation is repeated. When the P-system reaches a state where no further reactions are possible, the chemicals that have been passed out of the outermost membrane into the environment are considered the ‘result’ of the computation (Păun, 2006).

Components of the P-System

There are many modifications of the P-system, but all share the same components, which mimic the components of a biological cell.

The Environment

The environment constitutes the surroundings of the P-system. In its initial state, the system resembles a cell that holds various components. At the end of the computation, the objects found in the environment are the ‘result’ of the computation.


Membranes

Membranes are the main structure of the P-system. They contain objects (symbols and catalysts), a set of rules, and other embedded membranes. Like biological membranes, they are permeable and allow symbols produced by a rule to cross them. The outermost membrane is known as the ‘container membrane’ or ‘skin membrane’. Any membrane except the container membrane may ‘dissolve’, in which case its objects migrate to the membrane that contained it, while its rules are eliminated (Păun, 2002). Modified P-systems can have membranes that divide, carry a charge, or vary their permeability by changing membrane thickness.

Symbols

Symbols represent the chemicals that react with other chemicals to form products, guided by the rules present in the membrane. Symbols are represented by letters, so the contents of a membrane are written as a string. Some special symbols are used in special cases. A lowercase delta (δ) initiates the dissolving of a membrane; it appears only in the output of a rule and, once produced, always triggers the dissolution.

Catalysts

As the name suggests, catalysts work like biological enzymes and are never used up in a reaction. They are represented in the same manner as symbols and are simply a requirement for a reaction to happen.

Rules

Rules specify the chemical reactions: which input objects must be present and which products they form. Rules are embedded in membranes; if the required objects are present, the rule consumes them and produces its outputs. A membrane can hold more than one rule, in which case the most dominant (highest-priority) rule is carried out. If the most dominant rule cannot be applied, the next most dominant rule is carried out.
The output objects are handled by the rules in three distinct ways. By default, output objects remain in the same membrane as the rule that produced them; this is known as a here rule. Two modifiers, in and out, alter this default. With the in modifier, output objects are sent inwards to one of the current membrane’s children, chosen at random during the computation. With the out modifier, the object passes out of the current membrane into either its parent membrane or a sibling membrane, as fixed in the specification of the P-system.

Computation Process

The computation proceeds from an initial state toward a final state through a sequence of discrete steps. Each step iterates over all membranes in the system and applies their rules in a maximally parallel and non-deterministic manner (Păun, 2006). The computation halts when no further reactions can take place. At this point, the objects that have been passed into the environment are designated the ‘result’ (Păun, 2006).

Rule Application

The method of applying a rule within a membrane is as follows:
1. Assign symbols from the membrane’s content to the rule’s inputs.
2. If all inputs are satisfied, remove all assigned symbols from the membrane.
3. Create the output symbols and hold them until rule assignment has taken place for all membranes.
4. Add the output symbols to the targeted membranes.
5. Dissolve membranes as necessary.
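The five steps above can be sketched in Python for a single membrane. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `apply_rules_once` and the representation of rules as `(inputs, outputs)` string pairs are hypothetical, all rules are treated as here rules, and priorities and dissolution (step 5) are left out.

```python
import random
from collections import Counter

def apply_rules_once(contents, rules, rng=random):
    """One maximally parallel, non-deterministic step inside one membrane.

    contents: Counter of symbols, e.g. Counter("bbcccc")
    rules:    list of (inputs, outputs) string pairs, e.g. ("cc", "c")
    Assumption: every rule is a 'here' rule; no priorities, no dissolution.
    """
    remaining = Counter(contents)   # symbols not yet assigned to a rule
    produced = Counter()            # step 3: outputs are held back
    while True:
        # step 1: rules whose inputs can still be assigned
        applicable = [(i, o) for i, o in rules
                      if all(remaining[s] >= n for s, n in Counter(i).items())]
        if not applicable:
            break
        inputs, outputs = rng.choice(applicable)   # non-deterministic choice
        remaining -= Counter(inputs)               # step 2: consume inputs
        produced += Counter(outputs)
    return remaining + produced                    # step 4: release outputs

result = apply_rules_once(Counter("bbcccc"), [("b", "d"), ("cc", "c")])
```

Because outputs are held in `produced` until no rule can fire, a symbol created in this step cannot be consumed again in the same step.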

Non-Deterministic Application

The order in which rules are applied is chosen at random, so repeating the same computation can give different results. For example, if a membrane contains a single "a" symbol and the rules a → ab and a → aδ, there are two possible outcomes, but never both:


1. The membrane carries over to the next computation step containing one "a" and one "b", and one of the two rules is again randomly assigned to the "a" symbol.
2. The membrane dissolves under the rule a → aδ, and its contents fall into the containing membrane.

Maximally Parallel Application

This property of rule application requires that all rules that can be applied are applied in each step of the computation. For example, the rule a → aa doubles the number of "a" symbols in its containing membrane at each step, because the rule is applied to every occurrence of "a" present.
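The doubling effect of a → aa can be checked with a short Python sketch. The helper name `maximally_parallel_step` is hypothetical, and the membrane content is modeled simply as a string.

```python
def maximally_parallel_step(symbols, rule_in="a", rule_out="aa"):
    """Apply rule_in -> rule_out to every matching symbol in the same step."""
    return "".join(rule_out if s == rule_in else s for s in symbols)

state = "a"
history = [state]
for _ in range(4):
    state = maximally_parallel_step(state)
    history.append(state)

counts = [len(h) for h in history]   # number of "a" symbols after each step
```

Each entry of `counts` is twice the previous one, since no "a" is left unrewritten in any step.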

As a Computation Model

Most P-systems are computationally universal (Păun, 2006). This means that such a system can recognize or decide other data-manipulation rule sets (Freund et al., 2005). Some P-system variants can solve NP-complete problems in less than exponential time (Păun, 2006), and some are known to solve the satisfiability problem in linear time. It has been proven that any deterministic P-system can be simulated on a Turing machine in polynomial time (Păun, 2001).

Figure 1. An example of computation.

The figure above shows the initial state of a P-system. The structure of a P-system is hierarchical, so diagrams of P-systems generally resemble Venn diagrams or David Harel's higraphs.


The outermost membrane, 1, is the container membrane and contains one out rule. Membrane 2 contains four here rules, with cc → c taking priority over c → δ. Only when cc → c cannot be applied is the less dominant rule c → δ applied, upon which the membrane dissolves. Membrane 3 contains the set of symbols "ac" and three rules of type here. In the initial state there are no objects outside membrane 3, so no rules outside it can be applied.

Computation

Because the P-system is non-deterministic, its results can vary from run to run, depending on the path taken during each computation. The following is one possible path and the result it generates:

Step 1
From the initial configuration, only membrane 3 has any object content: "ac"
• "c" is assigned to c → cc
• "a" is assigned to a → ab

Step 2
Membrane 3 now contains: "abcc"
• "a" is assigned to a → bδ
• "c" is assigned to c → cc
• "c" is assigned to c → cc

Maximal parallelism is visible here: c → cc is applied twice in the same step. Non-determinism is visible as well, since a different rule is chosen for "a" than in step 1. Membrane 3 now dissolves, because the dissolve symbol (δ) has been produced, and all object content from this membrane passes into membrane 2.

Step 3
Membrane 2 now contains: "bbcccc"
• "b" is assigned to b → d
• "b" is assigned to b → d
• "cc" is assigned to cc → c
• "cc" is assigned to cc → c

Step 4
Membrane 2 now contains: "ddcc"
• "d" is assigned to d → de
• "d" is assigned to d → de
• "cc" is assigned to cc → c

Step 5
Membrane 2 now contains: "dedec"
• "d" is assigned to d → de
• "d" is assigned to d → de
• "c" is assigned to c → δ

Here the priority of cc → c over c → δ no longer blocks c → δ, because the inputs required by cc → c are no longer present. Membrane 2 now dissolves, and all object content passes to membrane 1.

Step 6
Membrane 1 now contains: "deedee"
• "e" is assigned to e → eout
• "e" is assigned to e → eout
• "e" is assigned to e → eout
• "e" is assigned to e → eout

Computation Halts
Membrane 1 now contains "dd" and, due to the out rule e → eout, the environment contains "eeee". At this point the computation halts, as no further assignments of objects to rules are possible. The result of the computation is four "e" symbols.


The only non-deterministic choices occurred during steps 1 and 2, when choosing where to assign the solitary "a" symbol. Consider the case where "a" is assigned to a → bδ during step 1: upon membrane 3 dissolving, only a single "b" and two "c" objects would exist, leading to the creation of only a single "e" object to eventually be passed out as the computation's result.
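The worked example can be replayed with a small Python sketch that tracks symbol counts instead of strings. It is an illustration under assumptions, not the authors' code: the function name `run_example` and its parameter `k` (the step at which a → bδ is chosen in membrane 3) are hypothetical, and the rule set is the one read off the steps above.

```python
def run_example(k):
    """Return the number of "e" symbols that reach the environment.

    k is the step in which a -> b(delta) is chosen inside membrane 3;
    a -> ab is assumed to be chosen in the k - 1 steps before it.
    """
    if k < 1:
        raise ValueError("membrane 3 needs at least one step")
    # Membrane 3 starts with "ac".
    b, c = 0, 1
    for _ in range(k):
        c *= 2      # c -> cc on every c (maximal parallelism)
        b += 1      # both a-rules emit one b per step
    # Membrane 3 dissolves; b and c objects fall into membrane 2.
    d = e = 0
    dissolved = False
    while not dissolved:
        e += d              # d -> de: each existing d emits one e
        d, b = d + b, 0     # b -> d
        if c >= 2:
            c //= 2         # cc -> c has priority over c -> delta
        else:
            c = 0           # c -> delta dissolves membrane 2
            dissolved = True
    # Membrane 1: e -> e_out sends every e to the environment.
    return e

results = [run_example(k) for k in (1, 2, 3)]
```

With `k = 2` the sketch follows the path traced above and returns four; with `k = 1` it reproduces the single "e" case. Under these rules the output grows as the square of `k`.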

Bio-Inspired Swarm Optimization Algorithms

Genetic Bee Colony (GBC) Algorithm

GBC is an optimization algorithm that integrates the advantages of Genetic Algorithms (GA) and the Artificial Bee Colony (ABC) algorithm for optimizing numerical problems. In the ABC algorithm (Nseef et al., 2016) the colony is divided into three types of artificial bees: employed bees, onlooker bees, and scout bees. The basic ABC has the following steps (Magalhaes-Mendes, 2013; Celal et al., 2015).

Setting ABC Parameters

The main parameters of the algorithm should be initialized first: the population size (PS), that is, the number of solutions; the number of bees, which is taken to be twice PS; and the limit parameter (L).

Initialization of the Population of Solutions

PS solutions are generated randomly by the following equation (1):

$u_{i,j} = u_j^{\min} + \mathrm{rand}[0,1]\,(u_j^{\max} - u_j^{\min})$    (1)

where i is the solution index, j is the index of the decision variable, rand[0,1] generates a random value between 0 and 1, and $u_j^{\min}$ and $u_j^{\max}$ denote the lower and upper limits of the j-th decision variable.
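A minimal Python sketch of this initialization follows, assuming each decision variable is drawn uniformly between its lower and upper limit as equation (1) describes. The function name `init_population` and the bounds used in the example are illustrative only.

```python
import random

def init_population(ps, lower, upper, rng=random):
    """Draw ps solutions uniformly between per-variable lower and upper limits."""
    return [[lo + rng.random() * (hi - lo) for lo, hi in zip(lower, upper)]
            for _ in range(ps)]

# Hypothetical bounds for a two-variable problem.
lower, upper = [0.0, -1.0], [1.0, 1.0]
population = init_population(5, lower, upper, random.Random(0))
```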


Evaluation of the Population Solutions

The objective function is used to evaluate the generated solutions.

Employed Bee

Each employed bee has the task of finding a new food source in its surrounding area. A food source is evaluated by the amount of nectar it contains, and the bee memorizes whichever source has more nectar: its current one or the newly detected one. A neighborhood solution v is obtained by modifying the i-th solution $u_i$ as in the following equation (2):

$v_{i,j} = u_{i,j} + \theta_{i,j}\,(u_{i,j} - u_{k,j})$    (2)

where k is the index of a solution selected randomly from the PS solutions and $\theta_{i,j}$ is a random number in [−1, 1].
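Equation (2) can be sketched as follows. Two details are assumptions that the text does not state: the partner index k is drawn so that it differs from i, and only one randomly chosen decision variable j is perturbed, as is common in ABC implementations. The function name `neighbour` is hypothetical.

```python
import random

def neighbour(population, i, rng=random):
    """Return a neighbour of solution i following equation (2)."""
    # Assumption: partner k differs from i.
    k = rng.choice([idx for idx in range(len(population)) if idx != i])
    # Assumption: a single decision variable j is modified.
    j = rng.randrange(len(population[i]))
    theta = rng.uniform(-1.0, 1.0)
    v = list(population[i])
    v[j] = population[i][j] + theta * (population[i][j] - population[k][j])
    return v

pop = [[0.0, 0.0], [1.0, 1.0]]
candidate = neighbour(pop, 0, random.Random(1))
```

The employed bee would then keep `candidate` only if it evaluates better than the current solution.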

Onlooker Bee

Onlooker bees use the information gathered by the employed bees to exploit food sources in the same neighborhood, and promising sources can be marked. Both onlooker and employed bees try to improve their solutions by exploring their neighborhoods using equation (2). Onlooker bees select solutions according to their fitness values (fit), with the probability given by the following equation (3):

$p_i = \dfrac{fit_i}{\sum_{j=1}^{PS} fit_j}$    (3)
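The selection probabilities of equation (3) amount to fitness-proportional (roulette-wheel) selection. A minimal sketch, assuming all fitness values are positive; the function names are illustrative.

```python
import random

def selection_probabilities(fit):
    """p_i = fit_i / sum_j fit_j, as in equation (3)."""
    total = sum(fit)
    return [f / total for f in fit]

def pick_solution(fit, rng=random):
    """Index chosen by an onlooker bee, proportional to fitness."""
    return rng.choices(range(len(fit)), weights=selection_probabilities(fit))[0]

probs = selection_probabilities([1.0, 1.0, 2.0])
chosen = pick_solution([1.0, 1.0, 2.0], random.Random(0))
```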

Scout Bee

When an employed bee senses the presence of food in the environment, it identifies the new food source and becomes a scout bee. The number of scout bees is controlled by a parameter called limit. The goal of a scout is to find a new food source when the current source cannot be improved. This process is carried out through exploitation and exploration of the search space.

Genetic Operators

Because the basic ABC algorithm is not well suited to binary optimization, genetic operators such as swap and crossover are integrated to solve binary optimization problems. Equations (1) and (2) are replaced accordingly, and the initial solutions are generated by the following equation (4) instead of equation (1):

$u_{i,j} = \begin{cases} 0, & \text{if } G(0,1) \le 0.5 \\ 1, & \text{if } G(0,1) > 0.5 \end{cases}, \quad i = 1, \ldots, SN$    (4)

where G(0,1) is a uniformly generated value. The search mechanism of the basic ABC algorithm and GA are integrated in the neighborhood in the following four steps:
1. In the neighborhood of the current food source, randomly select two other food sources from the population.
2. Apply the first operator, the two-point crossover operator, among the current, the two neighborhood, the best, and the zero food sources to generate children food sources.
3. Apply the second operator, the swap operator, to the children food sources to generate grandchildren food sources.
4. Select the best food source among those obtained as the neighborhood food source.
In this way, the performance of the basic ABC algorithm can be improved on binary optimization problems.
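The binary initialization of equation (4) and the two genetic operators can be sketched as below. This is a generic illustration of two-point crossover and swap on bit lists, not the exact operator schedule of GBC; all function names are hypothetical.

```python
import random

def random_binary_solution(n, rng=random):
    """Equation (4): each bit is 0 if G(0,1) <= 0.5, otherwise 1."""
    return [0 if rng.random() <= 0.5 else 1 for _ in range(n)]

def two_point_crossover(p1, p2, rng=random):
    """Exchange the segment between two random cut points."""
    a, b = sorted(rng.sample(range(len(p1) + 1), 2))
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

def swap(solution, rng=random):
    """Exchange the bits at two random positions."""
    i, j = rng.sample(range(len(solution)), 2)
    child = list(solution)
    child[i], child[j] = child[j], child[i]
    return child

zero_source, ones_source = [0] * 6, [1] * 6
child1, child2 = two_point_crossover(zero_source, ones_source, random.Random(3))
grandchild = swap([1, 0, 0, 1, 0], random.Random(3))
```

Crossover against the all-zero source, as in step 2, tends to switch bits off, while swap keeps the number of ones unchanged.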

Cat Swarm Optimization (CSO)

The CSO algorithm was developed by mimicking the natural behavior of cats (Pradhan et al., 2012). The algorithm has two modes: seeking mode and tracing mode. The seeking mode models the resting period, in which cats are still but alert, while the tracing mode is used as the local search method to obtain the optimal solution to the problem.

Seeking Mode

The seeking behavior is based on four factors: the seeking memory pool (SMP), which defines the size of the seeking memory; the seeking range of the selected dimension (SRD), which defines the minimum and maximum of the seeking range; the counts of dimension to change (CDC), which gives the number of dimensions that can be changed in seeking mode; and the self-position consideration (SPC), which is a Boolean-valued variable. In addition, the mixture ratio (MR) sets how the population is split between the two modes and is chosen so that cats spend most of their time resting and observing (Seyedali, 2016). The seeking process is briefly described below:
1. Randomly select a fraction of the population, as set by MR, to be seeking cats.
2. Make SMP copies of the ith cat.
3. Update the position of each copy by randomly adding or subtracting an SRD fraction of the current position value, and replace the old values.
4. Evaluate the fitness of all copies.
5. Calculate the selection probability of each candidate copy and choose the best one to take the place of the ith seeking cat.
6. Repeat from step 2 until all seeking cats have been processed.
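Steps 2 to 5 for a single seeking cat can be sketched as follows. The sketch makes several assumptions: the problem is a minimization, the best copy is picked directly rather than by a probability draw, SPC is ignored, and the default parameter values are arbitrary. The function name `seek` is hypothetical.

```python
import random

def seek(position, fitness, smp=5, srd=0.2, cdc=2, rng=random):
    """Return the best of smp perturbed copies of one cat (minimization)."""
    copies = []
    for _ in range(smp):                                  # step 2: make copies
        copy = list(position)
        dims = rng.sample(range(len(position)), min(cdc, len(position)))
        for d in dims:                                    # step 3: +/- SRD fraction
            sign = rng.choice((-1.0, 1.0))
            copy[d] = copy[d] * (1.0 + sign * srd)
        copies.append(copy)
    return min(copies, key=fitness)                       # steps 4 and 5

def sphere(x):
    return sum(v * v for v in x)

start = [1.0, 2.0]
moved = seek(start, sphere, rng=random.Random(0))
```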

The Tracing Mode

This mode is the exploration mode of the optimization. In this phase the cat has high energy, and its quick chase can be modeled mathematically as a change in position. The position and velocity of the ith cat in the D-dimensional space are defined by equation (5):

$P_i = (P_{i1}, P_{i2}, \ldots, P_{iD}), \quad V_i = (V_{i1}, V_{i2}, \ldots, V_{iD}), \quad 1 \le d \le D$    (5)

The best position of the cat swarm is defined by equation (6):

$X_{gbp} = (X_{gbp1}, X_{gbp2}, \ldots, X_{gbpD})$    (6)

Steps involved in the tracing mode:
i. Calculate the new velocity of the ith cat using equation (7):

$V_{id} = iw \cdot V_{id} + ac \cdot rn \cdot (X_{gbpd} - X_{id})$    (7)

where iw is the inertia weight, ac is the acceleration constant, and rn is a random number in the interval [0, 1]. The global best is selected randomly from the external archive.
ii. Calculate the updated position of the ith cat using equation (8):

$X_{id} = X_{id} + V_{id}$    (8)

iii. If a dimension of the new position falls outside the search range, set it to the corresponding boundary value.
iv. Evaluate the fitness of each cat.
v. Finally, update the contents of the archive with the positions of the cats.
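Equations (7) and (8), together with the boundary handling of step iii, can be sketched for one cat as follows. The default values for the inertia weight, acceleration constant, and bounds are placeholders rather than recommended settings, a fresh random number is drawn per dimension (an assumption), and the function name `trace` is hypothetical.

```python
import random

def trace(x, v, gbest, iw=0.7, ac=2.0, lower=-10.0, upper=10.0, rng=random):
    """One tracing-mode update of a cat's position x and velocity v."""
    new_x, new_v = [], []
    for xd, vd, gd in zip(x, v, gbest):
        rn = rng.random()
        vd = iw * vd + ac * rn * (gd - xd)   # equation (7)
        xd = xd + vd                         # equation (8)
        xd = min(max(xd, lower), upper)      # step iii: clamp to the boundary
        new_x.append(xd)
        new_v.append(vd)
    return new_x, new_v

still_x, still_v = trace([1.0, 2.0], [0.0, 0.0], [1.0, 2.0])
clamped_x, _ = trace([9.0], [5.0], [9.0], iw=1.0)
```

A cat that already sits on the global best with zero velocity does not move, and a step that would leave the search range is clipped to the boundary.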

Artificial Algae Algorithm (AAA)

AAA is a recent algorithm that mimics the lifestyle and behavior of microalgae (Phelps et al., 2010). It is inspired by algal tendencies, movement, evolution, reproduction, and adaptation to the environment, which give the algorithm three main processes: the evolutionary process, helical movement, and adaptation. The algorithm operates on a population of algal colonies. If the algal cells in a colony get enough light, they grow and eventually increase the size of the colony. If there is a lack of light, the algal cells cannot grow. In the helical movement, therefore, the algal cells move toward regions with more light, where the colonies can grow better.


To describe the main process of AAA, let $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, where $i = 1, 2, \ldots, n$, describe a solution in the search space. The population of algal colonies is represented by the following matrix [A]:

$\mathrm{PAC} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$    [A]

Let the size of the ith algal colony be $S_i$, where $i = 1, 2, \ldots, n$, and let the objective function be $f(x_i)$. $S_i$ is updated according to the following equations (9, 10, 11):

$S_i = \mathrm{size}(x_i)$    (9)

$\mu_i = \dfrac{S_i + 4 f(x_i)}{S_i + 2 f(x_i)}$    (10)

$S_i^{t+1} = \mu_i S_i^t, \quad i = 1, 2, \ldots, n$    (11)

where $\mu_i$ is the update coefficient of $S_i$ and t denotes the current generation.
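The size update can be sketched in a few lines of Python, taking the update coefficient as (S + 4f) / (S + 2f), which is the form printed in equation (10), and multiplying the current size by it as in equation (11). The function name `grow` is hypothetical, and the sketch assumes the denominator is non-zero.

```python
def grow(size, f_value):
    """Return the colony size for the next generation."""
    mu = (size + 4.0 * f_value) / (size + 2.0 * f_value)   # equation (10)
    return mu * size                                       # equation (11)

unchanged = grow(2.0, 0.0)   # f = 0 gives mu = 1, so the size stays the same
larger = grow(2.0, 1.0)      # mu = 6 / 4 = 1.5
```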

Helical Movement Phase

The movement of the algal colony in three dimensions is represented by the following equations:

$X_{ih}^{t+1} = X_{ih}^{t} + (X_{jh}^{t} - X_{ih}^{t})(sf - \sigma_i)\,p$    (12)

$X_{ik}^{t+1} = X_{ik}^{t} + (X_{jk}^{t} - X_{ik}^{t})(sf - \sigma_i)\cos\alpha$    (13)

$X_{il}^{t+1} = X_{il}^{t} + (X_{jl}^{t} - X_{il}^{t})(sf - \sigma_i)\sin\beta$    (14)


where equation (12) describes the movement in one dimension, say x, and equations (13) and (14) describe the movement in the two other dimensions, say y and z. Here k, h, and l are random integers uniformly generated between 1 and d; $X_{ih}$, $X_{ik}$, and $X_{il}$ simulate the x, y, and z coordinates of the ith algal colony; j is the index of a neighboring algal colony; p is an independent random number in (−1, 1); α and β are random angles between 0 and 2π; sf is the shear force; and $\sigma_i$ is the friction surface area of the ith algal colony, calculated by the following equations (15, 16):

$\sigma_i = 2\pi r_i^2$    (15)

$r_i = \sqrt[3]{\dfrac{3 S_i}{4\pi}}$    (16)

where $r_i$ is the radius of the hemisphere of the ith algal colony and $S_i$ is its size.
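The helical movement can be sketched in Python as below. Several points are assumptions rather than statements from the text: the radius is taken as that of a sphere whose volume equals the colony size, the three dimensions h, k, and l are drawn without repetition (so at least three dimensions are needed), and the default shear force is an arbitrary placeholder. The function names are hypothetical.

```python
import math
import random

def friction_surface(size):
    """Friction surface 2*pi*r^2, with r assumed to be cbrt(3*size / (4*pi))."""
    r = (3.0 * size / (4.0 * math.pi)) ** (1.0 / 3.0)
    return 2.0 * math.pi * r * r

def helical_move(xi, xj, size_i, shear_force=2.0, rng=random):
    """Move colony xi toward neighbour xj in three randomly chosen dimensions."""
    h, k, l = rng.sample(range(len(xi)), 3)          # assumption: distinct dims
    step = shear_force - friction_surface(size_i)
    p = rng.uniform(-1.0, 1.0)
    alpha = rng.uniform(0.0, 2.0 * math.pi)
    beta = rng.uniform(0.0, 2.0 * math.pi)
    new = list(xi)
    new[h] = xi[h] + (xj[h] - xi[h]) * step * p                 # equation (12)
    new[k] = xi[k] + (xj[k] - xi[k]) * step * math.cos(alpha)   # equation (13)
    new[l] = xi[l] + (xj[l] - xi[l]) * step * math.sin(beta)    # equation (14)
    return new

colony = [1.0, 2.0, 3.0]
same_place = helical_move(colony, colony, 1.0)
```

Larger colonies have a larger friction surface, so the factor `step` shrinks and they move less per iteration.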

Evolutionary Process Phase

When the algal population ($X_i$) moves toward a more optimal area, it becomes bigger. The following equations (17, 18, 19) describe the simulation of this process:

$Biggest = \arg\max\{X_i\}, \quad i = 1, 2, \ldots, n$    (17)

$Smallest = \arg\min\{X_i\}, \quad i = 1, 2, \ldots, n$    (18)

$Smallest_j = Biggest_j, \quad j = 1, 2, \ldots, d$    (19)

where Biggest and Smallest denote the biggest and smallest algal colonies, and j is the randomly selected index of an algal cell.
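A minimal sketch of this phase, assuming that the biggest and smallest colonies are identified by a separate list of colony sizes and that a single randomly chosen cell is copied per call. The function name `evolve` is hypothetical, and the colony list is modified in place.

```python
import random

def evolve(colonies, sizes, rng=random):
    """Copy one random cell of the biggest colony into the smallest colony."""
    biggest = max(range(len(colonies)), key=lambda i: sizes[i])    # eq. (17)
    smallest = min(range(len(colonies)), key=lambda i: sizes[i])   # eq. (18)
    j = rng.randrange(len(colonies[0]))
    colonies[smallest][j] = colonies[biggest][j]                   # eq. (19)
    return smallest, j

colonies = [[1, 1], [5, 5], [9, 9]]
sizes = [3, 1, 7]
s, j = evolve(colonies, sizes, random.Random(0))
```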

Adaptation Phase

An algal colony that is not growing sufficiently can adapt itself to the surrounding environment. After each movement, the value of the objective function is judged as inferior or superior to its previous value. After the movement of the algal colonies is complete, the colony with the highest starvation value, as described by equation (20), adapts itself to the biggest algal colony with adaptation probability Ap.

$X_s = \arg\max\{starvation(X_i)\}, \quad i = 1, 2, \ldots, n$    (20)

(20)

The adaptation of the algal colony is described by the following equation (21):

$X_{sj}^{t+1} = \begin{cases} X_{sj}^{t} + (Biggest_j - X_{sj}^{t}) \cdot Rand1, & \text{if } Rand2 < Ap \\ X_{sj}^{t}, & \text{otherwise} \end{cases}, \quad j = 1, 2, \ldots, d$    (21)

where s is the index of the algal colony with the highest starvation value, starvation($X_i$) measures the starvation level of algal colony $X_i$, j is the index of the algal cell, Rand1 and Rand2 are random values between 0 and 1, and Ap is the adaptation probability, for which values between 0.3 and 0.7 are proposed.
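The adaptation step can be sketched as follows. Whether Rand1 and Rand2 are drawn once per colony or once per cell is not clear from the text; the sketch assumes one Rand2 draw per colony and a fresh Rand1 per cell. The function name `adapt` is hypothetical, and the colony list is modified in place.

```python
import random

def adapt(colonies, starvation, biggest, ap=0.5, rng=random):
    """Move the most starved colony toward the biggest one with probability ap."""
    s = max(range(len(colonies)), key=lambda i: starvation[i])   # equation (20)
    if rng.random() < ap:                                        # Rand2 < Ap
        colonies[s] = [x + (b - x) * rng.random()                # equation (21)
                       for x, b in zip(colonies[s], biggest)]
    return s

moved = [[0.0, 0.0], [4.0, 4.0]]
index = adapt(moved, [2, 1], [10.0, 10.0], ap=1.0)

kept = [[0.0, 0.0], [4.0, 4.0]]
adapt(kept, [2, 1], [10.0, 10.0], ap=0.0)
```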

Elephant Search Algorithm (ESA)

As the name suggests, this algorithm is based on the characteristics and behavior of elephants. ESA mimics the main features of a herd of elephants. Males generally live in isolation, while females prefer to live in family groups (Suash et al., 2016). Female elephants are responsible for refining the search locally, while male elephants are responsible for exploration. ESA therefore has three main characteristics as a search optimization algorithm: (i) the search process iteratively refines the solution toward the optimum; (ii) chief female elephants lead intensive local searches in places where the probability of finding the best solution is expected to be higher; (iii) male elephants explore beyond the local optima. Several features of elephants' biological behavior make them a useful source of inspiration (Gai-Ge et al., 2015; Xian-Bing et al., 2016). The ESA is described as follows.


Elephants live together under the leadership of the oldest member, so each elephant belongs to a clan cli. The position of elephant j in clan cli is updated according to the following equation (22):

$X_{new,cli,j} = X_{cli,j} + c \cdot (X_{Best,cli} - X_{cli,j}) \cdot r$    (22)

where $X_{new,cli,j}$ and $X_{cli,j}$ are the newly updated and old positions of elephant j in clan cli, respectively; $c \in [0,1]$ is a factor that determines the influence of the clan leader on $X_{cli,j}$; $X_{Best,cli}$ represents the fittest elephant in clan cli; and $r \in [0,1]$. When $X_{cli,j} = X_{Best,cli}$, equation (22) cannot be used, and the fittest elephant is updated according to the following equation (23):

$X_{new,cli,j} = \alpha X_{center,cli}$    (23)

where $\alpha \in [0,1]$ represents the influence of $X_{center,cli}$ on $X_{new,cli,j}$. The d-th dimension of the clan center $X_{center,cli}$ is computed by the following equation (24):

$X_{center,cli,d} = \dfrac{1}{n_{cli}} \sum_{j=1}^{n_{cli}} X_{cli,j,d}$    (24)

where $1 \le d \le D$ indicates the d-th dimension, D is the total number of dimensions, $n_{cli}$ is the number of elephants in clan cli, and $X_{cli,j,d}$ is the d-th dimension of the elephant individual $X_{cli,j}$. As mentioned, adult male elephants leave their families and live alone in isolated areas. This behavior is simulated by a separating operator when solving complex optimization problems. To improve the search ability of ESA, the elephant individual with the worst fitness in each clan applies the separating operator according to the following equation (25):

$X_{Worst,cli} = X_{Min} + (X_{Max} - X_{Min} + 1) \cdot Rand$    (25)

where $X_{Max}$ and $X_{Min}$ are the upper and lower bounds of the position of an elephant individual, $X_{Worst,cli}$ is the worst elephant individual in clan cli, and $Rand \in [0,1]$ is drawn from a stochastic distribution. ESA thus consists of the clan updating operator and the separating operator.
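The clan updating and separating operators of equations (22) to (25) can be sketched as below. The sketch assumes a minimization problem, draws one random number r per elephant, and uses arbitrary default values for c and alpha; the function names `update_clan` and `separate` are hypothetical.

```python
import random

def update_clan(clan, fitness, c=0.5, alpha=0.5, rng=random):
    """Return new positions for one clan (lower fitness is better)."""
    best = min(clan, key=fitness)
    n, dim = len(clan), len(clan[0])
    center = [sum(x[d] for x in clan) / n for d in range(dim)]   # equation (24)
    new_clan = []
    for x in clan:
        if x is best:
            new_clan.append([alpha * cd for cd in center])       # equation (23)
        else:
            r = rng.random()
            new_clan.append([xd + c * (bd - xd) * r              # equation (22)
                             for xd, bd in zip(x, best)])
    return new_clan

def separate(clan, fitness, x_min, x_max, rng=random):
    """Replace the worst elephant by a random position, equation (25)."""
    worst = max(range(len(clan)), key=lambda i: fitness(clan[i]))
    clan[worst] = [x_min + (x_max - x_min + 1) * rng.random() for _ in clan[worst]]
    return worst

def sphere(x):
    return sum(v * v for v in x)

clan = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
updated = update_clan(clan, sphere)
worst_index = separate(clan, sphere, -5.0, 5.0)
```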


Chicken Swarm Optimization (CSO)

Bio-inspired metaheuristic algorithms have shown the ability to solve many optimization problems. They exploit the tolerance of optimization problems for imprecision and uncertainty, and can achieve acceptable solutions at low computing cost. Chickens are among the most widespread domestic animals, kept primarily as a source of food along with their eggs. Domestic chickens are gregarious birds and live together in flocks. They are cognitively sophisticated and can recognize over 100 individuals even after several months of separation. They communicate with over 30 distinct sounds, ranging from clucks and cackles to chirps and cries, which carry a lot of information related to nesting, food discovery, mating, and danger. Besides learning through trial and error, chickens also learn from their own previous experience and that of others when making decisions, just as humans do. A hierarchical order plays a significant role in the social lives of chickens. The preponderant chickens in a flock dominate the weak. The more dominant hens remain near the head roosters, while the more submissive hens and roosters stand at the periphery of the group. Removing chickens from or adding chickens to an existing group causes a temporary disruption of the social order until a specific hierarchical order is re-established. The hierarchy of the chicken swarm is given in the figure below.

Figure 2. Hierarchical order in chicken swarm.


Behavioral Understanding

The order of dominance within the swarm is as follows. Roosters call their group-mates to eat first when they find food. The same gracious behavior exists in hens raising their chicks. Roosters emit a loud call when they observe chickens from a different group invading their territory. The head rooster searches for food and fights chickens that invade the territory the group inhabits. The dominant chickens stay close to the head roosters when foraging for food, while the submissive ones reluctantly stand at the periphery of the group to search for food. There is competition between different chickens. Individually, each chicken is too simple to cooperate with the others. Taken as a swarm, however, they may coordinate themselves as a team to search for food under a specific hierarchical order. This swarm intelligence can be associated with the objective problem to be optimized, and it inspired the design of a new algorithm. The chickens’ behavior can be idealized by the following rules.
1. The chicken swarm contains several groups. Each group comprises a dominant rooster, a couple of hens, and chicks.
2. The identity of each chicken (rooster, hen, or chick) depends on its fitness value. The chickens with the best several fitness values act as roosters, those with the worst several fitness values are designated as chicks, and the others are hens. The mother-child relationship between the hens and the chicks is established randomly.
3. The hierarchical order, dominance relationship, and mother-child relationship in a group remain unchanged and are updated only every several (G) time steps.
4. Chickens follow their group-mate rooster in the search for food, and may prevent others from eating their own food. The chicks search for food around their mother (the hen).
Obviously, the dominant individuals have an advantage in the competition for food.

Mathematical Understanding

Let RN, HN, CN, and MN denote the number of roosters, hens, chicks, and mother hens, respectively. The best RN chickens would be assumed to be

Bio-Inspired Computing

145

roosters, while the worst CN ones would be regarded as chicks. The remaining are treated as hens. All N virtual chickens, depicted by their positions $x_{i,j}^{t}$ ($i \in [1, \ldots, N]$, $j \in [1, \ldots, D]$) at time step t, search for food in a D-dimensional space. In this work, the optimization problems are minimization problems. Thus, the best RN chickens correspond to the ones with the RN smallest fitness values. A hen updates its position as follows:

$$x_{i,j}^{t+1} = x_{i,j}^{t} + S_1 \cdot Rand \cdot (x_{r_1,j}^{t} - x_{i,j}^{t}) + S_2 \cdot Rand \cdot (x_{r_2,j}^{t} - x_{i,j}^{t})$$

$$S_1 = \exp\left(\frac{f_i - f_{r_1}}{|f_i| + \varepsilon}\right) \tag{26}$$

$$S_2 = \exp(f_{r_2} - f_i) \tag{27}$$

where Rand is a uniform random number over [0, 1], $r_1 \in [1, \ldots, N]$ is the index of the rooster that is the i-th hen's group-mate, and $r_2 \in [1, \ldots, N]$ is the index of a chicken (rooster or hen) chosen randomly from the swarm, with $r_1 \neq r_2$ (Eqs. 26, 27). Obviously, $f_i > f_{r_1}$ and $f_i > f_{r_2}$, thus $S_2 < 1 < S_1$. [...] > 1 and when $A$ < 1 converge towards each other so as to attack. Randomness helps to avoid getting trapped in local minima.
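As a minimal sketch of the hen update in Eqs. (26) and (27), the following Python functions apply the rule to a single hen. The function names, the default value of the small constant `eps`, and the choice to draw an independent Rand for every dimension are assumptions made for illustration; they are not prescribed by the formulation above.

```python
import math
import random


def hen_coefficients(f_i, f_r1, f_r2, eps=1e-12):
    """Return (S1, S2) from Eqs. (26)-(27) for minimization problems."""
    s1 = math.exp((f_i - f_r1) / (abs(f_i) + eps))
    s2 = math.exp(f_r2 - f_i)
    return s1, s2


def hen_update(x_i, x_r1, x_r2, f_i, f_r1, f_r2, eps=1e-12, rng=random):
    """One position update of a hen.

    x_i, x_r1, x_r2: positions (lists of floats) of the hen, its group
    rooster, and a randomly chosen other chicken; f_*: their fitness values.
    """
    s1, s2 = hen_coefficients(f_i, f_r1, f_r2, eps)
    return [
        xi + s1 * rng.random() * (a - xi) + s2 * rng.random() * (b - xi)
        for xi, a, b in zip(x_i, x_r1, x_r2)
    ]
```

When the hen's fitness is worse (larger) than that of both reference chickens, `hen_coefficients` yields S2 < 1 < S1, so the pull toward the group rooster is stronger than the pull toward the randomly chosen chicken.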

Moth–Flame Optimization (MFO) Algorithm

MFO is a population-based metaheuristic developed by Mirjalili in 2015. It imitates the navigation technique that moths use at night, called "transverse orientation for navigation." Moths flying at night rely on the moonlight and maintain a fixed angle to it to find their path. This behavior has been observed and formulated as a novel optimization technique. MFO combines a population-based algorithm with a local search strategy to yield an algorithm capable of global exploration as well as local exploitation. Like other metaheuristics, MFO is simple, flexible, and easy to apply, so it can be used to solve a wide range of problems. Owing to these merits, MFO has been successfully applied to various optimization problems, for instance:

• Scheduling,
• Inverse problem and parameter estimation,
• Classification,
• Economic,
• Medical,
• Power energy, and
• Image processing.

In nature, over 160,000 species of moths have been documented. Their life cycle resembles that of butterflies: a moth has two main life stages, larva and adult, and the larva is converted into a moth inside a cocoon. The most interesting fact about moths is their special method of navigating at night. They have evolved to fly at night using the moonlight, employing a mechanism called "transverse orientation for navigation." This mechanism allows the moth to fly while preserving a stable angle with respect to the moon, which has proved to be a very effective mechanism for traveling long distances in a straight path.

Figure 4. Transverse orientation of the moth.

Since the moon is far away from the moth, this mechanism guarantees flying in a straight line. Humans can use the same navigation method. Suppose that the moon is in the south side of the sky and a person wants to go east. If the person keeps the moon on the right-hand side while walking, he or she will move toward the east in a straight line. It can be observed, however, that moths do not always travel in a straight path; they fly spirally around lights. This is because transverse orientation is efficient only when the light source is very far away (moonlight). With a nearby artificial light, moths still attempt to preserve the same angle with the light source and consequently move in spiral paths around it.

Algorithm

As said earlier, the MFO algorithm was proposed by Mirjalili and belongs to the population-based metaheuristic algorithms. As shown above in Figure 4, MFO starts by generating moths randomly within the solution space, then calculates the fitness value of each moth at its position and tags the best positions with flames. After that, it updates the moths' positions with a spiral movement function to reach better positions tagged by flames, updates the new best individual positions, and repeats these steps (i.e., updating the moths' positions and generating new positions) until the termination criteria are met. Table 1 lists the characteristics of MFO. The MFO algorithm has three main steps, which are described below. The pseudocode of MFO is shown in Algorithm 1, and the characteristics and parameter settings are summarized in Table 1(a) and (b).

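Generating the Initial Population of Moths

Mirjalili assumed that each moth can fly in 1-D, 2-D, 3-D, or hyper-dimensional space. The set of moths can be expressed as:

$$M = \begin{bmatrix} m_{1,1} & m_{1,2} & \cdots & \cdots & m_{1,d} \\ m_{2,1} & m_{2,2} & \cdots & \cdots & m_{2,d} \\ \vdots & \vdots & & & \vdots \\ m_{n,1} & m_{n,2} & \cdots & \cdots & m_{n,d} \end{bmatrix} \tag{36}$$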

where n refers to the number of moths and d refers to the number of dimensions in the solution space. The fitness values of all moths are stored in an array as follows:

$$OM = \begin{bmatrix} OM_1 \\ OM_2 \\ \vdots \\ OM_n \end{bmatrix} \tag{37}$$

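The remaining elements in the MFO algorithm are flames. The following matrix shows the flames in the d-dimensional space, followed by the vector of their fitness values:

$$F = \begin{bmatrix} F_{1,1} & F_{1,2} & \cdots & \cdots & F_{1,d} \\ F_{2,1} & F_{2,2} & \cdots & \cdots & F_{2,d} \\ \vdots & \vdots & & & \vdots \\ F_{n,1} & F_{n,2} & \cdots & \cdots & F_{n,d} \end{bmatrix} \tag{38}$$

$$OF = \begin{bmatrix} OF_1 \\ OF_2 \\ \vdots \\ OF_n \end{bmatrix} \tag{39}$$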


Table 1(a). Characteristic of the MFO algorithm

Table 1(b). Summary of parameter settings of the Moth-Flame Optimization algorithm

Parameter                         Common value
Number of search agents           30-50
Number of moths (population)      10-30
Maximum number of iterations      100-10,000

(Seyedali Mirjalili, 2015)

It should be noted here that moths and flames are both solutions. The difference between them is the way they are treated and updated in each iteration. The moths are the actual search agents that move around the search space, whereas the flames are the best positions that the moths have obtained so far. In other words, flames can be considered as flags or pins dropped by moths while exploring the search space. Each moth therefore searches around a flag (flame) and updates it when it finds a better solution. With this mechanism, a moth never loses its best solution.

Updating the Positions of Moths

MFO employs three different functions to converge to the global optimum of an optimization problem. These functions are defined in Equation (40):

$$MFO = (I, P, T) \tag{40}$$

where I refers to the initial random locations of the moths, P refers to the motion of the moths in the search space, and T refers to the termination of the search process. The following equation represents the I function, which implements the random distribution:

$$M(i,j) = (ub(j) - lb(j)) \cdot rand() + lb(j) \tag{41}$$

where lb and ub indicate the lower and upper bounds of the variables, respectively. As mentioned previously, the moths fly in the search space using transverse orientation. A logarithmic spiral used for this purpose should satisfy three conditions:

• The spiral's initial point should start from the moth.
• The spiral's final point should be the position of the flame.
• The fluctuation of the range of the spiral should not exceed the search space.

Therefore, the logarithmic spiral for the MFO algorithm can be defined as follows:

$$S(M_i, F_j) = D_i \cdot e^{bt} \cdot \cos(2\pi t) + F_j \tag{42}$$

where $D_i$ refers to the distance between the i-th moth and the j-th flame (i.e., $D_i = |F_j - M_i|$), b is a constant that defines the shape of the logarithmic spiral, and t is a random number in [-1, 1]. In MFO, the balance between exploitation and exploration is maintained by the spiral motion of the moth around the flame in the search space. Also, to avoid becoming trapped in local optima, the best solutions are kept in each iteration, and the moths fly around the flames (i.e., each moth flies around its nearest flame) using the matrices defined above.
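The spiral of Eq. (42) can be sketched in a few lines of Python. This is only an illustration: the function name `spiral_move`, the default b = 1, and the decision to draw a separate t for each dimension are assumptions of this sketch rather than requirements stated in the text.

```python
import math
import random


def spiral_move(moth, flame, b=1.0, rng=random):
    """Move one moth around one flame along the logarithmic spiral of Eq. (42)."""
    new_position = []
    for m, f in zip(moth, flame):
        d = abs(f - m)                  # D_i: distance to the flame in this dimension
        t = rng.uniform(-1.0, 1.0)      # random number in [-1, 1]
        new_position.append(d * math.exp(b * t) * math.cos(2.0 * math.pi * t) + f)
    return new_position
```

Because the cosine term is bounded by 1, the new position never lies farther from the flame than D_i · e^b in any dimension, and a moth that already sits on its flame stays there.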

Updating the Number of Flames

This step aims to enhance the exploitation of the MFO algorithm: updating the moths' positions with respect to n different locations in the search space may reduce the chance of exploiting the most promising solutions. Decreasing the number of flames over the iterations helps to solve this issue, based on the following equation:

$$\text{flame no} = \text{round}\left(N - l \cdot \frac{N - 1}{T}\right) \tag{43}$$


where N is the maximum number of flames, l is the current number of iterations, and T indicates the maximum number of iterations.
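As a small worked example of Eq. (43), the schedule below computes the number of flames at a given iteration. The function and argument names are hypothetical; note also that Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from other environments in those rare cases.

```python
def flame_count(iteration, max_flames, max_iterations):
    """Number of flames at the given iteration, following Eq. (43)."""
    return round(max_flames - iteration * (max_flames - 1) / max_iterations)
```

With N = 30 flames and T = 100 iterations, the schedule starts at 30 flames and ends with a single flame, so the moths gradually concentrate on the best solution found.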

Different Variants of MFO

MFO was introduced in 2015 as a swarm-based metaheuristic algorithm. Since then, it has been modified in various ways to suit the different search spaces of optimization problems. The main variants of MFO are summarized below.

Multi-objective

The multi-objective moth–flame optimization algorithm (MOMFA) was proposed to improve the efficiency of water resource utilization. The method combines the original moth–flame optimization algorithm with opposition-based learning and indicator-based selection mechanisms to maintain diversity and accelerate convergence. The algorithm was tested on the Lushui River basin and on many benchmarks.

Figure 5(a). Publications on Moth flame optimization algorithm.


Figure 5(b). The distribution of published research articles. (https://doi.org/10.1016/j.knosys.2015.07.006).

Binary

A modified moth-flame optimization algorithm (MMFOA) was developed to examine the local and global search characteristics of the basic algorithm. It aims to improve the solution of the unit commitment (UC) problem by using a binary-coded modified MFOA (BMMFOA). The basic MFO is a nature-inspired heuristic search approach that mimics the transverse navigational behavior of moths around artificial lights that they mistake for natural moonlight. Unlike many other swarm-based approaches, the algorithm updates positions on a one-to-one basis between each moth and its corresponding flame. The MMFOA improves the exploitation search of the moths and reduces the number of flames. The research tested four additional variants on the unit commitment problem of power system operational scheduling, and it used a modified sigmoidal transformation to map the real-valued moth and flame positions to binary values. The efficacy of the proposed approaches was examined on different test systems in terms of convergence characteristics, execution time, and solution quality.


Hybridization

A hybrid optimization framework was presented in the context of electric motor design. The framework was developed to analyze the dynamic behavior of the BLDC motor in order to solve the torque ripple problem. It consists of three components: an AC voltage source, a voltage source inverter, and an integrated MFO and fuzzy logic controller (FLC). The AC voltage is used as the input source. MFO is used to control and minimize the supplied voltage and the line current harmonics present in the motor system, while the FLC improves the performance of the MFO algorithm by improving its updating function and minimizing the torque ripple. In the experiment, the performance and applicability of the MFO framework were evaluated under three test conditions of speed and torque: constant speed and torque, constant torque with varying speed, and varying torque with constant speed. The results of the proposed framework were compared with those of the original MFO and controller. The comparison shows a clear superiority of the proposed framework over the other techniques.

Applications

Many applications of MFO to benchmark optimization and real-world problems have been reported. They cover areas such as benchmark optimization, chemical, economic, image processing, medical applications, networks, the power dispatch problem, and engineering optimization.

Whale Optimization Algorithm (WOA)

Whales are fancy creatures and are considered the biggest mammals in the world. They are a widely distributed and diverse group of fully aquatic placental marine mammals, forming an informal grouping within the infraorder Cetacea that usually excludes dolphins and porpoises. Whales are open-ocean creatures: they feed, mate, give birth, suckle, and raise their young at sea.


An adult whale can grow up to 30 m long and weigh up to 180 t. There are seven main species of this giant mammal, including the killer, humpback, right, finback, and blue whales. Whales are mostly considered predators. They never sleep completely because they have to breathe at the surface of the ocean; in fact, only half of the brain sleeps at a time. The interesting thing about whales is that they are considered highly intelligent animals with emotion. According to Hof and Van Der Gucht, whales have cells in certain areas of their brains, called spindle cells, that are similar to those of humans (S.A. Uymaz et al., 2015). These cells are responsible for judgment, emotions, and social behaviors in humans. Whales have twice as many of these cells as adult humans, which is the main cause of their smartness. It has been proven that whales can think, learn, judge, communicate, and even become emotional as humans do, but obviously with a much lower level of smartness. It has also been observed that whales (mostly killer whales) are able to develop their own dialect. Another interesting point is the social behavior of whales. They live alone or in groups, but they are mostly observed in groups. Some species (killer whales, for instance) can live in a family over their entire life. One of the biggest baleen whales is the humpback whale (Megaptera novaeangliae). An adult humpback whale is almost the size of a school bus. Its favorite prey are krill and small fish herds. Figure 6 shows this mammal.

Figure 6. Bubble-net feeding behavior of humpback whales. (https://www.sciencedirect.com/science/article/abs/pii/S0965997816300163?via%3Dihub).


The most interesting thing about humpback whales is their special hunting method. This foraging behavior is called the "bubble-net feeding method." Humpback whales prefer to hunt schools of krill or small fish close to the surface. It has been observed that this foraging is done by creating distinctive bubbles along a circle or '9'-shaped path, as shown in Figure 6. Before 2011, this behavior was investigated only through observation from the surface. It was later investigated with tag sensors, which captured 300 tag-derived bubble-net feeding events of 9 individual humpback whales. Two maneuvers associated with bubbles were found and named 'upward-spirals' and 'double-loops'. In the former maneuver, humpback whales dive around 12 m down, then start to create a spiral-shaped bubble around the prey and swim up toward the surface. The latter maneuver includes three different stages: coral loop, lob tail, and capture loop. It is worth mentioning here that bubble-net feeding is a unique behavior that can only be observed in humpback whales. In this work, the spiral bubble-net feeding maneuver is mathematically modelled in order to perform optimization.

Mathematical Model

Humpback whales can recognize the location of prey and encircle it. Since the position of the optimal design in the search space is not known a priori, the WOA algorithm assumes that the current best candidate solution is the target prey or is close to the optimum. After the best search agent is defined, the other search agents try to update their positions toward the best search agent. This behavior is represented by the following equations:

$$\vec{D} = |\vec{C} \cdot \vec{X}^{*}(t) - \vec{X}(t)| \tag{44}$$

$$\vec{X}(t+1) = \vec{X}^{*}(t) - \vec{A} \cdot \vec{D} \tag{45}$$

where t indicates the current iteration, $\vec{A}$ and $\vec{C}$ are coefficient vectors, $\vec{X}^{*}$ is the position vector of the best solution obtained so far, $\vec{X}$ is the position vector, | | is the absolute value, and · is an element-by-element multiplication. It is worth mentioning here that $\vec{X}^{*}$ should be updated in each iteration if there is a better solution. The vectors $\vec{A}$ and $\vec{C}$ are calculated as follows:

$$\vec{A} = 2\vec{a} \cdot \vec{r} - \vec{a} \tag{46}$$

$$\vec{C} = 2 \cdot \vec{r} \tag{47}$$

where $\vec{a}$ is linearly decreased from 2 to 0 over the course of iterations (in both exploration and exploitation phases) and $\vec{r}$ is a random vector in [0, 1]. Figure 7(a) illustrates the rationale behind Eq. (45) for a 2D problem. The position (X, Y) of a search agent can be updated according to the position of the current best record (X*, Y*). Different places around the best agent can be reached from the current position by adjusting the values of the $\vec{A}$ and $\vec{C}$ vectors. The possible updated positions of a search agent in 3D space are also depicted in Figure 7(a). By defining the random vector $\vec{r}$, it is possible to reach any position in the search space located between the key points shown in the figure. Therefore, Eq. (45) allows any search agent to update its position in the neighborhood of the current best solution and simulates encircling the prey. The same concept can be extended to a search space with n dimensions, in which the search agents move in hyper-cubes around the best solution obtained so far. As mentioned in the previous section, humpback whales also attack their prey with the bubble-net strategy.
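The encircling step of Eqs. (44) to (47) can be sketched as follows. The function names are hypothetical, and the sketch assumes that the random numbers used for A and for C are drawn independently for every dimension, which the text does not state explicitly.

```python
import random


def woa_coefficients(a, dim, rng=random):
    """Draw A = 2a*r - a and C = 2*r (Eqs. 46-47), one entry per dimension."""
    A = [2.0 * a * rng.random() - a for _ in range(dim)]
    C = [2.0 * rng.random() for _ in range(dim)]
    return A, C


def encircle(x, x_best, A, C):
    """Encircling-prey update of Eqs. (44)-(45), applied element by element."""
    return [
        xb - Ai * abs(Ci * xb - xi)
        for xi, xb, Ai, Ci in zip(x, x_best, A, C)
    ]
```

Every entry of A lies in [-a, a], so as a decreases toward 0 the step shrinks and the agent lands exactly on the best solution when a = 0.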

Bubble-net Attacking Method (Exploitation Phase) To mathematically model the bubble-net behavior of humpback whales, two approaches are designed as follows:

Shrinking Encircling Mechanism

This behavior is achieved by decreasing the value of $\vec{a}$ in Eq. (46). Note that the fluctuation range of $\vec{A}$ is also decreased by $\vec{a}$. In other words, $\vec{A}$ is a random value in the interval [−a, a], where a is decreased from 2 to 0 over the course of iterations. By setting random values for $\vec{A}$ in [−1, 1], the new position of a search agent can be defined anywhere between the original position of the agent and the position of the current best agent. Panel (a) of Figure 7(b) shows the possible positions from (X, Y) towards (X*, Y*) that can be achieved with 0 ≤ A ≤ 1 in a 2D space.


Figure 7(a). 2D and 3D position vectors and their possible next locations (X ∗is the best solution obtained so far).

Figure 7(b). (a) Shrinking encircling mechanism and (b) spiral updating position. (https://www.sciencedirect.com/science/article/abs/pii/S0965997816300163?via%3Dihub).

Spiral Updating Position

As can be seen in panel (b) of Figure 7(b), this approach first calculates the distance between the whale located at (X, Y) and the prey located at (X*, Y*). A spiral equation is then created between the position of the whale and the prey to mimic the helix-shaped movement of humpback whales as follows:

$$\vec{X}(t+1) = \vec{D'} \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t) \tag{48}$$

where $\vec{D'} = |\vec{X}^{*}(t) - \vec{X}(t)|$ indicates the distance of the i-th whale to the prey (best solution obtained so far), b is a constant for defining the shape of the logarithmic spiral, l is a random number in [−1, 1], and · is an element-by-element multiplication.


Note that humpback whales swim around the prey within a shrinking circle and along a spiral-shaped path simultaneously. To model this simultaneous behavior, we assume that there is a probability of 50% to choose between either the shrinking encircling mechanism or the spiral model to update the position of whales during optimization. The mathematical model is as follows:

$$\vec{X}(t+1) = \begin{cases} \vec{X}^{*}(t) - \vec{A} \cdot \vec{D} & \text{if } p < 0.5 \\ \vec{D'} \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t) & \text{if } p \geq 0.5 \end{cases} \tag{49}$$

where p is a random number in [0, 1]. [...] $|\vec{A}| > 1$ emphasize exploration and allow the WOA algorithm to perform a global search. The mathematical model is as follows:

$$\vec{D} = |\vec{C} \cdot \vec{X}_{rand} - \vec{X}| \tag{50}$$

$$\vec{X}(t+1) = \vec{X}_{rand} - \vec{A} \cdot \vec{D} \tag{51}$$

where $\vec{X}_{rand}$ is a random position vector (a random whale) chosen from the current population. Some of the possible positions around a particular solution with $\vec{A} > 1$ are depicted in Figure 8. The WOA algorithm starts with a set of random solutions. At each iteration, search agents update their positions with respect to either a randomly chosen search agent or the best solution obtained so far. The parameter a is decreased from 2 to 0 in order to provide exploration and exploitation, respectively. A random search agent is chosen when $|\vec{A}| > 1$, while the best solution is selected when $|\vec{A}| < 1$ for updating the position of the search agents.
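Putting the three update rules together, one WOA position update could look like the sketch below. It is a simplified illustration, not the reference implementation: the function name is hypothetical, A and C are drawn as scalars per whale instead of per dimension, and the boundary case |A| = 1 is treated as exploration.

```python
import math
import random


def woa_step(x, x_best, population, a, b=1.0, rng=random):
    """Return the next position of one whale (Eqs. 44-51).

    x, x_best: current and best-so-far positions (lists of floats).
    population: list of positions from which a random whale is drawn.
    a: control parameter, decreased from 2 to 0 by the caller.
    """
    p = rng.random()
    if p < 0.5:
        # Scalar A and C per whale: a simplifying assumption of this sketch.
        A = 2.0 * a * rng.random() - a
        C = 2.0 * rng.random()
        # |A| < 1: exploit the best solution; otherwise explore a random whale.
        target = x_best if abs(A) < 1.0 else rng.choice(population)
        return [tg - A * abs(C * tg - xi) for xi, tg in zip(x, target)]
    # Spiral update around the best solution, Eq. (48).
    l = rng.uniform(-1.0, 1.0)
    spiral = math.exp(b * l) * math.cos(2.0 * math.pi * l)
    return [abs(xb - xi) * spiral + xb for xi, xb in zip(x, x_best)]
```

A full optimizer would call `woa_step` for every whale in each iteration, re-evaluate the fitness values, update the best solution, and lower a linearly from 2 to 0.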

Figure 8. Exploration mechanism implemented in WOA (X* is a randomly chosen search agent) (https://www.sciencedirect.com/science/article/abs/pii/S0965997816300163?via%3Dihub).

Depending on the value of p, WOA is able to switch between a spiral and a circular movement. Finally, the WOA algorithm is terminated when a termination criterion is satisfied. From a theoretical standpoint, WOA can be considered a global optimizer because it includes both exploration and exploitation abilities. Furthermore, the proposed hyper-cube mechanism defines a search space in the neighborhood of the best solution and allows other search agents to exploit the current best record inside that domain. Adaptive variation of the search vector A allows the WOA algorithm to transit smoothly between exploration and exploitation: as A decreases, some iterations are devoted to exploration (|A| ≥ 1), and the rest are dedicated to exploitation (|A| < 1). Remarkably, WOA includes only two main internal parameters to be adjusted (A and C). Mutation and other evolutionary operations might have been included in the WOA formulation to fully reproduce the behavior of humpback whales; hybridization with evolutionary search schemes may be the subject of future studies.


Fish Swarm Optimization Algorithm (FSOA)

Fish always try to maintain their colonies and accordingly demonstrate intelligent behaviors. Searching for food, migration, and dealing with dangers all happen in a social form, and the interactions between all fish in a group result in intelligent social behavior. The FSOA is based on fish swarms observed in nature: approximately 50% of fish species live in swarms (i.e., they present synchronous and coordinated movements) at some moment of their lives, as shown in Figure 9 (Li et al., 2004; Ma et al., 2009).

Figure 9. Biological inspiration of the Fish Swarm Optimization Algorithm (Li et al., 2002). (https://www.scielo.br/j/lajss/a/ZsdRkGWRVtDdHJP8WTDFFpB/?lang=en#:~:text= The%20FSOA%20is%20an%20optimization,food%20sources%20(design%20space

In the development of the FSOA, the following characteristics are considered:

i. Each fish represents a candidate solution to the optimization problem;
ii. Food density is related to an objective function to be optimized (in an optimization problem, the amount of food in a region is inversely proportional to the value of the objective function); and
iii. The aquarium is the design space where the fish can be found.


As noted earlier, the fish weight in the swarm represents the accumulation of food (e.g., the objective function) received during the evolutionary process. In this case, weight is an indicator of success (Li et al., 2002; Madeiro, 2010). Basically, the FSOA presents four operators that can be classified into "search" and "movement" operators (Qiang et al., 2022; Chu et al., 2007). Details on each of these operators are given next.

Concept and Algorithm

Individual Movement Operator

This operator contributes to the individual and collective movements of the fish in the swarm. Each fish updates its position by using Equation (52):

$$x_i^{t+1} = x_i^{t} + rand \times s_{ind} \tag{52}$$

where $x_i$ is the final position of fish i at the current generation, rand is a random number generator, and $s_{ind}$ is a weighted parameter.

Food Operator

The weight of each fish is a metaphor used to measure the success of the food search. The higher the weight of a fish, the more likely this fish is to be in a potentially interesting region of the design space. The amount of food that a fish eats depends on the improvement in its objective function in the current generation. The weight is updated according to Equation (53):

$$W_i^{t+1} = W_i^{t} + \frac{\Delta f_i}{\max(\Delta f)} \tag{53}$$

where $W_i^{t}$ is the weight of fish i at generation t and $\Delta f_i$ is the difference of the objective function between the current position and the new position of fish i. It is important to emphasize that $\Delta f_i = 0$ for a fish that remains in the same position.

Instinctive Collective Movement Operator

This operator is important for the individual movement of fish when $\Delta f_i \neq 0$. Thus, only the fish whose individual movement resulted in an improvement of their fitness will influence the direction of motion of the school, resulting in instinctive collective movement. In this case, the resulting direction ($\vec{I}^{t}$), calculated using the contribution of the directions taken by the fish, and the new position of the i-th fish are given by Equations (54) and (55):

$$\vec{I}^{t} = \frac{\sum_{i=1}^{N} \Delta \vec{x}_i \, \Delta f_i}{\sum_{i=1}^{N} \Delta f_i} \tag{54}$$

$$\vec{x}_i^{t+1} = \vec{x}_i^{t} + \vec{I}^{t} \tag{55}$$

It is important to emphasize that in the application of this operator, the direction chosen by the fish that located the largest portion of food exerts the greatest influence on the swarm. Therefore, the instinctive collective movement operator tends to guide the swarm in the direction of motion chosen by the fish that found the largest portion of food in its individual movement.
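The food operator of Eq. (53) and the instinctive collective movement of Eqs. (54) and (55) can be sketched as below. The function names are hypothetical, and two conventions are assumptions of this sketch: each Δf is stored as a non-negative improvement, and nothing changes when no fish has improved (which avoids a division by zero).

```python
def feed(weights, delta_f):
    """Food operator, Eq. (53): W_i <- W_i + delta_f_i / max(delta_f)."""
    largest = max(delta_f)
    if largest <= 0.0:
        return list(weights)  # assumption: no improvement, weights unchanged
    return [w + df / largest for w, df in zip(weights, delta_f)]


def collective_instinctive(positions, delta_x, delta_f):
    """Shift every fish by the improvement-weighted mean displacement (Eqs. 54-55)."""
    total = sum(delta_f)
    if total == 0.0:
        return [list(p) for p in positions]  # assumption: no drift without improvement
    dim = len(positions[0])
    drift = [
        sum(dx[k] * df for dx, df in zip(delta_x, delta_f)) / total
        for k in range(dim)
    ]
    return [[p[k] + drift[k] for k in range(dim)] for p in positions]
```

For example, if only the first of two fish improved, the whole swarm is shifted by exactly that fish's displacement, which matches the statement that the most successful fish exerts the greatest influence.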

Non-Instinctive Collective Movement Operator

As noted earlier, the fish weight is a good indication of the success of the search for food. When the swarm weight is increasing, the search process is performing successfully, so the "radius" of the swarm must decrease so that other regions can be explored. If the swarm weight remains constant, the radius should increase to allow the exploration of new regions. For the swarm contraction, the centroid concept is used. The centroid is obtained as the average position of all fish weighted by the respective fish weights, according to Equation (56):

$$\vec{B}^{t} = \frac{\sum_{i=1}^{N} \vec{x}_i^{t} W_i^{t}}{\sum_{i=1}^{N} W_i^{t}} \tag{56}$$


If the swarm weight remains constant in the current iteration, all fish must update their positions by using Equation (57):

$$\vec{x}^{t+1} = \vec{x}^{t} - s_{vol} \times \frac{\vec{x}^{t} - \vec{B}^{t}}{d(\vec{x}^{t}, \vec{B}^{t})} \tag{57}$$

where d is a function that calculates the Euclidean distance between the centroid and the current position of the fish, and $s_{vol}$ is the step size used to control fish displacements. FSOA, based on the social behavior of fish colonies, was applied to solve different design problems, and the simulation results were compared with those obtained from other competing evolutionary algorithms. The results showed that the methodology is a promising alternative for a number of engineering applications. However, the number of objective function evaluations still needs to be studied further before more definitive conclusions can be drawn. This number is inherent to the methodology because of the number of loops that the algorithm requires. Consequently, a high number of objective function evaluations is normally expected in the present version of the FSOA algorithm. Further research work will focus on the influence of the parameter values required by FSOA on the quality of the optimal solutions.
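The centroid of Eq. (56) and the move of Eq. (57) can be illustrated with the following sketch. The function names are hypothetical. The sign follows Eq. (57) as written, so the fish moves toward the centroid by a distance of s_vol; leaving a fish that already sits on the centroid unchanged is an assumption made here to avoid dividing by zero.

```python
import math


def barycenter(positions, weights):
    """Weighted centroid of the swarm, Eq. (56)."""
    total = sum(weights)
    dim = len(positions[0])
    return [
        sum(p[k] * w for p, w in zip(positions, weights)) / total
        for k in range(dim)
    ]


def volitive_move(x, center, step):
    """Move one fish relative to the centroid, following Eq. (57)."""
    dist = math.dist(x, center)  # Euclidean distance d(x, B)
    if dist == 0.0:
        return list(x)  # assumption: a fish on the centroid does not move
    return [xi - step * (xi - c) / dist for xi, c in zip(x, center)]
```

With two fish at (0, 0) and (2, 0) carrying weights 1 and 3, the centroid lies at (1.5, 0), closer to the heavier fish, and a fish at (3, 0) with step 0.5 moves to (2.5, 0).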

Artificial Neural Network

There are roughly 86 billion neurons in the human brain, and each neuron has somewhere between 1,000 and 100,000 connection points. In the human brain, data is stored in a distributed manner, and we can retrieve more than one piece of this data from memory in parallel when necessary. We can say that the human brain is made up of incredibly powerful parallel processors. The term "Artificial Neural Network" is derived from the biological neural networks that form the structure of the human brain. Just as the human brain has neurons interconnected to one another, artificial neural networks (ANNs) have neurons that are interconnected in the various layers of the network. These neurons are known as "nodes." An ANN attempts to mimic the network of neurons that makes up a human brain so that computers can understand things and make decisions in a human-like manner. The artificial neural network is designed by programming computers to behave like interconnected brain cells.

We can understand the artificial neural network through the example of a digital logic gate that takes inputs and gives an output. Consider an "OR" gate, which takes two inputs. If one or both of the inputs are "On," the output is "On." If both inputs are "Off," the output is "Off." Here the output depends only on the inputs. Our brain does not work in the same way: the relationship between outputs and inputs keeps changing because the neurons in our brain are learning.

An artificial neural network (or simply neural network) consists of an input layer of neurons (or nodes, units), one or more hidden layers, and an output layer. Training an artificial neural network is an optimization task, since the goal of the training process is to find the optimal weight set of the network. Since ANNs are quite successful in modeling nonlinearity and have characteristics such as generalization capability, adaptability, self-organization, real-time operation, and fault tolerance, they are used in many applications across research fields. Finding a suitable network structure and optimal weight values makes the design of ANNs a difficult optimization problem. In other words, the success of ANNs largely depends on the architecture, the training algorithm, and the choice of features used in training. Traditional training algorithms have some drawbacks, such as getting stuck in local minima and computational complexity. Therefore, evolutionary algorithms are employed to train neural networks to overcome these issues. Figure 10 shows a typical architecture, where the lines connecting neurons are also shown. Each connection is associated with a numeric value called a weight. The output, $h_i$, of neuron i in the hidden layer is

$$h_i = \sigma\left(\sum_{j=1}^{N} V_{ij} x_j + T_i^{hid}\right) \tag{58}$$

where σ() is called the activation (or transfer) function, N is the number of input neurons, $V_{ij}$ are the weights, $x_j$ are the inputs to the input neurons, and $T_i^{hid}$ are the threshold terms of the hidden neurons. The purpose of the activation function is, besides introducing nonlinearity into the neural network, to bound the value of the neuron so that the neural network is not paralyzed by divergent neurons. A common example of the activation function is the sigmoid (or logistic) function, defined as (Figure 10)

$$\sigma(u) = \frac{1}{1 + e^{-u}}$$

Bio-Inspired Computing


Other possible activation functions are arc tangent and hyperbolic tangent. They have a similar response to the inputs as the sigmoid function but differ in the output ranges.
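The hidden-layer output of equation (58) and the three activation functions mentioned above can be sketched in a few lines of Python. This is only an illustration: the weights, thresholds, and input values below are hypothetical, and the function names are not taken from the chapter.

```python
import math

def sigmoid(u):
    """Logistic function; its output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

# The three activations differ mainly in their output ranges.
ACTIVATIONS = {
    "sigmoid": sigmoid,     # range (0, 1)
    "tanh": math.tanh,      # range (-1, 1)
    "arctan": math.atan,    # range (-pi/2, pi/2)
}

def hidden_layer(x, V, T, activation=sigmoid):
    """Equation (58): h_i = sigma(sum_j V[i][j] * x[j] + T[i])."""
    return [
        activation(sum(v_ij * x_j for v_ij, x_j in zip(row, x)) + t_i)
        for row, t_i in zip(V, T)
    ]

# Hypothetical weights and thresholds, chosen only for illustration.
x = [1.0, 0.0]
V = [[0.5, -0.5], [2.0, 1.0]]
T = [0.0, -2.0]
h = hidden_layer(x, V, T)
```

Passing `math.tanh` or `math.atan` as `activation` changes only the range of the returned values, which is the difference the text describes.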

Figure 10. Architecture of a neural network. (Moscow.sci-hub.se).

A neural network constructed in this way can approximate any continuous function on a bounded input domain to arbitrary precision, provided it has enough hidden neurons. Numbers given to the input neurons are the independent variables, and those returned by the output neurons are the dependent variables of the function being approximated. Inputs to and outputs from a neural network can be binary (such as yes or no) or even symbolic (green, red) when the data are appropriately encoded. This feature gives neural networks a wide range of applicability.
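One common way to encode symbols for a neural network is one-hot encoding, sketched below. The chapter does not name a specific encoding, so the scheme, the helper names, and the symbol sets are assumptions made for illustration.

```python
def one_hot(symbol, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of `symbol`."""
    if symbol not in vocabulary:
        raise ValueError(f"unknown symbol: {symbol!r}")
    return [1.0 if s == symbol else 0.0 for s in vocabulary]

def decode(outputs, vocabulary):
    """Map output-neuron activations back to the most strongly activated symbol."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return vocabulary[best]

COLOURS = ["green", "red"]           # hypothetical symbol set
YES_NO = {"no": 0.0, "yes": 1.0}     # a binary value fits in a single input

encoded = one_hot("red", COLOURS)
decoded = decode([0.2, 0.7], COLOURS)
```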

Artificial Bee Colony Algorithm

The artificial bee colony (ABC) algorithm was proposed by Karaboga in 2005 for optimizing numerical problems. The algorithm simulates the intelligent foraging behaviour of honeybee swarms. It is a simple, robust, population-based stochastic optimization algorithm. Karaboga and Basturk compared the performance of the ABC algorithm with that of other well-known modern heuristic algorithms, such as the genetic algorithm (GA), Differential Evolution (DE), and Particle Swarm Optimization (PSO), on unconstrained problems.


Raghav Mishra, Tanishq Mandloi and Anjali Priyadarshini

The ABC algorithm, which has good exploration and exploitation capabilities when searching for an optimal weight set, can be used to train neural networks. In this work, the ABC algorithm is employed to train feed-forward neural networks, and its performance is compared with that of GA, an evolutionary algorithm, and the back-propagation (BP) algorithm. Detailed pseudo-code of the ABC algorithm is given below:

1. Initialize the population of solutions xi, i = 1, ..., SN
2. Evaluate the population
3. cycle = 1
4. Repeat
5. Produce new solutions υi for the employed bees by using equation (60) and evaluate them
6. Apply the greedy selection process
7. Calculate the probability values pi for the solutions xi by equation (59)
8. Produce the new solutions υi for the onlookers from the solutions xi selected depending on pi and evaluate them
9. Apply the greedy selection process
10. Determine the abandoned solution for the scout, if it exists, and replace it with a new randomly produced solution xi
11. Memorize the best solution achieved so far
12. cycle = cycle + 1
13. Until cycle = MCN

In the ABC algorithm, the position of a food source represents a possible solution to the optimization problem, and the nectar amount of a food source corresponds to the quality (fitness) of the associated solution. The number of employed bees, like the number of onlooker bees, is equal to the number of solutions in the population. In the first step, the ABC generates a randomly distributed initial population P(G = 0) of SN solutions (food source positions), where SN denotes the size of the population. Each solution xi (i = 1, 2, ..., SN) is a D-dimensional vector, where D is the number of optimization parameters. After initialization, the population of positions (solutions) is subjected to repeated cycles, C = 1, 2, ..., MCN, of the search processes of the employed bees, the onlooker bees,


and the scout bees. An employed bee modifies the position (solution) in her memory depending on local (visual) information and tests the nectar amount (fitness value) of the new source (new solution). If the nectar amount of the new source is higher than that of the previous one, the bee memorizes the new position and forgets the old one. Otherwise, she keeps the previous position in her memory. After all employed bees complete the search process, they share the nectar and position information of the food sources with the onlooker bees in the dance area. An onlooker bee evaluates the nectar information taken from all employed bees and chooses a food source with a probability related to its nectar amount. As in the case of the employed bee, she modifies the position in her memory and checks the nectar amount of the candidate source. If its nectar is higher than that of the previous one, the bee memorizes the new position and forgets the old one. An artificial onlooker bee chooses a food source depending on the probability value associated with that food source, $p_i$, calculated by the following expression:

$$p_i = \frac{fit_i}{\sum_{n=1}^{SN} fit_n} \tag{59}$$

where $fit_i$ is the fitness value of the solution i, which is proportional to the nectar amount of the food source in the position i, and SN is the number of food sources, which is equal to the number of employed bees (BN). In order to produce a candidate food position from the old one in memory, the ABC uses the following expression:

$$v_{ij} = x_{ij} + \varphi_{ij}\,(x_{ij} - x_{kj}) \tag{60}$$

where k ∈ {1, 2, ..., SN} and j ∈ {1, 2, ..., D} are randomly chosen indexes. Although k is determined randomly, it has to be different from i. $\varphi_{ij}$ is a random number in the range [−1, 1]. It controls the production of neighbor food sources around $x_{ij}$ and represents the visual comparison of two food positions by a bee. As the difference between the parameters $x_{ij}$ and $x_{kj}$ decreases, the perturbation on the position $x_{ij}$ decreases, too. Thus, as the search approaches the optimum solution in the search space, the step length is adaptively reduced.
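Equations (59) and (60) can be illustrated with a short Python sketch. The function names and the toy population are hypothetical; the sketch assumes positive fitness values and changes one randomly chosen dimension j per candidate, as the text describes.

```python
import random

def selection_probabilities(fitness):
    """Equation (59): p_i = fit_i / sum_n fit_n (fitness values must be positive)."""
    total = sum(fitness)
    return [f / total for f in fitness]

def neighbour(population, i, rng=random):
    """Equation (60): perturb one random dimension j of solution i
    using a randomly chosen partner k that differs from i."""
    x_i = population[i]
    k = rng.choice([idx for idx in range(len(population)) if idx != i])
    j = rng.randrange(len(x_i))
    phi = rng.uniform(-1.0, 1.0)
    v = list(x_i)
    v[j] = x_i[j] + phi * (x_i[j] - population[k][j])
    return v

# Toy food sources (two-dimensional solutions), made up for illustration.
population = [[0.0, 0.0], [1.0, 2.0], [4.0, -1.0]]
probs = selection_probabilities([1.0, 3.0, 4.0])
candidate = neighbour(population, 0, random.Random(0))
```

Because the perturbation is scaled by the difference between two food sources, candidates stay close to the current position once the population has converged, which is the adaptive step length noted above.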


A food source whose nectar has been abandoned by the bees is replaced with a new food source by the scouts. In ABC, this is simulated by producing a random position and using it to replace the abandoned one. If a position cannot be improved further within a predetermined number of cycles, that food source is assumed to be abandoned. The ABC algorithm, a simple and robust optimization algorithm, has been used to train feed-forward ANNs for classification. Its performance has been compared with that of the traditional back-propagation algorithm and of the genetic algorithm, a well-known evolutionary algorithm. The ABC algorithm can be successfully applied to train feed-forward neural networks. Applying ABC to other classification test problems, such as iris, diabetes, and cancer classification, and using the algorithm to optimize the network structure as well as the weights remain as future work.
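The complete ABC cycle (employed bees, onlookers, scout) can be sketched as follows for a minimisation problem. Several details are assumptions rather than statements from the chapter: the fitness mapping 1 / (1 + cost), the `limit` parameter that triggers abandonment, the clipping to bounds, and the sphere test function.

```python
import random

def abc_minimise(f, dim, bounds, sn=10, limit=20, mcn=200, seed=0):
    """Minimal artificial bee colony sketch; returns (best_position, best_cost)."""
    rng = random.Random(seed)
    lo, hi = bounds
    new = lambda: [rng.uniform(lo, hi) for _ in range(dim)]
    fit = lambda cost: 1.0 / (1.0 + cost)   # assumed mapping, valid for cost >= 0

    foods = [new() for _ in range(sn)]
    costs = [f(x) for x in foods]
    trials = [0] * sn                        # cycles without improvement
    best_cost, best = min(zip(costs, foods))
    best = list(best)

    def try_neighbour(i):
        k = rng.choice([n for n in range(sn) if n != i])
        j = rng.randrange(dim)
        v = list(foods[i])
        v[j] += rng.uniform(-1.0, 1.0) * (foods[i][j] - foods[k][j])  # Eq. (60)
        v[j] = min(max(v[j], lo), hi)
        c = f(v)
        if c < costs[i]:                     # greedy selection
            foods[i], costs[i], trials[i] = v, c, 0
        else:
            trials[i] += 1

    for _ in range(mcn):
        for i in range(sn):                  # employed bees
            try_neighbour(i)
        fits = [fit(c) for c in costs]
        total = sum(fits)
        probs = [x / total for x in fits]    # Eq. (59)
        for _ in range(sn):                  # onlooker bees
            try_neighbour(rng.choices(range(sn), weights=probs)[0])
        worst = max(range(sn), key=lambda i: trials[i])
        if trials[worst] > limit:            # scout replaces an abandoned source
            foods[worst] = new()
            costs[worst] = f(foods[worst])
            trials[worst] = 0
        i_best = min(range(sn), key=lambda i: costs[i])
        if costs[i_best] < best_cost:        # memorize the best solution so far
            best_cost, best = costs[i_best], list(foods[i_best])
    return best, best_cost

def sphere(x):
    """Simple test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

solution, cost = abc_minimise(sphere, dim=2, bounds=(-5.0, 5.0))
```

To train a feed-forward network in this way, `f` would return the network's training error for a weight vector; that part is left out here to keep the sketch short.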

Cuckoo Optimization Algorithm (COA)

Cuckoos typically approach nests stealthily and quickly drop their thick-shelled eggs onto host eggs to increase the chances of breaking them. They find nests by watching hosts from hidden perches, and laying is synchronized with that of the hosts. Most cuckoos lay one egg per nest, except in a few species in which cuckoo nestlings do not kill those of the host. Cuckoos defend territories against other cuckoos and usually remove a host egg before laying one of their own. Some of these eggs, which look similar to the host bird’s eggs, have the opportunity to grow up and become adult cuckoos. In other cases, the eggs are discovered by the host birds, which throw them away or leave their nests and build new ones elsewhere. This algorithm was inspired by the special lifestyle of the cuckoo species. The aim of the COA is to maximize the survival rate of the eggs. Each egg in a nest represents a solution, and a cuckoo egg stands for a new solution. The COA uses new and potentially better solutions to replace inadequate solutions in the nests. The COA is based on the following rules:

1. Each cuckoo lays one egg at a time and dumps this egg in a randomly chosen nest;
2. The best nests with high-quality eggs (solutions) will carry over to the next generation;


3. The number of available host nests is fixed, and a host bird can detect an alien egg with a probability pa ∈ [0, 1].

Cuckoos lay eggs within a maximum distance from their habitat. This range is called the Egg Laying Radius (ELR). In the algorithm, ELR is defined as:

$$ELR = \alpha \times \frac{\text{Number of current cuckoo's eggs}}{\text{Total number of eggs}} \times (var_{hi} - var_{low}) \tag{61}$$

where α is an integer that caps the maximum value of ELR, and $var_{hi}$ and $var_{low}$ are the upper and lower limits of the variables, respectively. When cuckoo groups form in different areas, the society with the best profit value is selected as the goal point to which the other cuckoos immigrate. To recognize which cuckoo belongs to which group, cuckoos are grouped by the K-means clustering method. When moving toward the goal point, the cuckoos fly only part of the way and deviate from the direct path. Specifically, each cuckoo flies only a fraction λ of the total distance and has a deviation of φ radians. The parameters for each cuckoo are defined as follows:

λ ~ U(0, 1)
φ ~ U(−ω, ω)

where λ ~ U(0, 1) means that λ is a uniformly distributed random number between 0 and 1, and ω is a parameter that constrains the deviation from the goal habitat. A value of ω = π/6 is considered sufficient for good convergence. The lifestyle and characteristics of the cuckoo species have thus been the basic motivation for the development of this evolutionary optimization algorithm.
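The egg laying radius of equation (61) and the migration step with parameters λ and φ can be sketched as below. The two-dimensional rotation used to apply the deviation φ is an assumption made for illustration, as are the function names and the numeric values.

```python
import math
import random

def egg_laying_radius(alpha, eggs_of_cuckoo, total_eggs, var_hi, var_low):
    """Equation (61): ELR = alpha * (eggs of this cuckoo / total eggs) * (var_hi - var_low)."""
    return alpha * (eggs_of_cuckoo / total_eggs) * (var_hi - var_low)

def migrate(position, goal, omega=math.pi / 6, rng=random):
    """Move a 2-D cuckoo a fraction lambda of the way to the goal, deviating by phi."""
    lam = rng.uniform(0.0, 1.0)          # lambda ~ U(0, 1)
    phi = rng.uniform(-omega, omega)     # phi ~ U(-omega, omega)
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    # Rotate the direction vector by phi, then scale it by lambda.
    rx = dx * math.cos(phi) - dy * math.sin(phi)
    ry = dx * math.sin(phi) + dy * math.cos(phi)
    return (position[0] + lam * rx, position[1] + lam * ry)

# Hypothetical values: 5 of 50 eggs belong to this cuckoo, variables lie in [-10, 10].
elr = egg_laying_radius(alpha=2, eggs_of_cuckoo=5, total_eggs=50, var_hi=10.0, var_low=-10.0)
new_pos = migrate((0.0, 0.0), (10.0, 0.0), rng=random.Random(1))
```

A cuckoo with more eggs gets a proportionally larger radius, and the migrated position never lies farther from the start than the goal itself.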

Bacterial Foraging Optimization Algorithm (BFOA)

During foraging, real bacteria achieve locomotion through a set of tensile flagella. Flagella help the bacterium Escherichia coli to tumble or swim, the two basic operations a bacterium performs while foraging. When the flagella rotate in the clockwise direction, each flagellum pulls on the cell. The flagella then move independently, and the bacterium tumbles. In a favourable place it tumbles less often, whereas in a harmful place it tumbles frequently to find a nutrient


gradient. Rotating the flagella in the counter-clockwise direction helps the bacterium to swim at a very fast rate. In the algorithm, the bacteria undergo chemotaxis: they tend to move up a nutrient gradient and avoid noxious environments. Generally, the bacteria move a longer distance in a friendly environment. Figure 11 depicts how the clockwise and counter-clockwise movement of a bacterium takes place in a nutrient solution. When bacteria get sufficient food, they increase in length and, at a suitable temperature, break in the middle to form an exact replica of themselves.

Figure 11. Swim and tumble of a bacterium. (https://www.genome.gov/geneticsglossary/Bacteria).

This phenomenon inspired the reproduction event in BFOA. Sudden environmental changes or attacks may destroy the chemotactic progress, and a group of bacteria may move to another place, or other bacteria may be introduced into the swarm of concern. This constitutes the elimination-dispersal event in the real bacterial population, in which all the bacteria in a region are killed or a group is dispersed into a new part of the environment. BFOA is inspired by the social foraging behavior of E. coli. It has drawn the attention of researchers because of its efficiency in solving real-world optimization problems arising in several application domains, and it has been widely accepted as a global optimization algorithm of current interest for distributed optimization and control. The underlying biology behind the foraging strategy


of E. coli is emulated and used as a simple optimization algorithm. Its applications include training neural networks for short-term electric load forecasting, image enhancement, tuning adaptive median filters, and improving the peak signal-to-noise ratio of highly corrupted images.
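The tumble-and-swim behaviour described in this section can be sketched as a simple chemotaxis loop for one bacterium. This is a greedy simplification that only accepts improving moves; it omits the reproduction and elimination-dispersal events, and the step size, swim length, and cost function are hypothetical choices rather than values from the chapter.

```python
import math
import random

def random_direction(dim, rng):
    """Tumble: pick a random unit vector."""
    d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in d)) or 1.0
    return [c / norm for c in d]

def chemotaxis(cost, position, step=0.1, n_steps=50, max_swim=4, rng=random):
    """Tumble-and-swim moves for one bacterium; lower cost means more nutrients."""
    pos, best = list(position), cost(position)
    for _ in range(n_steps):
        direction = random_direction(len(pos), rng)    # tumble
        for _ in range(max_swim):                      # swim while things improve
            trial = [p + step * d for p, d in zip(pos, direction)]
            c = cost(trial)
            if c >= best:
                break
            pos, best = trial, c
    return pos, best

def sphere(x):
    """Toy nutrient landscape with its best spot at the origin."""
    return sum(v * v for v in x)

end, end_cost = chemotaxis(sphere, [2.0, 2.0], rng=random.Random(0))
```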

Flower Pollination Algorithm (FPA)

Pollination can take two major forms: abiotic and biotic. About 90% of flowering plants rely on biotic pollination; that is, pollen is transferred by pollinators such as insects and other animals. About 10% of pollination takes the abiotic form, which does not require any pollinators. Wind and diffusion in water help the pollination of such flowering plants, and grass is a good example. Pollinators, sometimes called pollen vectors, can be very diverse. There are an estimated 200,000 or more varieties of pollinators, such as insects, bats, and birds. Pollination can be achieved by self-pollination or cross-pollination. Cross-pollination, or allogamy, means that pollination occurs from the pollen of a flower of a different plant, while self-pollination is the fertilization of one flower, such as a peach flower, from the pollen of the same flower or of different flowers of the same plant, which often occurs when no reliable pollinator is available. Biotic cross-pollination may occur over long distances, because pollinators such as bees, bats, birds, and flies can fly a long way; it can therefore be considered global pollination. In addition, bees and birds may exhibit Lévy flight behavior, with jump or flight distances obeying a Lévy distribution. Furthermore, flower constancy can be used as an increment step based on the similarity or difference between two flowers. It is estimated that there are over a quarter of a million types of flowering plants in nature and that about 80% of all plant species are flowering species. How flowering plants came to dominate the landscape from the Cretaceous period remains partly a mystery. Flowering plants have been evolving for more than 125 million years, and flowers have become so influential in evolution that we cannot imagine what the plant world would be like without them. The main purpose of a flower is ultimately reproduction via pollination.
Flower pollination is typically associated with the transfer of pollen, and such transfer is often linked with pollinators such as insects, birds, bats, and other animals. In fact, some flowers and insects have co-evolved into a very specialized flower-pollinator partnership. For example, some flowers can only


attract, and depend on, a specific species of insect for successful pollination. The above characteristics of the pollination process, flower constancy, and pollinator behavior can be idealized as the following rules:

1. Biotic and cross-pollination are considered a global pollination process, with pollen-carrying pollinators performing Lévy flights.
2. Abiotic and self-pollination are considered local pollination.
3. Flower constancy can be considered as a reproduction probability that is proportional to the similarity of the two flowers involved.
4. Local pollination and global pollination are controlled by a switch probability p ∈ [0, 1].

Due to physical proximity and other factors such as wind, local pollination can account for a significant fraction p of the overall pollination activities. From these idealized characteristics, it is possible to design a flower-based algorithm, namely the flower pollination algorithm (FPA). There are two key steps in this algorithm: global pollination and local pollination. In the global pollination step, flower pollen is carried by pollinators such as insects, and pollen can travel over a long distance because insects can often fly and move over a much longer range. This ensures the pollination and reproduction of the fittest, which we represent as g*. The first rule plus flower constancy can be represented mathematically as equation (62):

$$x_i^{t+1} = x_i^{t} + L\,(x_i^{t} - g_*) \tag{62}$$

where $x_i^t$ is the pollen i, or solution vector $x_i$, at iteration t, and g* is the current best solution found among all solutions at the current generation/iteration. The parameter L is the strength of the pollination, which is essentially a step size. Since insects may move over a long distance with various distance steps, a Lévy flight can be used to mimic this characteristic efficiently. Flowering plants have evolved some interesting pollination features, which led to the development of the flower algorithm that mimics these characteristics. The flower pollination algorithm has been reported to be very efficient and to outperform both GA and PSO, with an essentially exponential convergence rate. The efficiency of FPA can be attributed to two factors: long-distance pollinators and flower constancy. Pollinators such as insects can


travel a long distance, and thus they give the algorithm the ability to escape any local landscape and subsequently explore a larger search space. These act as exploration moves. On the other hand, flower constancy ensures that the same species of flowers (thus similar solutions) are chosen more frequently, which helps the algorithm converge more quickly.

Figure 12. Pseudo code of the proposed Flower Pollination Algorithm (FPA).

This step is essentially an exploitation step. The interplay and interaction of these key components and the selection of the best solution g* ensure that the algorithm is very efficient. Furthermore, it is possible to extend the flower algorithm to a discrete version so that it can solve combinatorial optimization problems.
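The two pollination steps can be sketched in Python as follows. Equation (62) is used for global pollination as printed in the text. Everything else is an assumption made for illustration: drawing the Lévy step with Mantegna's algorithm, the 0.01 step scale, the local rule that adds a random fraction of the difference between two other flowers, treating p = 0.8 as the probability of a global step, and the sphere test function.

```python
import math
import random

def levy_step(rng, beta=1.5):
    """Draw one Levy-distributed step (Mantegna's algorithm, an assumption here)."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0.0, sigma)
    v = rng.gauss(0.0, 1.0) or 1e-12     # guard against division by zero
    return u / abs(v) ** (1 / beta)

def fpa_minimise(f, dim, bounds, n=15, p=0.8, iters=300, seed=0):
    """Minimal flower pollination sketch; returns (best_position, best_cost)."""
    rng = random.Random(seed)
    lo, hi = bounds
    clip = lambda v: min(max(v, lo), hi)
    flowers = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    costs = [f(x) for x in flowers]
    b = min(range(n), key=lambda i: costs[i])
    g_best, g_cost = list(flowers[b]), costs[b]
    for _ in range(iters):
        for i in range(n):
            x = flowers[i]
            if rng.random() < p:         # global pollination, Eq. (62)
                new = [clip(xd + 0.01 * levy_step(rng) * (xd - gd))
                       for xd, gd in zip(x, g_best)]
            else:                        # local pollination between two flowers
                j, k = rng.sample(range(n), 2)
                eps = rng.random()
                new = [clip(xd + eps * (a - c))
                       for xd, a, c in zip(x, flowers[j], flowers[k])]
            c = f(new)
            if c < costs[i]:             # keep the new pollen only if it is better
                flowers[i], costs[i] = new, c
                if c < g_cost:
                    g_best, g_cost = list(new), c
    return g_best, g_cost

def sphere(x):
    """Simple test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

solution, cost = fpa_minimise(sphere, dim=2, bounds=(-5.0, 5.0))
```

Because the Lévy steps take both signs, the sign convention inside equation (62) does not change the search behaviour in this sketch.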

Neuromorphic Engineering

Neuromorphic engineering is the science of building analog circuits using very large scale integration (VLSI) that mimic the functioning of biological neurons. Such neuromorphic computation devices can be engineered using oxide-based memristors, spintronic memories, threshold switches, and transistors. A key aspect of neuromorphic engineering is understanding how an individual neuron works. Based on that understanding, the morphology of a


single circuit, the application, and the overall architecture can be decided, which in turn determines the computation strategies of the system (Mead, 1990; Maan et al., 2016). Neuromorphic engineering is an interdisciplinary subject that takes inspiration from biology, physics, mathematics, computer science, and electronic engineering to design artificial neural systems, such as vision systems, head-eye systems, auditory processors, and autonomous robots, whose physical architecture and design principles are based on those of biological nervous systems (Boddhu et al., 2012). It was developed by Carver Mead in the late 1980s.

Neurological Inspiration

The inspiration is based entirely on the working and organization of the biological brain. Neuromorphic machines focus on replicating the analog nature of biological computation and the role of neurons in cognition. In a biological brain, all functions are controlled by analog chemical signals running through the neurons. Such complexity is hard to mimic because modern computers are digital. However, the characteristics of these parts can be abstracted into mathematical functions that closely capture the essence of the neuron’s operations (Furber, 2016).
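One widely used abstraction of this kind is the leaky integrate-and-fire neuron, sketched below. The chapter does not name a specific model, so the choice of model and all parameter values here are assumptions made for illustration.

```python
def simulate_lif(current, dt=1e-3, tau=0.02, v_rest=0.0,
                 v_thresh=1.0, v_reset=0.0, resistance=1.0):
    """Leaky integrate-and-fire: tau * dV/dt = -(V - v_rest) + R * I(t)."""
    v, trace, spikes = v_rest, [], []
    for step, i_t in enumerate(current):
        v += dt / tau * (-(v - v_rest) + resistance * i_t)   # forward-Euler update
        if v >= v_thresh:          # threshold crossed: emit a spike and reset
            spikes.append(step)
            v = v_reset
        trace.append(v)
    return trace, spikes

# A weak input lets the membrane settle below threshold; a strong one makes it spike.
quiet_trace, quiet_spikes = simulate_lif([0.5] * 200)
active_trace, active_spikes = simulate_lif([1.5] * 200)
```

The membrane potential integrates its input and leaks back toward rest, so the neuron fires only when the drive is strong enough, and it fires faster as the drive increases.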

Neuromorphic Prototypes

One of the first neuromorphic prototypes was reported in 2006 by researchers at Georgia Tech, who published a field-programmable neural array (Farquhar et al., 2006). The chip used increasingly complex arrays of floating-gate transistors that allowed the charge on the gates of MOSFETs to be programmed to model the ion-channel characteristics of neurons in the brain. In November 2011, researchers at MIT created a chip that mimics the analog, ion-channel-based communication in a synapse between neurons using 400 transistors and CMOS manufacturing techniques (Poon et al., 2011). In June 2012, researchers at Purdue University published a design for a neuromorphic chip that uses lateral spin valves and memristors. They stated that the architecture of the chip works like the brain and, therefore, can be used to


perform tests by reproducing the brain’s processing. They also stated that these chips would consume less power than conventional chips (Sharad et al., 2012). The European Union funded a series of projects at the University of Heidelberg, which led to the development of BrainScaleS. BrainScaleS uses above-threshold analog circuits to implement physical models of neuronal processes. These circuits run at 10,000 times biological speed. BrainScaleS also uses wafer-scale integration to accommodate the increased speed of the interconnected analog circuits. The Blue Brain Project, led by Henry Markram, aims to build biologically detailed digital reconstructions and simulations of the mouse brain. The Blue Brain Project has created in silico models of rodent brains while attempting to replicate as many details of their biology as possible. The supercomputer-based simulations offer new perspectives on understanding the structure and functions of the brain.

Neuromorphic Sensors

The event camera, also known as a neuromorphic camera, silicon retina, or dynamic vision sensor, is an imaging sensor that responds to local changes in brightness. Event cameras do not capture images using a shutter as conventional cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur and staying silent otherwise.
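The per-pixel behaviour of an event camera can be imitated in software by comparing two brightness frames and reporting only the pixels that changed. The sketch below assumes a log-brightness contrast threshold, which is a common model of such sensors; the threshold value, the frame data, and the function name are hypothetical.

```python
import math

def events_between(prev_frame, next_frame, threshold=0.2, eps=1e-6):
    """Return (row, col, polarity) for pixels whose log-brightness changed by more than threshold."""
    events = []
    for r, (row_a, row_b) in enumerate(zip(prev_frame, next_frame)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            delta = math.log(b + eps) - math.log(a + eps)
            if abs(delta) > threshold:
                # +1 means the pixel got brighter, -1 means it got darker.
                events.append((r, c, 1 if delta > 0 else -1))
    return events

before = [[0.5, 0.5], [0.5, 0.5]]
after = [[0.5, 1.0], [0.1, 0.5]]
events = events_between(before, after)
```

Unchanged pixels produce no output at all, which is why event streams are sparse compared with full frames.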

Conclusion

CPU chips are becoming more power efficient without compromising their ability to handle multiple tasks at the same time. However, these modern chips can perform only specific kinds of tasks and computation, and they cannot develop their own ways of solving problems. Neuromorphic chips hold the promise of more power-efficient future computing that can solve complex problems quickly. Furthermore, biologically inspired algorithms for computing are improving day by day. The wide range of genetic algorithms, neural networks, and biological activities are grabbing the


attention of researchers looking to derive applications from them. Brain-inspired chips are becoming more complex in terms of processing. Whether drawn from AI, GA, the immune system, or human life itself, this wide scope of concepts is set to change people's lifestyles.

References

Boddhu, S. K., Gallagher, J. C. (2012). Qualitative Functional Decomposition Analysis of Evolved Neuromorphic Flight Controllers. Applied Computational Intelligence and Soft Computing, 1–21. doi:10.1155/2012/705483.
Chu, S. C., Tsai, P. W. (2007). Computational intelligence based on the behavior of cats. Int J Innov Comput Inf Contr, 3, 163–173.
Deb, S., Fong, S., Tian, Z., Wong, R. K., Mohammed, S., Fiaidhi, J. (2016). Finding approximate solutions to NP-hard optimization and TSP problems using elephant search algorithm. J Supercomput. doi:10.1007/s11227-0161739-2.
Farquhar, E., Hasler, P. (2006). A field programmable neural array. IEEE International Symposium on Circuits and Systems, 4114–4117.
Freund, R., Kari, L., Oswald, M., Sosík, P. (2005). Computationally universal P systems without priorities: two catalysts are sufficient. Theoretical Computer Science, 330(2), 251–266. doi:10.1016/j.tcs.2004.06.029. ISSN 0304-3975.
Furber, S. (2016). Large-scale neuromorphic computing systems. Journal of Neural Engineering, 13(5), 1–15.
Abbasi-Ghalehtaki, R., Khotanlou, H., Esmaeilpour, M. (2016). Fuzzy evolutionary cellular learning automata model for text summarization. Swarm Evol. Comput., 30, 11–26.
He, Q., Hu, X., Ren, H., Zhang, H. (2015). A novel artificial fish swarm algorithm for solving large-scale reliability-redundancy application problem. ISA Trans., 59, 105–113. doi:10.1016/j.isatra.2015.09.015. Epub 2015 Oct 23. PMID: 26474934.
Li, X. L., Lu, F., Tian, G. H., Qian, J. X. (2004). Applications of artificial fish school algorithm in combinatorial optimization problems. Journal of Jiangdong University, 34(5), 64–67.
Maan, A. K., Jayadevi, D. A., James, A. P. (2016). A Survey of Memristive Threshold Logic Circuits. IEEE Transactions on Neural Networks and Learning Systems, 99, 1734–1746.
Magalhães-Mendes, J. (2013). A Comparative Study of Crossover Operators for Genetic Algorithms to Solve the Job Shop Scheduling Problem. WSEAS Transactions on Computers, vol. 12, pp.
Mead, C. (1990). Neuromorphic electronic systems. Proceedings of the IEEE, 78(10), 1629–1636. doi:10.1109/5.58356.
Meng, X.-B., Gao, X. Z., Lu, L., Liu, Y., Zhang, H. (2016). A new bio-inspired optimisation algorithm: Bird Swarm Algorithm. Journal of Experimental & Theoretical Artificial Intelligence, 28(4), 673–687. doi:10.1080/0952813X.2015.1042530.
Mirjalili, S., Lewis, A. (2016). The whale optimization algorithm. Adv Eng Software, 95, 51–67.
Nseef, S. K., Abdullah, S., Turky, A., Kendall, G. (2016). An adaptive multi-population artificial bee colony algorithm for dynamic optimisation problems. Knowledge-Based Systems, 104. https://doi.org/10.1016/j.knosys.2016.04.005.
Ozturk, C., Hancer, E., Karaboga, D. (2015). A novel binary artificial bee colony algorithm based on genetic operators. Inf Sci, 297, 154–170.
Păun, G. (2001). P systems with active membranes: attacking NP-complete problems. Automata, Languages and Combinatorics, 6(1), 75–90.
Păun, G. (1998). Computing with Membranes. TUCS Report 208. Turku Centre for Computer Science. ISBN 978-952-12-0303-9.
Păun, G. (2006). Introduction to Membrane Computing. Applications of Membrane Computing. Springer Berlin Heidelberg, pp. 1–42. ISBN 978-3-540-29937-0.
Păun, G., Rozenberg, G. (2002). A guide to membrane computing. Theoretical Computer Science, 287(1), 73–100.
Phelps, S., McBurney, P., Parsons, S. (2010). Evolutionary mechanism design: a review. Auton Agent Multi-Agent Syst, 21, 237–264. https://doi.org/10.1007/s10458-0099108-7.
Poon, C. S., Zhou, K. (2011). Neuromorphic silicon neurons and large-scale neural networks: challenges and opportunities. Front Neurosci, 5:108. doi:10.3389/fnins.2011.00108. PMID: 21991244; PMCID: PMC3181466.
Pradhan, P. M., Panda, G. (2012). Solving multiobjective problems using cat swarm optimization. Expert Systems with Applications, 39(3), 2956–2964. ISSN 0957-4174. https://doi.org/10.1016/j.eswa.2011.08.157.
Sharad, M., Augustine, C., Panagopoulos, G., Roy, K. (2012). Proposal for neuromorphic hardware using spin devices. arXiv preprint arXiv:1206.3227.
Uymaz, S. A., Tezel, G., Yel, E. (2015). Artificial algae algorithm (AAA) for nonlinear global optimization. Applied Soft Computing, 31, 153–171. ISSN 1568-4946. https://doi.org/10.1016/j.asoc.2015.03.003.
Wang, G.-G., Deb, S., Coelho, L. (2015). Elephant herding optimization. 3rd International Symposium on Computational and Business Intelligence. IEEE. doi:10.1109/ISCBI.2015.8.

Chapter 6

Feature Selection and Classification of Microarray Cancer Dataset: Review and Challenges

Santosini Bhutia* and Bichitrananda Patra
Department of Computer Science and Engineering, Siksha O Anusandhan (Deemed to be) University, Bhubaneswar, India

Abstract

Cancer is becoming a serious public health problem due to its increasing prevalence and fatality rate around the world. Microarray technology has become a widely used tool for diagnosing such critical diseases. A fast and accurate method is needed for cancer diagnosis and for drug discovery that helps eradicate the disease from the body. Raw microarray gene expression data contain an enormous number of features but only a small number of samples, which makes assigning samples to the correct class a challenging task. The data also contain noisy, irrelevant, and redundant genes that result in poor diagnosis and classification. Hence, researchers have employed various machine learning algorithms to retrieve the most relevant features from gene expression data. This chapter gives a comprehensive study of microarray gene expression data together with feature selection and classification algorithms, and finally discusses future challenges.

Keywords: microarray data, feature selection, classification



Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9 © 2023 Nova Science Publishers, Inc.


Introduction

The genetic information of an organism is represented by deoxyribonucleic acid (DNA), which is contained in every cell. A coding segment of this DNA is known as a gene. DNA microarray technology gives an overall view of the cell, so that a normal cell can be easily distinguished from a cancerous cell (Patra, 2011). The way cancer is diagnosed is changing now that molecular biomarkers are being used as part of a routine diagnostic panel (Lopez-Rincon, Alejandro, et al., 2018). Cancer is a leading cause of mortality, according to the World Health Organization (WHO) (Piscaglia and Ogasawara, 2018). Men are most commonly diagnosed with lung, prostate, colorectal, stomach, and liver cancers, while women are most commonly diagnosed with breast, colorectal, lung, cervical, and thyroid cancers (Bray, Freddie, et al., 2018). In 2012, around 14 million new cases of cancer were recorded, and that number is expected to rise by around 70% in the next two decades (Hambali, Oladele and Adewole, 2020). Cancer is a deadly disease, but the cancer mortality rate can be reduced by early diagnosis and prognosis. There are some traditional methods used for the diagnosis of cancer, but these approaches are time-consuming, not cost-effective, and sometimes give erroneous outcomes. Gene expression profiles from microarray datasets have recently been used to investigate features and improve prediction accuracy (Qiu, Wang and Liu, 2005). Microarray technology produces the gene expression data that help in the diagnosis of diseases such as cancer. It can measure on the order of ten thousand data points simultaneously, providing high throughput. The raw data are first collected and then pre-processed to eliminate redundant and noisy data.
Finally, data mining algorithms are used to extract biological information from the data (Agapito, Guzzi, and Cannataro, 2017). Microarray data contain thousands of genes but comparatively few samples, which gives rise to the “curse of dimensionality.” However, not all of these genes are relevant to the classification of cancer; only some of them are highly correlated with the class and improve classification accuracy. Hence, efficient feature selection approaches are used to identify highly relevant features before the classification of cancer (Dash, Patra, and Tripathy, 2012). Because each sample has a large number of features and the number of samples is small, feature selection becomes very challenging (Dash and Patra,


2014). The majority of the genes are ineffective at distinguishing between class labels and thus have no practical application (Golub, Todd R., et al., 1999). Feature selection is the process of obtaining a subset of relevant features from a larger feature set according to some feature selection criteria. It aids in data processing scale compression by removing redundant and irrelevant genes. The better feature selection results in improved learning accuracy and reduced learning time (Zhao, Zheng, et al., 2010). Feature selection methods are often divided into four types (Almugren and Alshamlan, 2019). Firstly, the filter technique, includes evaluating each characteristic independently by utilising its statistical features. Secondly, the wrapper technique selects the best feature subset using machine learning methods. The accuracy of the specified classifier is used to assess the wrapper technique’s quality. Thirdly, an embedded technology that is built into the classifier searches for the best feature subset, the search space is united in the hypothesis space. Lastly, a hybrid technique takes advantage of both the filter and wrapper techniques. As a result, it combines the filter technique’s computational productivity with the wrapper technique’s higher performance. The purpose of the feature selection technique is to discover the smallest and most informative collection of characteristics with the least amount of time complexity, which will result in greater classification accuracy than the whole set of features (Guyon and Elisseeff, 2006). Several studies have predicted the region’s discriminative prognosis. Cancer disease prognosis and prediction rely heavily on potential biomarkers. In comparison to pathologists, machine learning (ML) offers substantial benefits. Machine learning is an intelligent and automatic learning technique that enables machines to learn without being explicitly programmed. 
ML techniques have been accepted as effective in the analysis of gene expression data and are widely used to solve many complex real-world problems. ML improves the accuracy of predicting cancer susceptibility, mortality, and recurrence by about 15 to 25% (Kourou, Exarchos, Karamouzis and Fotiadis, 2015). Machine learning is a promising technique for future cancer diagnosis because it reduces manual labor. In oncology, ML algorithms are being used to equip doctors for better prevention, diagnosis, treatment, and care. ML algorithms have often surpassed traditional statistical approaches in cancer prediction and prognosis modeling. ML-based algorithms have the advantage of automating hypothesis formation and evaluation, and of assigning parameter weights to predictors based on their correlation with the outcome.


Santosini Bhutia and Bichitrananda Patra

This chapter focuses on machine learning-based cancer research and medical oncology applications. We discuss significant papers from the last five years (2017-2021) on the development of robust machine-learning models for patient diagnosis, classification, and prognosis. The material was chosen based on feature selection and classification of microarray cancer datasets using ML techniques. The objectives of the research were:

• To analyze the concept of microarray technology in cancer classification
• To analyze the concept of feature selection in Machine Learning
• To discuss various Classifiers in Machine Learning
• To analyze some existing methodologies of cancer classification using Machine Learning
• To discuss various evaluation measures used for performance calculation
• To consider different cancer datasets (both binary and multi-class) with varying dimensions for classification

In the following sections, we introduce the background, including microarray technology, feature selection, classifiers, and datasets, in Section 2, and then review existing work on cancer classification using feature selection and classification techniques in Section 3. Sections 4 and 5 discuss evaluation measures and an analysis based on reduced features, datasets, and classifiers, respectively. Section 6 concludes the chapter by outlining some potential future directions.

Microarray Technology

Microarray technology is commonly used to diagnose cancer and is an effective tool for this purpose. A vast amount of gene expression data has been produced using DNA microarray technology. These microarray data are used to analyze genes and aid in the diagnosis of diseases such as cancer (Li, Xie and Liu, 2018). However, the small sample size, high dimensionality, and class imbalance of these datasets pose a challenge. To address this problem, feature selection is used to identify strongly correlated features that can be used to diagnose a certain condition (Patra and Bisoyi, 2018).


The classification of diseases is a challenge in itself. Feature selection is crucial in classification, and the quality of the selected features determines the classifier’s performance. A classification model trained on a dataset with a small sample size and high-dimensional features does not achieve good accuracy. As a result, feature selection is essential for microarray data dimension reduction (Elkhani and Muniyandi, 2016). Data cleaning is first performed during the preprocessing stage, followed by feature selection to retain only the informative features; the dataset is then split into a training set and a testing set. The learning model is then trained to diagnose cancer subtypes using the training set, as shown in Figure 1.

Figure 1. Block diagram of microarray technology.

Feature Selection

Managing a high-dimensional dataset is a challenging task (Dash and Patra, 2020) (Salem, Attiya and El-Fishawy, 2017). The data in the dataset are noisy and redundant, which reduces the performance of the task.


Table 1. Different types of feature selection techniques

| Method | Feature | Advantages | Disadvantages |
| Filter | Variance; Chi-square; Correlation; Information value; Mutual information filter | Independent of specific algorithm; Fast; Simple | Performance is poor; No interaction with the classification model |
| Wrapper | Genetic algorithm; Forward selection; Backward elimination; Exhaustive feature selection | Less error compared to other methods; Selects a nearest best subset; Interaction with the classification model | Overfitting; Dependent on a specific algorithm on which it has been tested |
| Embedded | Lasso (L1); Random forest importance; Gradient boosted trees importance | Less computationally intensive | Overfitting |
| Hybrid | Recursive feature elimination; Recursive feature addition | Takes the advantage of various methods | Time complexity may increase |

This issue can, however, be resolved with the help of data mining and machine learning techniques. Over the last two decades, many researchers have regarded feature selection as highly important in classification (Patra, Bhutia and Panda, 2020). Feature selection finds a group of features from the original feature set without changing the original meaning of the features, and it is mainly based on relevance and redundancy. Relevance analysis selects a subset of relevant features from among those that are strongly relevant, weakly relevant, and irrelevant. Redundancy analysis identifies and eliminates redundant features from the subset of relevant features, producing the final subset (Sahu, Dehuri and Jagadev, 2018) (Aziz, Verma and Srivastava, 2016). Although classifiers are the main component of microarray data analysis, feature selection is more important than the classifier used. Selecting important features has many advantages, such as reducing memory usage, improving accuracy, reducing computational cost, and mitigating overfitting. Different features are effective for different models. Data processing and feature selection are the two important steps in generating the best model (Bączkiewicz, Wątróbski, Sałabun and Kołodziejczyk, 2021). Feature selection is a technique used to select the best subset of features that aids in enhancing the prediction accuracy in


classification (Tai, Shao-Kuo, et al., 2020). The different types of feature selection techniques are filter, wrapper, embedded, and hybrid; their characteristics are presented in Table 1. Filter methods rank features based on specific criteria and do not involve any learning algorithm. As a result, filter approaches are computationally faster than wrapper methods, but because the learning procedure is ignored, they generally result in lower classification accuracy. Hybrid techniques aim to combine the best of both worlds.

Methods

Filter

In the filter method, each feature is scored by an evaluation function based on, for example, distance, information, or dependency, independently of the classification method used. Because of this independence from the classification model, the computational cost is lower. The filter method is simple and efficient, and it is carried out before classification. All of the features are ranked according to the criterion, and the top-ranking features are chosen to form a subset that is fed into the classification algorithm (Patra, Jena, Bhutia and Nayak, 2021) (Rani and Ramyachitra, 2018).

Figure 2. Filter method.
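The ranking step of the filter method can be illustrated with a minimal sketch. The scoring function used here (absolute Pearson correlation between each gene and the class label), the toy expression matrix, and the names `pearson` and `filter_select` are assumptions made for illustration; the chapter does not prescribe a particular criterion.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def filter_select(X, y, k):
    """Score every feature on its own, rank them, and keep the top k."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores

# Hypothetical toy data: 6 samples x 4 genes; label 0 = normal, 1 = tumour.
X = [
    [1.0, 5.0, 0.2, 3.0],
    [1.2, 4.0, 0.1, 3.4],
    [0.9, 6.0, 0.3, 2.8],
    [3.0, 5.5, 0.2, 3.2],
    [3.2, 4.5, 0.1, 3.6],
    [2.9, 5.0, 0.3, 3.0],
]
y = [0, 0, 0, 1, 1, 1]

selected, scores = filter_select(X, y, k=2)
print("selected gene indices:", selected)
```

No classifier is involved at any point, which is why this style of selection is cheap but blind to interactions between genes.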

Wrapper

In the wrapper method, candidate combinations of features are evaluated with a classifier as the evaluation function (Jain, I., Jain, V. K. and Jain, R., 2018) (Begum, Sarkar, Chakraborty, Sen and Maulik, 2021). It is a feedback method that finally generates a subset of optimal features. It typically employs a greedy search strategy, weighing candidate feature combinations against the evaluation criterion. The evaluation criterion is simply a performance measure that depends on the type of problem: for regression it can be p-values, R-squared, or adjusted R-squared, whereas for classification it can be accuracy, precision, recall, F1-score, and so on. Finally, it chooses the feature combination that produces the best result for the specified machine learning algorithm.


Figure 3. Wrapper method.
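A greedy wrapper search can be sketched as follows. This is only one possible instance: it assumes forward selection as the search strategy and the leave-one-out accuracy of a 1-nearest-neighbour classifier as the evaluation criterion. The function names and the toy data are hypothetical.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the given feature indices."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = None, None
        for j in range(len(X)):
            if i == j:
                continue  # the held-out sample may not vote for itself
            d = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if best_dist is None or d < best_dist:
                best_dist, best_label = d, y[j]
        correct += best_label == y[i]
    return correct / len(X)

def forward_selection(X, y, max_features):
    """Greedily add the feature that most improves the classifier score."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break  # no candidate improves the current subset
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Hypothetical toy data: gene 2 separates the classes, genes 0 and 1 do not.
X = [
    [0.5, 1.0, 0.10],
    [0.1, 1.2, 0.20],
    [0.9, 2.9, 0.15],
    [0.4, 3.0, 0.90],
    [0.2, 3.1, 1.00],
    [0.8, 1.1, 0.95],
]
y = [0, 0, 0, 1, 1, 1]

selected, best_score = forward_selection(X, y, max_features=2)
print(selected, best_score)
```

Every candidate subset requires re-evaluating the classifier, which is the source of the wrapper method's higher computational cost.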

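Embedded

Embedded methods combine qualities of the filter and wrapper methods: feature selection is carried out by algorithms that have selection built into their training procedure. LASSO and ridge regression are two popular examples of penalized approaches, both of which have built-in penalization factors to reduce overfitting; the L1 penalty of LASSO can additionally shrink coefficients exactly to zero, which removes the corresponding features.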

Figure 4. Embedded method.
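How an L1 penalty performs selection during training can be seen in a small coordinate-descent sketch of LASSO. The assumptions are a squared-error loss of the form (1/2n)·‖y − Xw‖² + α‖w‖₁, centred features with no intercept, and an arbitrary penalty value; the function names and data are illustrative only.

```python
def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 one coordinate at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction using every feature except feature j.
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w

# Hypothetical centred data: y depends strongly on feature 0, weakly on feature 1.
X = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
y = [-2.1, -1.9, 1.9, 2.1]

w = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print(w, selected)
```

The weakly related feature receives a weight of exactly zero, so the fitted model itself reports which features were kept.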

Hybrid

When the number of features is large, the wrapper method’s computational cost makes it impractical, while the filter method’s performance is less than satisfactory. Hence, the hybrid feature selection method is employed, which takes advantage of both. In hybrid approaches, a filter is first used to obtain a ranked list of features. A learning machine is then used by the wrapper method to generate and evaluate nested subsets of features following that ranking.
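The two-stage idea can be sketched as follows, assuming a binary problem with labels 0 and 1. The filter score (gap between class means) and the wrapper evaluator (training accuracy of a nearest-centroid classifier) are stand-ins chosen for brevity, and all names and data are hypothetical.

```python
def class_mean_gap(X, y, j):
    """Filter score: absolute difference between the class means of feature j."""
    a = [row[j] for row, label in zip(X, y) if label == 0]
    b = [row[j] for row, label in zip(X, y) if label == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def centroid_accuracy(X, y, features):
    """Wrapper score: training accuracy of a nearest-centroid classifier."""
    centroids = {}
    for label in set(y):
        rows = [row for row, l in zip(X, y) if l == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    correct = 0
    for row, label in zip(X, y):
        pred = min(
            centroids,
            key=lambda c: sum((row[f] - m) ** 2 for f, m in zip(features, centroids[c])),
        )
        correct += pred == label
    return correct / len(X)

def hybrid_select(X, y):
    """Stage 1: rank features with the filter. Stage 2: evaluate nested subsets."""
    p = len(X[0])
    ranked = sorted(range(p), key=lambda j: class_mean_gap(X, y, j), reverse=True)
    best_subset, best_score = [], -1.0
    for k in range(1, p + 1):
        subset = ranked[:k]  # nested subsets: top 1, top 2, ...
        score = centroid_accuracy(X, y, subset)
        if score > best_score:  # keep the smallest subset with the best score
            best_subset, best_score = subset, score
    return ranked, best_subset, best_score

X = [
    [0.5, 1.0, 0.10],
    [0.1, 1.2, 0.20],
    [0.9, 2.9, 0.15],
    [0.4, 3.0, 0.90],
    [0.2, 3.1, 1.00],
    [0.8, 1.1, 0.95],
]
y = [0, 0, 0, 1, 1, 1]

ranked, best_subset, best_score = hybrid_select(X, y)
print(ranked, best_subset, best_score)
```

Because the wrapper only visits as many subsets as there are features, instead of every combination, the cost stays close to that of the filter stage.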

Classification

Supervised and unsupervised learning are the two main types of machine learning. In supervised learning, the learner is given labeled examples, that is, inputs together with their known outputs. Examples of algorithms that have been successfully used to classify various cells are Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Bayesian Networks (BN), Multi-Layer Perceptrons (MLP), Random Forests, and Decision Trees (DT). In unsupervised learning, the learner has no prior knowledge about the output for the input data. Examples of unsupervised approaches include clustering and Self-Organizing Maps (SOM), which were initially employed to investigate the correlations between various genes. There are two primary groups of supervised techniques: classification and regression. The output variable in classification takes class labels, whereas the output variable in regression takes continuous values. The use of microarray data to classify cancer types has become increasingly widespread. Because they can learn classifiers that capture complex relationships within the data, ML approaches are well suited to microarray gene expression data (Tan and Gilbert, 2003). Six well-known ML classifiers are listed below.

• Logistic Regression
• Naïve Bayes
• K-Nearest Neighbor (KNN)
• Support Vector Machine
• Random Forest
• Decision Trees

Logistic Regression

Logistic regression is a regression model used to predict the probability that a sample belongs to a given class. It is based on a well-defined function called the logistic function, also known as the sigmoid function. In this model, the probabilities of the possible outcomes of a single trial are modeled by the sigmoid function. The main benefit of this classification model is that it is easy to implement when the predictor variables are independent, but identifying independent variables in high-dimensional data is difficult.
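The role of the sigmoid function can be shown in a few lines. The weights and bias below are hypothetical values standing in for a trained model; only the mapping from a linear score to a probability is illustrated.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class for one sample x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical coefficients for two genes.
weights, bias = [1.0, -2.0], 0.0
print(predict_proba(weights, bias, [2.0, 1.0]))  # score 0 lies on the decision boundary
print(predict_proba(weights, bias, [5.0, 0.0]))  # large positive score
```

A sample is usually assigned to the positive class when the returned probability exceeds 0.5, which corresponds to a linear score above zero.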

Naïve Bayes

The Naïve Bayes (NB) classifier is based on Bayes’ theorem of conditional probability. Probability refers to the degree of belief in this context. The conditional probability is used to categorize the data. The most essential property of this algorithm is its assumption that all features are independent given the class. Gaussian NB, Multinomial NB, and Bernoulli NB are the three types of NB-based algorithms. The benefit of this algorithm is that it requires only a small amount of training data to estimate the conditional parameters, but the estimation time is proportional to the size of the training data. If the dataset is very large, NB becomes a poor choice because estimation time and cost increase with the size of the dataset.
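A compact Gaussian NB sketch shows how the independence assumption turns the class-conditional likelihood into a product (here a sum of logs) over features. The small variance floor, the function names, and the toy data are assumptions added for illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature means and variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=log_posterior)

# Hypothetical toy data: two genes, two classes.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y = ["normal", "normal", "normal", "tumour", "tumour", "tumour"]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]))
print(predict_gaussian_nb(model, [3.1, 0.5]))
```

Training is a single pass over the data per class, which is why the estimation time grows in proportion to the size of the training set.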

K-Nearest Neighbor (KNN)

Neighbor-based classification is a type of lazy learning: it does not build a general internal model, but simply stores instances of the training data. A majority vote among the K nearest neighbors of each point is used for classification. The main advantages of this classification algorithm are that it is simple to implement and that it is resistant to noise in the training data. The model’s drawback is the need to determine K, as an inappropriate K value can lead to poor performance.
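The majority vote can be written in a few lines. The sketch assumes squared Euclidean distance and uses hypothetical training data; there is no training step because the stored samples are the model.

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a majority vote among its k nearest training samples."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two genes, two classes.
X_train = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y_train = ["normal", "normal", "normal", "tumour", "tumour", "tumour"]

print(knn_predict(X_train, y_train, [1.0, 2.1], k=3))
print(knn_predict(X_train, y_train, [2.9, 0.5], k=3))
```

An odd k is a common choice for two-class problems because it avoids tied votes.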

Support Vector Machine

The Support Vector Machine (SVM) is a supervised machine learning algorithm that builds a hyperplane in a multidimensional space. The goal of SVM is to find the best hyperplane for separating the two classes: the optimal hyperplane separates the classes while maximizing the margin between them. The margin is the distance between the hyperplane and the support vectors, the training points closest to it. SVM is very popular because it can be very effective even in high-dimensional spaces, but its main flaw is that it does not directly provide probability estimates. High accuracy can be achieved by tuning hyperparameters such as gamma, the cost parameter, and the kernel, but in practice searching for the best hyperparameters increases the computational cost and overhead.
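The geometry of the margin can be checked numerically. The sketch below does not train an SVM; it assumes a linear hyperplane with hypothetical parameters `w` and `b` is already given and measures the distance from each point to it.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1

def distance_to_hyperplane(w, b, x):
    """Geometric distance |w.x + b| / ||w|| from x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm

# Hypothetical separating hyperplane x1 + x2 - 3 = 0 and labelled points.
w, b = [1.0, 1.0], -3.0
points = [([1.0, 1.0], -1), ([0.0, 0.0], -1), ([2.0, 2.0], 1), ([3.0, 3.0], 1)]

# The closest points play the role of support vectors and define the margin.
margin = min(distance_to_hyperplane(w, b, x) for x, _ in points)
print(margin)
```

For the points with a score of exactly ±1, the distance equals 1/‖w‖, so maximizing the margin is equivalent to minimizing the norm of `w`.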


Random Forest

Random Forest (RF) is one of the most well-known ensemble techniques and makes use of multiple decision trees. Diversity among the trees is achieved by randomizing the node split selection criteria: rather than always choosing the best feature overall, each split is selected from a random subset of features. It is a bagging method in which deep trees are combined to produce a low-variance output. When it receives an input x, where x is a vector of features, the RF passes x to each of its n decision trees and then aggregates their results to determine the final prediction.

Decision Trees

A decision tree generates classification rules from a given set of attributes. The decision tree procedure, in general, starts at a root and splits the data on attributes according to the information gain score. A decision tree has three sorts of nodes: the root node, the test nodes, and the decision nodes. A decision tree is constructed using an algorithm such as ID3, CART, or CHAID, which automatically defines the tree. The algorithm’s main goal is to create an efficient decision tree for the given dataset, and it employs a variety of criteria to determine whether a node should be divided into two or more sub-nodes. The main benefit of the decision tree is that it requires very little data pre-processing and can handle both numerical and categorical data. However, the fundamental problem with this classifier is overfitting.
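The information gain score that drives each split can be computed directly from class entropies. The sketch assumes a binary split of one numeric feature at a threshold; the function names, thresholds, and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting at value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - weighted

# Hypothetical expression values of one gene and the class of each sample.
values = [0.10, 0.20, 0.15, 0.90, 1.00, 0.95]
labels = [0, 0, 0, 1, 1, 1]

print(information_gain(values, labels, 0.5))   # split that separates the classes
print(information_gain(values, labels, 0.12))  # split that isolates one sample
```

A tree-building algorithm would evaluate such gains for every candidate feature and threshold and split on the best one, repeating the process in each sub-node.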

Dataset

Classification is performed on gene expression microarray datasets, which are available online and can be downloaded from different sources. These datasets are either binary or multiclass. A common problem in real-world datasets is class imbalance: one or more classes contain many more samples than the others, resulting in poor classification. Table 2 describes some commonly used datasets, both binary and multiclass.


Table 2. Description of the dataset

| Dataset | No of Instances | No of Features | No of Class |
| Brain Tumour | 90 | 5921 | 5 |
| Breast Cancer | 569 | 32 | 2 |
| Cervical Cancer | 858 | 36 | 2 |
| Leukaemia | 72 | 5148 | 2 |
| Leukaemia | 72 | 11226 | 3 |
| Lung | 203 | 12601 | 5 |
| Lymphoma | 148 | 19 | 3 |
| Pancreatic | 590 | 14 | 3 |
| Prostate | 102 | 10510 | 2 |
| SRBCT | 83 | 2309 | 3 |

Related Work

Many data mining and machine learning applications use feature selection. The fundamental objective of feature selection is to find a subset of features that minimizes classifier prediction errors. Researchers, scientists, and medical professionals have carried out a plethora of studies on classification based on feature selection and have proposed various types of models for cancer prediction. Some of these publications are discussed in this section and summarized in Table 3.

Haq, Amin Ul, et al. (2021) proposed supervised (Relief algorithm) and unsupervised (Autoencoder, PCA algorithms) methods for feature selection; the selected features were then used to train and evaluate a support vector machine classifier for effective and timely breast cancer diagnosis. Two breast cancer datasets were used in this study: Breast Cancer Wisconsin (WBC) and Breast Cancer Wisconsin Diagnosis (WDBC). The k-fold cross-validation method was used for model validation and optimal hyperparameter selection, and the model’s performance was evaluated using standard performance evaluation metrics. According to the authors, the features obtained by the Relief algorithm were more closely associated with the correct identification of breast cancer than the features selected by the Autoencoder and PCA algorithms. The proposed Relief-based method achieved the best accuracy, reaching 99.91 percent.

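Table 3. Related work based on different methods with classification accuracy

| Study | Dataset | No of Instances | No of Features | No of Class | Feature Selection Techniques | Optimisations | Classification | Acc |
| Haq et al. (2021) | Breast | 699 | 11 | 2 | Relief, Auto-encoder, PCA | | SVM (Linear) | 99.91 |
| | Breast | 569 | 32 | 2 | | | | |
| Dash, Thulasiram and Thulasiraman (2019) | Dermatology | 366 | 34 | 6 | CFA | | SVM | 98.28 |
| | Lung | 203 | 12600 | 5 | | | | 94.09 |
| | Brain1 | 90 | 5920 | 5 | | | | 84.44 |
| | Brain2 | 50 | 12000 | 4 | | | | 76.00 |
| | Leukemia | 72 | 11225 | 3 | | | | 88.89 |
| Sahu, Panigrahi and Rout (2020) | Breast | 569 | 32 | 2 | PCA | | ANN | 97 |
| Dash (2016) | Brain Tumor1 | 90 | 5920 | 5 | Relief-F | | DECORATE | 94.44 |
| | 11 Tumors | 174 | 12533 | 11 | | | | 91.95 |
| Ke et al. (2018) | SRBCT | 83 | 2308 | 4 | Score-based criteria fusion | | SVM, KNN | 100 |
| | Leukemia | 72 | 7129 | 2 | | | | 99.98 |
| | Colon | 62 | 2000 | 2 | | | | 99.85 |
| | Carcinomas | 174 | 9182 | 11 | | | | 99.95 |
| | Prostate | 102 | 10509 | 2 | | | | 99.95 |
| | Sonar | 208 | 60 | 2 | | | | 99.86 |
| | Breast | 569 | 30 | 2 | | | | 99.95 |
| | Arrhythmia | 452 | 278 | 16 | | | | 99.65 |
| Prabhakar and Lee (2020) | Ovarian | 253 | 15154 | 2 | Correlation Coefficient, T-Statistics, Kruskal-Wallis | GBCO, AAO | SVM, LR | 99.48, 99.48 |
| Al-Rajab, Lu and Xu (2021) | Colon | 62 | 2000 | 2 | IG+GA, mRMR | | NB | 94 |
| | Colon | 36 | 7457 | 2 | | | | 100 |
| Mandal et al. (2021) | Arrhythmia | 452 | 279 | | Mutual Information, Relief, Chi-square, Xvariance | WOA | SVM, NB, KNN | 94.50 |
| | Leukemia | 72 | 5147 | | | | | 100 |
| | Lymphoma | 77 | 7070 | | | | | 100 |
| | Prostate | 102 | 12532 | | | | | 100 |
| Ab Hamid et al. (2021) | Breast | 286 | 9 | 2 | Information Gain, Gain Ratio, Chi-Squared, Relief-F | | PSO-SVM | 96.15 |
| | Lymphoma | 148 | 18 | 4 | | | | 96.62 |
| Zhang and Cao (2019) | Colon | 62 | 2000 | 2 | FSBRR, MI | | KNN, SVM, RF | 92.01 |
| | Nervous System | 60 | 7129 | 2 | | | | 80.17 |
| | Lymphoma | 47 | 4026 | 2 | | | | 82.99 |
| | P53 Mutants | 16772 | 5409 | 2 | | | | 94.31 |
| | Arcene | 200 | 10000 | 2 | | | | 85.67 |
| | Breast | 1097 | 21548 | 2 | | | | 86.26 |
| | GBM | 528 | 18348 | 2 | | | | 80.95 |
| | TSP | 163 | 319 | 2 | | | | 78.63 |
| Lu et al. (2017) | Leukemia | 34 | 7130 | 2 | MIM, AGA | | ELM | 97.62 |
| | Colon | 62 | 2000 | 2 | | | | 89.09 |
| | Prostate | 34 | 12600 | 2 | | | | 96.54 |
| | Lung | 149 | 12535 | 2 | | | | 97.80 |
| | Breast | 19 | 24482 | 2 | | | | 82.47 |
| | SRBCT | 63 | 2309 | 4 | | | | 94.66 |
| Salem, Attiya and El-Fishawy (2017) | Leukemia | 72 | 7129 | 2 | IG, SGA | | Genetic Programming | 97.06 |
| | Colon | 62 | 2000 | 2 | | | | 85.48 |
| | CNS | 60 | 7129 | 2 | | | | 86.67 |
| | Lung-Ontario | 39 | 2880 | 2 | | | | 74.4 |
| | Lung-Michigan | 96 | 7129 | 2 | | | | 100 |
| | Lymphoma | 77 | 7129 | 2 | | | | 94.80 |
| | Prostate | 136 | 12600 | 2 | | | | 100 |
| Aziz, Verma and Srivastava (2016) | Colon | 62 | 2000 | 2 | ICA, FBFE | PCA | SVM, NB | 90.09 |
| | Leukemia | 72 | 7129 | 2 | | | | 95.12 |
| | Prostate | 102 | 12600 | 2 | | | | 88.12 |
| | Brain | 50 | 12625 | 2 | | | | 79.21 |
| | Lung | 181 | 12533 | 2 | | | | 95.42 |
| Rani and Ramyachitra (2018) | Lymphoma | 220 | 22284 | 3 | SMO | | SVM | 98.99 |
| | Dermatomyositis | 366 | 34 | 6 | | | | 98.79 |
| | Hepatitis-C | 123 | 22278 | 4 | | | | 98.86 |
| | Ovarian | 283 | 54622 | 3 | | | | 98.65 |
| | Lung | 181 | 2533 | 2 | | | | 99.89 |
| Begum et al. (2021) | Prostate | 102 | 12533 | 2 | SU | | ASVM | 93.86 |
| | Lymphoma | 77 | 7070 | 2 | | | | 93.78 |
| | Leukemia | 72 | 7129 | 2 | | | | 93.56 |
| | ALLGSE 412 | 110 | 8280 | 2 | | | | 85.89 |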


Dash, Thulasiram and Thulasiraman (2019) proposed a hybrid swarm-intelligence-based meta-search algorithm combining Conditional Mutual Information Maximization and the Firefly Algorithm. Fireflies iteratively share information, boosting the search effectiveness of the chaos-based firefly algorithm and reducing the computational complexity of feature selection. SVM was then used as the classifier, and the model’s performance was evaluated on a variety of high-dimensional disease datasets.

Sahu, Panigrahi and Rout (2020) proposed a computer-aided diagnostic (CAD) method for the Wisconsin breast cancer dataset (WBCD). PCA/LDA was applied before classification to reduce the dimensionality and thereby increase the classification rate. The classification was performed using multilayer neural networks, and the results were evaluated for accuracy, sensitivity, specificity, precision, and recall. PCA-ANN achieved a 97% accuracy rate, which is higher than other state-of-the-art methods.

Dash (2016) proposed an ensemble framework for multiclass imbalanced classification using a meta-learning approach called ‘decorate’ (Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples) together with a sampling technique. Because the sampling strategy corrects the bias in the class distribution to produce a uniform distribution, the meta-learning algorithm achieved significant improvements in classification accuracy over traditional ensembles.

Ke, Wenjun et al. (2018) proposed a score-based criteria fusion (SCF) feature selection approach for cancer prediction to improve the prediction performance of the classification method. The method combines two feature selection methods, Symmetric Uncertainty (SU) and Relief-F. Five open gene microarray datasets (SRBCT, Leukaemia, Colon, Carcinomas, Prostate) and three low-dimensional datasets (Sonar, Breast, and Arrhythmia) were used. The results, obtained with two classifiers, SVM and KNN (k=3), show that the selected features outperform those of several well-known feature selection approaches, namely Relief-F, SU, FCBF, MRMR, DISR, JMIM, and NMIFS.

Prabhakar and Lee (2020) proposed an integrated approach to feature selection carried out in two steps. The features were first chosen using standard gene selection methods such as the Correlation Coefficient, T-Statistics, and the Kruskal-Wallis test. The selected features were then further optimized by four stochastic optimization algorithms: Central Force Optimization (CFO), Lightning Attachment Procedure Optimization (LAPO), Genetic Bee Colony Optimization (GBCO), and Artificial Algae


Optimization (AAO). Finally, five different classifiers were used to analyze ovarian cancer classification. The best results were obtained when the Kruskal-Wallis test with GBCO was combined with an SVM classifier, giving an accuracy of 99.48%, and when the Correlation Coefficient test with AAO was combined with Logistic Regression, also giving an accuracy of 99.48%.

Al-Rajab, Lu and Xu (2021) proposed a two-stage multi-filter hybrid method of feature selection for colon cancer that combines Information Gain and a Genetic Algorithm. The genes discovered by this method were then filtered and ranked using the minimum Redundancy Maximum Relevance (mRMR) strategy. Finally, the Decision Tree, K-Nearest Neighbour, and Naïve Bayes classifiers produced better results when applied with this hybrid framework.

Mandal, Singh, Ijaz, Shafi and Sarkar (2021) proposed a tri-stage wrapper-filter-based feature selection method. In the first stage, four filter methods (Mutual Information, ReliefF, Chi-Square, and Xvariance) were used to create an ensemble, and each feature from the union set was then evaluated using three classification algorithms (Support Vector Machine, Naïve Bayes, and K-Nearest Neighbors), with an average accuracy calculated for each. The features with better accuracy were chosen to form a subset of optimal features. In the second stage, Pearson correlation was employed to exclude strongly associated features. In these two stages, the XGBoost classification method was employed to find the most contributing features. In the final stage, the obtained feature subset was passed to a meta-heuristic method known as the whale optimization algorithm to further reduce the feature set and improve accuracy. Four publicly available disease datasets were used in this study: Arrhythmia, Leukaemia, DLBCL, and Prostate. The results show that the proposed method outperforms numerous state-of-the-art algorithms.

Ab Hamid, Tengku Mazlin Tengku, et al. (2021) proposed a novel ensemble filter feature selection method with the harmonized classification of Particle Swarm Optimization (PSO) and Support Vector Machine (SVM) (Ensemble-PSO-SVM). They suggested an ensemble of multi-filters that incorporates the inter-correlation between features: Information Gain (IG), Gain Ratio (GR), Chi-squared (CS), and Relief-F (RF). A hybrid classification technique using PSO and SVM was used to optimize the search for significant features and kernel parameters simultaneously without sacrificing accuracy. The proposed method was examined on the Breast cancer and Lymphography datasets, obtaining classification accuracies of 96.15% and 96.62%, respectively,


which outperformed existing methods such as PSO-SVM and classical SVM.

Zhang and Cao (2019) proposed a filter feature selection technique based on redundant removal (FSBRR). First, vertical relevance (the relationship between a feature and the class attribute) and horizontal relevance (the relationship between one feature and another) were used as two redundancy criteria. Second, an approximate redundancy feature framework based on mutual information (MI) was designed to quantify the redundancy criteria and eliminate redundant and irrelevant features. The proposed model was examined on the Colon, Nervous system, DLBCL, p53 Mutants, Arcene, BRCA, GBM, and TSP datasets, and three distinct classifiers, K-Nearest Neighbor, Support Vector Machine, and Random Forest, were used to evaluate its efficiency. The experimental results show that the FSBRR algorithm can successfully reduce the feature dimension and enhance classification accuracy.

Lu, Huijuan, et al. (2017) proposed a hybrid feature selection technique that combines mutual information maximization (MIM) and adaptive genetic algorithms (AGA). The proposed MIMAGA selection approach considerably reduces the dimension of the gene expression data by eliminating redundancies. The experimental analysis was done on the Leukaemia, Colon, Prostate, Lung, Breast, and SRBCT cancer datasets. Four different classifiers, backpropagation neural network (BP), support vector machine (SVM), extreme learning machine (ELM), and regularized ELM (RELM), were tried on the reduced gene expression datasets and compared with some traditional feature selection algorithms. According to the experimental results, the reduced gene expression data delivers the highest classification accuracy.

Salem, Attiya and El-Fishawy (2017) proposed a hybrid methodology combining Information Gain (IG) and the Standard Genetic Algorithm (SGA). IG was used for feature selection and a Genetic Algorithm (GA) for feature reduction. Finally, Genetic Programming was used for cancer classification on seven microarray datasets: Leukaemia, Colon, CNS, Lung-Ontario, Lung-Michigan, DLBCL, and Prostate. The results were compared with other approaches, and improved classification performance was obtained.

Aziz, Verma and Srivastava (2016) proposed a feature selection algorithm that integrates independent component analysis (ICA) and fuzzy backward feature elimination (FBFE). FBFE selects the independent components of the DNA microarray data to increase the performance of the support vector machine (SVM) and Naive Bayes (NB) classifiers while keeping computing costs low. The experiment was done on five microarray datasets: Colon, acute Leukaemia, Prostate, Lung, and high-grade glioma. The proposed approach was compared


to principal component analysis (PCA), a standard algorithm, and effectively improved the performance of the SVM and NB classifiers in terms of accuracy.

Rani and Ramyachitra (2018) proposed a new swarm intelligence technique called the Spider Monkey Optimization (SMO) algorithm to reduce the number of features. The initial population was generated from the dataset, and fitness was evaluated using SVM classification accuracy. The stopping criterion was checked to decide whether to continue or stop the process, and the best subset of features with high classification accuracy was obtained.

Begum, Sarkar, Chakraborty, Sen and Maulik (2021) proposed an active learning model that used a support vector machine (SVM) together with symmetrical uncertainty (SU) for feature selection to predict cancer on four gene expression datasets: Prostate, DLBCL, Leukaemia, and ALLGSE412. To evaluate the efficiency of the proposed model, the authors employed two other feature selection approaches, namely CBAE and GRAE, and the proposed method outperformed both.

Performance Evaluation Measures

Different measures, such as Accuracy, Sensitivity, Specificity, Positive Predictive Value, Negative Predictive Value, and F-measure, are used to evaluate the performance of all classifiers. Accuracy is the ratio of the sum of true positives and true negatives to the total population.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \times 100 \tag{1} \]

Sensitivity is the ratio of true positives to all actual positives.

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \times 100 \tag{2} \]

Specificity is the ratio of true negatives to all actual negatives.

\[ \text{Specificity} = \frac{TN}{TN + FP} \times 100 \tag{3} \]


Santosini Bhutia and Bichitrananda Patra

where TP (True Positive) means the cells are cancerous and we predict that they are cancerous. TN (True Negative) means the cells are non-cancerous and we predict that they are non-cancerous. FP (False Positive) means we predict that the cells are cancerous even though they are non-cancerous. FN (False Negative) means we predict that the cells are non-cancerous even though they are cancerous.
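As a worked illustration of Equations (1) to (3), the short sketch below computes the three measures from confusion-matrix counts. The function name and the counts are hypothetical, and the sketch assumes that both classes are present so that no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity as percentages (Eqs. 1-3)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": 100.0 * (tp + tn) / total,   # Eq. (1)
        "sensitivity": 100.0 * tp / (tp + fn),   # Eq. (2)
        "specificity": 100.0 * tn / (tn + fp),   # Eq. (3)
    }

# Hypothetical counts: 50 cancerous and 50 non-cancerous samples
metrics = classification_metrics(tp=45, tn=40, fp=10, fn=5)
```

With these counts, 85 of 100 samples are classified correctly, 45 of the 50 cancerous samples are detected, and 40 of the 50 non-cancerous samples are correctly rejected.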

Result and Analysis Based on the Following Feature Selection

In this survey, several cancer datasets were taken for investigation. The studies reviewed here report that removing irrelevant genes enhances classification accuracy, as shown in Table 3. Table 3 lists the cancer datasets, feature selection techniques, classifiers, and classification accuracy of the existing models. The number of features remaining after feature reduction is summarized in Table 4. A good feature selection approach should have high learning accuracy while requiring few processing resources. According to the survey, the prediction accuracy of cancer detection techniques needs further improvement for efficient and accurate detection at early stages, which in turn supports better treatment and recovery. The main goal is to improve classification accuracy and reduce computational time.

Dataset

The datasets used in this analysis were obtained from a variety of online repositories. The datasets were both binary and multi-class. Figure 5 illustrates the ten microarray datasets reviewed in this survey, along with their average classification accuracy. Except for CNS and Brain tumors, all datasets have an average classification accuracy greater than 90%.


Table 4. Comparison of the original number of features with a reduced number of features

Year: 2021 2019 2018 2016 2018 2020 2021 2021 2021 2019 2017 2016

Dataset: Breast Dermatology Lung Brain1 Brain2 Leukemia Breast Brain Tumor1 11 Tumors SRBCT Leukaemia Colon Carcinomas Prostate Ovarian Colon Arrhythmia Leukaemia Lymphoma Prostate Breast Lymphoma Colon Nervous System Lymphoma P53 Mutants Arcene Breast GBM TSP Leukemia Colon Prostate Lung Breast SRBCT

No. of Features: 11 32 34 12600 5920 12000 11225 32 5920 12533 2308 7129 2000 9182 10509 15154 2000 7457 279 5147 7070 12532 9 18 2000 7129 4026 5409 10000 21548 18348 319 7130 2000 12600 12535 24482 2309

No. of Selected Features: 9 6 10 12 8 7 7 30 30 20 25 15 1-200 1-200 50-150 22 35 3 4 4 3 5 8 34 37 29 39 51 148 61 108 7 19 3 3 6 28

Dataset          No. of Features   No. of Selected Features
Leukemia         7129              3
Colon            2000              60
CNS              7129              38
Lung-Ontario     2880              11
Lung-Michigan    7129              9
Lymphoma         7129              110
Prostate         12600             26


Table 4. (Continued)

Year   Dataset           No. of Features   No. of Selected Features
2016   Colon             2000              30
       Leukemia          7129              35
       Prostate          12600             50
       Brain             12625             25
       Lung              12533             80
2018   Lymphoma          22284             5
       Dermatomyositis   34                4
       Hepatitis-C       22278             5
       Ovarian           54622             4
       Lung              2533              5
2021   Prostate          12533             41
       Lymphoma          7070              29
       Leukaemia         7129              26
       ALLGSE412         8280              45

Figure 5. Analysis based on datasets (axes: Datasets, Average Accuracy).

Classifier

Several existing studies have been reviewed in this survey, with SVM accounting for 45 percent of the total. However, adopting the SVM is difficult because its performance can only be improved by fine-tuning hyperparameters such as cost, gamma, and kernel. With high-dimensional data, tuning these takes a long time, so the time complexity of the model increases sharply. NB and KNN each account for 14 percent of the research considered, whereas RF, LR and GP each account for roughly 5 percent. ANN, DECORATE and ELM each account for just 4 percent of the overall work examined. Among ML approaches, RF is well known for its ensemble approach, and using RF as a classifier is expected to improve performance. In a future study, RF may therefore be used to measure performance with various feature selection and metaheuristic optimization techniques.
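The cost of tuning cost, gamma, and kernel can be made concrete with a minimal exhaustive grid search, sketched below. The grid values are hypothetical, and `toy_score` is a stand-in for a cross-validated accuracy estimate; it does not train an SVM. The point of the sketch is only that the number of model evaluations is the product of the grid sizes.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Score every hyperparameter combination and keep the best one."""
    names = list(grid)
    best_params, best_score, n_evaluated = None, float("-inf"), 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated

# Hypothetical search space for the three SVM hyperparameters named above
svm_grid = {
    "cost": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["linear", "rbf", "poly"],
}

def toy_score(params):
    """Stand-in for cross-validated accuracy; NOT a real SVM evaluation."""
    bonus = 0.05 if params["kernel"] == "rbf" else 0.0
    return 0.9 - abs(params["cost"] - 10) / 1000 - abs(params["gamma"] - 0.01) + bonus

best_params, best_score, n_evaluated = grid_search(toy_score, svm_grid)
```

Even this small grid requires 4 × 4 × 3 = 48 evaluations. With 5-fold cross-validation that becomes 240 model fits, each on thousands of gene features, which is why tuning dominates the running time on microarray data.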

Conclusion

A complex disease like cancer is among the greatest threats to human life. The development of microarray technology supports more accurate cancer diagnosis. In microarray data analysis, feature selection and classification are the two important steps. Feature selection is an efficient technique for handling high-dimensional data: by deleting irrelevant and redundant features, it can save computational time, increase learning accuracy, and make learning models or data easier to comprehend. In this study, we have discussed some of the existing methods for feature selection, after which classification techniques are applied to evaluate the classification accuracy. Random Forest is well known among ML techniques, and using it as a classifier is expected to improve performance. In future research, Random Forest could therefore be used to quantify performance with various feature selection and metaheuristic optimization strategies.

References

Ab Hamid TM, Sallehuddin R, Yunos ZM, Ali A. Ensemble based filter feature selection with harmonize particle swarm optimization and support vector machine for optimal cancer classification. Machine Learning with Applications. 2021 Sep 15;5:100054.
Almugren N, Alshamlan H. A survey on hybrid feature selection methods in microarray gene expression data for cancer classification. IEEE Access. 2019 Jun 14;7:78533-48.
Al-Rajab M, Lu J, Xu Q. A framework model using multifilter feature selection to enhance colon cancer classification. PLoS ONE. 2021 Apr 16;16(4):e0249094.
Aziz R, Verma C, Srivastava N. A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data. Genomics Data. 2016 Jun 1;8:4-15.


Bączkiewicz A, Wątróbski J, Sałabun W, Kołodziejczyk J. An ANN Model Trained on Regional Data in the Prediction of Particular Weather Conditions. Applied Sciences. 2021 Jan;11(11):4757.
Begum S, Sarkar R, Chakraborty D, Sen S, Maulik U. Application of active learning in DNA microarray data for cancerous gene identification. Expert Systems with Applications. 2021 Sep 1;177:114914.
Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians. 2018 Nov;68(6):394-424.
Dash S, Patra B, Tripathy BK. A Hybrid Data Mining Technique for Improving the Classification Accuracy of Microarray Data Set. International Journal of Information Engineering & Electronic Business. 2012 Apr 1;4(1).
Dash S, Patra B. Feature selection algorithms for classification and clustering in bioinformatics. In Global Trends in Intelligent Computing Research and Development 2014 (pp. 111-130). IGI Global.
Dash S, Patra B. Genetic diagnosis of cancer by evolutionary fuzzy-rough based neural-network ensemble. In Data Analytics in Medicine: Concepts, Methodologies, Tools, and Applications 2020 (pp. 645-662). IGI Global.
Dash S, Thulasiram R, Thulasiraman P. Modified firefly algorithm with chaos theory for feature selection: A predictive model for medical data. International Journal of Swarm Intelligence Research (IJSIR). 2019 Apr 1;10(2):1-20.
Dash S. A diverse meta learning ensemble technique to handle imbalanced microarray dataset. In Advances in Nature and Biologically Inspired Computing 2016 (pp. 1-13). Springer, Cham.
Elkhani N, Muniyandi RC. Review of the effect of feature selection for microarray data on the classification accuracy for cancer data sets. International Journal of Soft Computing. 2016;11(5):334-42.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999 Oct 15;286(5439):531-7.
Guyon I, Elisseeff A. An introduction to feature extraction. In Feature extraction 2006 (pp. 1-25). Springer, Berlin, Heidelberg.
Guzzi PH, Cannataro M. Challenges in microarray data management and analysis. In 2011 24th International Symposium on Computer-Based Medical Systems (CBMS) 2011 Jun 27 (pp. 1-6). IEEE.
Hambali MA, Oladele TO, Adewole KS. Microarray cancer feature selection: review, challenges and research directions. International Journal of Cognitive Computing in Engineering. 2020 Jun 1;1:78-97.
Haq AU, Li JP, Saboor A, Khan J, Wali S, Ahmad S, Ali A, Khan GA, Zhou W. Detection of breast cancer through clinical data using supervised and unsupervised feature selection techniques. IEEE Access. 2021 Feb 1;9:22090-105.
Jain I, Jain VK, Jain R. Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification. Applied Soft Computing. 2018 Jan 1;62:203-15.


Ke W, Wu C, Wu Y, Xiong NN. A new filter feature selection based on criteria fusion for gene microarray data. IEEE Access. 2018 Oct 5;6:61065-76.
Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Computational and structural biotechnology journal. 2015 Jan 1;13:8-17.
Li Z, Xie W, Liu T. Efficient feature selection and classification for microarray data. PLoS ONE. 2018 Aug 20;13(8):e0202167.
Lopez-Rincon A, Tonda A, Elati M, Schwander O, Piwowarski B, Gallinari P. Evolutionary optimization of convolutional neural networks for cancer miRNA biomarkers classification. Applied Soft Computing. 2018 Apr 1;65:91-100.
Lu H, Chen J, Yan K, Jin Q, Xue Y, Gao Z. A hybrid feature selection algorithm for gene expression data classification. Neurocomputing. 2017 Sep 20;256:56-62.
Mandal M, Singh PK, Ijaz MF, Shafi J, Sarkar R. A tri-stage wrapper-filter feature selection framework for disease classification. Sensors. 2021 Aug 18;21(16):5571.
Patra B, Bhutia S, Panda N. Machine learning techniques for cancer risk prediction. Test Eng. Manage. 2020 May;83:7414-20.
Patra B, Bisoyi SS. CFSES optimization feature selection with neural network classification for microarray data analysis. In 2018 2nd International Conference on Data Science and Business Analytics (ICDSBA) 2018 Sep 21 (pp. 45-50). IEEE.
Patra B, Jena L, Bhutia S, Nayak S. Evolutionary hybrid feature selection for cancer diagnosis. In Intelligent and Cloud Computing 2021 (pp. 279-287). Springer, Singapore.
Patra B. Reliability Analysis of Classification of Gene Expression Data Using Efficient Gene Selection Techniques. International Journal of Computer Science Engineering & Technology. 2011 Dec 1;1(11).
Piscaglia F, Ogasawara S. Patient selection for transarterial chemoembolization in hepatocellular carcinoma: importance of benefit/risk assessment. Liver Cancer. 2018;7(1):104-19.
Prabhakar SK, Lee SW. An integrated approach for ovarian cancer classification with the application of stochastic optimization. IEEE Access. 2020 Jul 1;8:127866-82.
Qiu P, Wang ZJ, Liu KR. Ensemble dependence model for classification and prediction of cancer and normal gene expression data. Bioinformatics. 2005 Jul 15;21(14):3114-21.
Rani RR, Ramyachitra D. Microarray cancer gene feature selection using spider monkey optimization algorithm and cancer classification using SVM. Procedia computer science. 2018 Jan 1;143:108-16.
Sahu B, Dehuri S, Jagadev A. A study on the relevance of feature selection methods in microarray data. The Open Bioinformatics Journal. 2018 Jul 31;11(1).
Sahu B, Panigrahi A, Rout SK. DCNN-SVM: A new approach for lung cancer detection. In Recent Advances in Computer Based Systems, Processes and Applications 2020 Jun 14 (pp. 97-105). CRC Press.
Salem H, Attiya G, El-Fishawy N. Classification of human cancer diseases by gene expression profiles. Applied Soft Computing. 2017 Jan 1;50:124-34.
Tai SK, Dewi C, Chen RC, Liu YT, Jiang X, Yu H. Deep learning for traffic sign recognition based on spatial pyramid pooling with scale analysis. Applied Sciences. 2020 Oct 7;10(19):6997.


Tan AC, Gilbert D. Ensemble machine learning on gene expression data for cancer classification. 2003.
Zhang B, Cao P. Classification of high dimensional biomedical data based on feature selection using redundant removal. PLoS ONE. 2019 Apr 9;14(4):e0214406.
Zhao Z, Morstatter F, Sharma S, Alelyani S, Anand A, Liu H. Advancing feature selection research. ASU feature selection repository. 2010:1-28.

Part II: Application of Bioinformatics Tools and Databases

Chapter 7

Machine Learning Methods in Bioinformatics

Charles Oluwaseun Adetunji1,*, Frank Abimbola Ogundolie2, Olugbemi Tope Olaniyan3, Omosigho Omoruyi Pius1, Kehinde Kazeem Kanmodi4 and Lawrence Achilles Nnyanzi4

1Applied Microbiology, Biotechnology and Nanotechnology Laboratory, Department of Microbiology, Edo University Iyamho, Auchi, Edo State, Nigeria
2Department of Biotechnology, Baze University Abuja, Nigeria
3Laboratory for Reproductive Biology and Developmental Programming, Department of Physiology, Edo University Iyamho, Nigeria
4School of Health and Life Sciences, Teesside University, Middlesbrough, United Kingdom

Abstract

Machine learning (ML), a form of artificial intelligence, enables computers to use learning algorithms such as supervised learning (SL), semi-supervised learning (SSL), unsupervised learning (USL), optimization and reinforcement learning, artificial neural networks (ANN), and best first tree (BFT), among others, to learn from data and improve their performance. This learning mechanism has found use in many sectors, including virtual personal assistants, speech recognition, self-driving cars/trains, biomedicine, and genomic research. In genomic analysis, ML has been effective in analyzing and interpreting data sets, and its advances have been applied in biomedicine for effective drug prediction and discovery, disease prediction and diagnosis, drug repositioning, and cancer research, among others.

* Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics. Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9. © 2023 Nova Science Publishers, Inc.

Keywords: machine learning, artificial intelligence, computational, algorithms, bioinformatics, biomedicine, drug design

Introduction

Machine learning has been applied for many decades, having first been articulated in 1959 by Arthur Lee Samuel (Naresh et al., 2020). The current application of this type of artificial intelligence system in bioinformatics has further improved the accuracy of research and development in several fields, ranging from biomedical applications such as neuroimaging and disease prediction and diagnosis to climatic prediction, control of air pollution, molecular phenotypes, more precise protein prediction, precision agriculture through precision farming, and even precision or personalized medicine (Serra et al., 2018). Serra et al. (2018) reported the application of machine learning (ML) in bioinformatics to study how existing drugs, and their substrates or compounds with known side effects, can be used to treat other or new diseases. This is called drug repositioning: drugs or compounds already on the market are studied and then used to treat diseases other than those they were initially designed for (Sleigh & Barton, 2010). The concept rests on the ability of drugs to bind to multiple or different targets, a property called polypharmacology (Boran and Iyengar, 2010; Reddy and Zhang, 2013). An example is aspirin, initially designed as an analgesic to reduce pain, which is now used in stroke management and cancer treatment, for which successful trials and observational studies have been reported by various authors (Amory and Amory, 2007; Knox et al., 2011; Baron, 2012), as well as in the treatment of athero-thrombosis (Patrono et al., 2015), myocardial infarction (Dalen, 2009), and heart attack (Reddy and Zhang, 2013). Napolitano et al. (2013) reported that a good drug repositioning approach must be able to integrate three different omics views: the drug’s targets, chemical structure, and genome-wide gene expression measures.


Recent advances in machine learning and bioinformatics have revolutionized biomedicine and improved healthcare. Today, these advances support the prediction of unknown or unfamiliar new diseases (Naresh et al., 2020; Adetunji et al., 2022a-l; Olaniyan et al., 2022a, b; Oyedara et al., 2022), better diagnosis of diseases associated with the genes of the host organism (Barman et al., 2019), and better prediction of known genetic-based diseases such as cancer (Hossain et al., 2019; Chen et al., 2020; Rani et al., 2020), Alzheimer’s disease (Tan et al., 2021) progression and mortality, using gene prediction based on analysis of the respective gene expression motifs (Hossain et al., 2019), coronary artery disease (Orlenko et al., 2020), and neurodegenerative diseases such as Parkinson’s disease (Saeed et al., 2022). They also help explain how the progression of Alzheimer’s disease can be influenced by type 2 diabetes (Chowdhury et al., 2020). With the aid of ML coupled with bioinformatics, drug discovery, targeting and development using platforms such as MODELLER, MD simulation and the PDB database are now much better understood than they were decades ago (Fadare et al., 2021). Macesic et al. (2017) reported the promising application of ML in bioinformatics for combating antimicrobial resistance. Using data from next-generation sequencing of antimalarial-resistant strains, together with data from electronic health records used to predict phenotypes susceptible to antimicrobial resistance, they were able to discover the resistance genes in the strains and suggest better treatments. More recently, ML in bioinformatics has also been useful in understanding multi-drug resistance through the genes involved (Adetunji et al., 2022m) and the relationship between the microbial community and human health (Adetunji et al., 2022n, o).

Recent Trends in the Application of Machine Learning in Bioinformatics Techniques

Haoyu et al. (2018) revealed that, through advances in genomic technology and bioinformatics research, the biological interpretation of big data using machine learning approaches has improved tremendously. Computer algorithms and applications are now being used to analyze complex data from the Human Genome Project. Machine learning, computational statistics, mathematical optimization and bioinformatics are used to establish and validate biological databases.


The use of machine learning and bioinformatics in protein, amino acid and nucleic acid research has led to an easier and better understanding of protein-protein interaction (Havugimana et al., 2017; Zhang et al., 2017), genetics and genomics (García et al., 2009; Libbrecht and Noble, 2015), genetic engineering (Alley et al., 2020), enzyme modification (Singh et al., 2021), and protein/enzyme engineering (Vanella et al., 2022) using directed evolution or rational design to tailor enzyme properties to meet the demands of academia and industry (Siedhoff et al., 2020). In recent years, it has become possible to genetically engineer proteins with a broad spectrum of industrial applications to enhance their properties for effective use in various biotechnology processes (Ogundolie, 2015; Ayodeji et al., 2017; Ogundolie, 2021; Ogundolie et al., 2022; Adetunji et al., 2023a, b). Bioinformatics tools facilitate the retrieval, storage and analysis of biological data on proteins and genes. Large volumes of data are obtained through high-throughput methods such as microarrays and DNA sequencing, and their interpretation is enhanced by machine learning platforms and algorithms. Furthermore, random forest, as a technique in bioinformatics, combines features and an ensemble of decision trees to achieve high predictive accuracy in biological data analysis. Random forest can be used for the classification of samples in gene expression studies, disease-linked gene analysis, genome-wide analysis, protein sequence identification, and protein-protein interactions. In bioinformatics, transforming large volumes of data from biomedical research into valuable knowledge for diagnostic, therapeutic and predictive purposes remains a challenge. Machine learning is a state-of-the-art approach applicable to several disciplines, providing a comprehensive perspective in bioinformatics research. Large amounts of medical data, such as omics, signal, and image data, are fed into machine learning algorithms on bioinformatics platforms for treatment, prediction, pattern interaction, and diagnostic purposes. Machine learning algorithms commonly deployed for the bioinformatics analysis of biological data in proteomics, genomics, transcriptomics, and systems biology include support vector machines, hidden Markov models, random forests, Gaussian networks, decision trees and Bayesian networks. Artificial intelligence methods such as deep neural networks, recurrent neural networks, convolutional neural networks, graph neural networks, variational autoencoders, generative adversarial networks, and emergent architectures have advanced speech and image recognition, language processing and translation, finger joint recognition, and splice junction detection in DNA sequences.
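The ensemble idea behind random forest, as described above, can be sketched in a few lines. This is a deliberately reduced illustration and not the full random forest algorithm: each "tree" is a one-feature decision stump, only binary labels are handled, and the two-gene toy dataset is hypothetical. What it does retain are the three ingredients of the method: bootstrap sampling, random feature choice, and majority voting.

```python
import random
from collections import Counter

def train_stump(X, y, rng):
    """Fit a one-feature threshold rule on a bootstrap sample of (X, y)."""
    idx = [rng.randrange(len(X)) for _ in X]   # bootstrap: sample with replacement
    feature = rng.randrange(len(X[0]))         # pick one feature at random
    by_class = {}
    for i in idx:
        by_class.setdefault(y[i], []).append(X[i][feature])
    if len(by_class) < 2:                      # bootstrap sample held a single class
        only = next(iter(by_class))
        return lambda row: only
    (c0, v0), (c1, v1) = sorted(by_class.items())
    m0, m1 = sum(v0) / len(v0), sum(v1) / len(v1)
    threshold = (m0 + m1) / 2                  # midpoint between the class means
    low, high = (c0, c1) if m0 <= m1 else (c1, c0)
    return lambda row: high if row[feature] > threshold else low

def train_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(X, y, rng) for _ in range(n_trees)]

def predict(forest, row):
    """Majority vote over the individual stump predictions."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical expression values for two genes across six samples
X = [[5.1, 0.2], [4.8, 0.4], [5.5, 0.1], [1.0, 3.3], [1.2, 3.0], [0.7, 3.5]]
y = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
forest = train_forest(X, y)
```

Calling `predict(forest, [5.0, 0.3])` returns the label chosen by most stumps. Individual stumps may be weak or even degenerate, but the vote over many of them is what gives the ensemble its stability.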


The role of machine learning in bioinformatics is wide-ranging, from sequence analysis to structure reconstruction and prediction, biomolecular functions and properties, diagnosis, biomedical image processing, systems biology and biomolecule interaction (Dash et al., 2017; Rahman et al., 2018; Sahu et al., 2018; Dash et al., 2020; Dash et al., 2021). Baldi and Brunak (2001) reported that, because of the large amount of complex data and databases, experts in bioinformatics use artificial intelligence and machine learning techniques to mine biological data and discover knowledge from it. Chen et al. (2018) showed that miRNAs are small non-coding RNAs that regulate gene expression post-transcriptionally through a complementary base-pairing mechanism. The authors used machine learning methods such as deep learning classifiers to predict miRNA-mRNA interactions from features such as seed region, site accessibility, evolutionary conservation, and free energy. Pedro et al. (2005) revealed that different machine learning models, such as supervised classification, clustering and probabilistic graphical models, are used in bioinformatics for knowledge discovery, determination and optimization in several biological domains. The authors demonstrated the importance in bioinformatics of computational models such as Bayesian classifiers, discriminant analysis, logistic regression, classification trees, neural networks, nearest neighbor, support vector machines, ensembles of classifiers, partitional clustering, hierarchical clustering, mixture models, hidden Markov models, Bayesian networks, Gaussian networks, Monte Carlo algorithms, simulated annealing, genetic algorithms (GAs), tabu search, estimation of distribution algorithms and genetic programming.
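The seed-region feature mentioned above for miRNA-mRNA interaction prediction can be illustrated with a small sketch. It assumes the common convention that the seed spans nucleotides 2 to 8 from the 5' end of the miRNA and that a candidate site is a perfect reverse-complement match in the mRNA. Both sequences below are toy inputs chosen for illustration, and real predictors such as the deep learning classifiers cited above combine this feature with several others.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Reverse complement of an RNA string written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def seed_match_positions(mirna, utr, seed_start=1, seed_end=8):
    """Return 0-based start positions in `utr` that pair perfectly with the seed."""
    seed = mirna[seed_start:seed_end]   # nucleotides 2-8 (0-based slice 1:8)
    site = reverse_complement(seed)     # sequence the mRNA must contain
    width = len(site)
    return [i for i in range(len(utr) - width + 1) if utr[i:i + width] == site]

# Toy sequences (assumptions for illustration only)
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAACUACCUCAGG"
positions = seed_match_positions(mirna, utr)
```

Here the seed is `GAGGUAG`, its reverse complement is `CUACCUC`, and that site occurs once in the toy UTR. A binary "seed match present" flag or a match count of this kind is the sort of input a classifier could use alongside accessibility, conservation, and free-energy features.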
Siraj-Ud (2019) showed that genomic research has produced large volumes of data without adequate understanding, and that machine learning algorithms such as Artificial Neural Networks, Alternating Decision Tree, Adaptive Neuro-Fuzzy Inference System, Bayes Net, C4.5, Best First Tree, Classification and Regression Trees, Decision Tree, Case Based Reasoning, Differential Evolution Support Vector Machine, Genetic Programming, Emerging Pattern, Genetic Algorithm, IBk, Gated Recurrent Unit - Support Vector Machine, J48, KStar, k-Nearest Neighbors, K-means and Support Vector Machine, k-Nearest Neighbor Genetic Algorithm, k-Nearest Neighbor + Independent Component Analysis, Logit Boost, Linear Regression, Logistic Model Tree, Multifactor Dimensionality Reduction, NNge, Multilayer Perceptron, Naive Bayesian + k-Nearest Neighbors, Naive Bayesian, NNANF, PART, Random Forest, Radial Basis Function Neural Network, Support Vector Machine, Simple Logistics, Sequential Minimal Optimization, Self-Organizing Maps, Sequential Minimal Optimization - Support Vector Machine, Softmax Regression, and ZeroR are good classifiers in bioinformatics research, as judged by performance measures based on the confusion matrix such as Recall/Sensitivity, Precision, Specificity, F-Score and Accuracy. Pengyi et al. (2010) used an ensemble learning technique, which has the major advantage of handling small sample sizes, complex data structures and high dimensionality, to solve complex bioinformatics challenges. Auslander et al. (2021) incorporated machine learning techniques into the bioinformatics framework to analyze molecular evolution, systems biology, protein structure analysis, disease genomics and molecular pathways. These applications of bioinformatics techniques can address challenges in clinical and biological research in feature extraction, prediction, and selection models. Jayanthi and Mahesh (2019) wrote extensively on the importance of incorporating machine learning algorithms into bioinformatics research areas such as sequence analysis, genomic annotation, gene expression analysis, mutation analysis, protein structure prediction, modelling, and high-throughput image processing. In their study, the authors reported that, owing to the rise in the acquisition of biological data, the application of computational biology to biomedical data analysis has advanced using machine learning algorithms such as Supervised Learning, Semi-supervised Learning, Unsupervised Learning, optimization and reinforcement learning. Bhuiyan et al. (2021) revealed that ozone exposure over the last few decades has posed health and environmental risks, resulting in a need to quantify ozone concentration for proper monitoring.
The authors suggested that machine learning and statistical modelling techniques such as Naïve Bayes, K-Nearest Neighbors, Decision Tree, Stochastic Gradient Descent, and Extreme Gradient Boosting (XGBoost), along with their ensembles, would provide an improved way of curbing the risks associated with ozone. Zhang (2019) reported that advances in the application of next-generation sequencing have enhanced data acquisition in biomedical science. Machine learning tools leverage big data from the biomedical field to predict, validate and analyze data, yielding valuable knowledge and insight into lung cancer. Machine learning algorithms uncover the transcription factors responsible for the metabolic reprogramming of small molecules and reconstruct the associations between genes and these transcription factors.
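The confusion-matrix measures listed earlier in this section (Recall/Sensitivity, Precision, and F-Score) can be derived directly from actual and predicted labels. The sketch below is a minimal illustration under the standard definitions, with F-Score taken as the harmonic mean of precision and recall. The label vectors are hypothetical, and 1 is assumed to mark the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f_score(tp, fp, fn):
    """Precision, recall and their harmonic mean, guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical labels for ten samples (1 = disease, 0 = healthy)
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision, recall, f_score = precision_recall_f_score(tp, fp, fn)
```

Reporting precision and recall together is useful on imbalanced biomedical datasets, where accuracy alone can look high even when most positive cases are missed.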


The advantages of machine learning in bioinformatics include the ability to capture hierarchical associations within data, with broad applications such as image classification, segmentation, reconstruction, localization and detection. Today, bioinformatics research relies mainly on machine learning tools to perform predictive analysis of complex biological processes. Deep learning tools are used to detect cancer metastases and patterns in the human genome, and to build predictive models. Hanif et al. (2019) added that the purpose of applying machine learning to bioinformatics is to analyze biological data and reach logical conclusions. The authors revealed that machine learning algorithms such as neural networks, decision trees, probabilistic approaches, cellular automata, genetic algorithms and hybrid methods assist researchers in molecular dynamics simulations, molecular docking analyses, vaccine discovery, identification of novel compounds, in silico structure prediction, immunoinformatics, virtual screening and ADMET property prediction. This chapter thus highlights the significant role of machine learning algorithms in bioinformatics and the analysis of big data, particularly in the biomedical, surgical, medical, environmental and biological sciences.

Conclusion

The use of ML in bioinformatics research has opened up various medical fields to better management in the health sector through precise interpretation of big data. This has resulted in earlier detection, more accurate diagnosis, and better management and curative options. More generally, diverse biotechnology industries (Adetunji et al., 2022j-l) have been improved through the use and advances of ML in bioinformatics research.

References

Adetunji, C. O., Inobeme, A., Tadso, J., Olaniyan, O. T., Abimbola, O. F., Shahnawaz, M., & Anani, O. (2022a). Potential of Plastic Waste in Enhancing the level of Pathogenicity of diverse Pathogens in the Marine Biota. In Impact of Plastic Waste on the Marine Biota (pp. 301-312). Springer, Singapore.
Adetunji, C. O., Ogundolie, F. A., Ajiboye, M. D., Mathew, J. T., Inobeme, A., Dauda, W. P., & Adetunji, J. B. (2022b). Nano-engineered Sensors for Food Processing. In Bio- and Nano-sensing Technologies for Food Processing and Packaging (pp. 151-166). Royal Society of Chemistry. doi:10.1039/9781839167966-00151.
Adetunji, C. O., Bodunrinde, R. E., Inobeme, A., Singh, K. R., Mathew, J. T., Olaniyan, O. T., Abimbola, O. F., Singh, J., Nayak, V. and Singh, R. P. (2022c). Microbial Community Analysis of Contaminated Soils. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 83-97). CRC Press.
Adetunji, C. O., Inobeme, A., Singh, K. R., Bodunrinde, R. E., Mathew, J. T., Olaniyan, O. T., Abimbola, O. F., Singh, J., Nayak, V. and Singh, R. P. (2022d). Genomic Analysis of Heavy Metal-Resistant Genes in Wastewater Treatment Plants. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 113-126). CRC Press.
Adetunji, C. O., Mathew, J. T., Singh, K. R., Bodunrinde, R. E., Inobeme, A., Olaniyan, O. T., Abimbola, O. F., Singh, J., Nayak, V. and Singh, R. P. (2022e). Molecular Characterization of Multidrug-Resistant Genes in Wastewater Treatment Plants. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 127-141). CRC Press.
Adetunji, C. O., Nwankwo, W., Olayinka, A. S., Olugbemi, O. T., Akram, M., Laila, U., Samuel, M. O., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E. and Esiobu, N. D. (2022f). Computational Intelligence Techniques for Combating COVID-19. DOI: 10.1201/9781003178903-16. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics (pp. 251-269). CRC Press.
Adetunji, C. O., Nwankwo, W., Olayinka, A. S., Olugbemi, O. T., Akram, M., Laila, U., Olugbenga, M. S., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E. and Esiobu, N. D. (2022g). Machine Learning and Behaviour Modification for COVID-19. DOI: 10.1201/9781003178903-17. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 17. eBook ISBN 9781003178903.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Garipova, L. and Shariati, M. A. (2022h). eHealth, mHealth, and Telemedicine for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_10.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Petukhova, E. and Shariati, M. A. (2022i). Machine Learning Approaches for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_8.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Isabekova, O. and Shariati, M. A. (2022j). Smart Sensing for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_9.

Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Petukhova, E. and Shariati, M. A. (2022k). Internet of Health Things (IoHT) for COVID-19. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_5.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Koriagina, N. and Shariati, M. A., (2022l). Diverse Techniques Applied for Effective Diagnosis of COVID-19. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_3.
Adetunji, C. O., Olugbemi, O. T., Akram, M., Laila, U., Samuel, M. O., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E., Esiobu, N. D., Oyedara, O. O. and Adeyemi, F. M. (2022m). Application of Computational and Bioinformatics Techniques in Drug Repurposing for Effective Development of Potential Drug Candidate for the Management of COVID-19. doi: 10.1201/9781003178903-15. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition, First Published 2022, Imprint CRC Press. Pages 14. eBook ISBN 9781003178903.
Adetunji, C. O., Samuel, M. O., Adetunji, J. B. and Oluranti, O. I., (2022n). Corn Silk and Health Benefits. DOI: 10.1201/9781003178903-11. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903.
Alley, E.C., Turpin, M., Liu, A.B., Kulp-McDowall, T., Swett, J., Edison, R., Von Stetina, S.E., Church, G.M. and Esvelt, K.M., (2020). A machine learning toolkit for genetic engineering attribution to facilitate biosecurity. Nature communications, 11(1), 1-12.
Adetunji, Oluwaseun C., John Tsado Mathew, Abel Inobeme, Olugbemi T. Olaniyan, Kshitij RB Singh, Ogundolie Frank Abimbola, Vanya Nayak, Jay Singh & Ravindra Pratap Singh (2022o). Microbial and Plant Cell Biosensors for Environmental Monitoring. In: Singh, R. P., Ukhurebor, K.E., Singh, J., Adetunji, C.O., Singh, K.R. (eds) Nanobiosensors for Environmental Monitoring. Springer, Cham. https://doi.org/10.1007/978-3-031-16106-3_9.
Adetunji, C.O., Ogundolie, F.A., Mathew, J.T., Inobeme, A., Titilayo, O., Olaniyan, O.T., Ijabadeniyi, O.A., Ajiboye, M.D., Ajayi, O.O., Dauda, W. and Ghazanfar, S., (2023a). Graphene-based nanomaterials for targeted drug delivery and tissue engineering. Novel Platforms for Drug Delivery Applications, pp.277-288. https://doi.org/10.1016/B978-0-323-91376-8.00014-8
Adetunji, C.O., Ogundolie, F.A., Mathew, J.T., Inobeme, A., Titilayo, O., Olaniyan, O.T., Ghazanfar, S., Ijabadeniyi, O.A., Ajiboye, M.D., Ajayi, O.O. and Dauda, W., (2023b). Nanotube platforms for effective drug delivery applications. Novel Platforms for Drug Delivery Applications, pp.317-332. https://doi.org/10.1016/B978-0-323-91376-8.00005-7

Amory, J. K., & Amory, D. W. (2007). Dosing frequency of aspirin and prevention of heart attacks and strokes. The American journal of medicine, 120 (4), e5–e7. https://doi.org/10.1016/j.amjmed.2006.04.023.
Auslander, N.; Gussow, A.B.; Koonin, E.V. (2021) Incorporating Machine Learning into Established Bioinformatics Frameworks. Int. J. Mol. Sci., 22, 2903. https://doi.org/10.3390/ijms22062903.
Ayodeji, A. O., Ogundolie, F. A., Bamidele, O. S., Kolawole, A. O., & Ajele, J. O. (2017). Raw starch degrading, acidic-thermostable glucoamylase from Aspergillus fumigatus CFU-01: purification and characterization for biotechnological application. J. Microbiol. Biotechnol., 6, 90-100.
Baldi, P. and Brunak, S. (2001) Bioinformatics: The Machine Learning Approach, 2nd Ed., MIT Press.
Barman, R. K., Mukhopadhyay, A., Maulik, U., and Das, S. (2019). Identification of infectious disease-associated host genes using machine learning techniques. BMC bioinformatics, 20(1), 1-12.
Baron, J. A. (2012). Aspirin and cancer: trials and observational studies. Journal of the National Cancer Institute, 104(16), 1199-1200.
Behera, R. N., Roy, M., & Dash, S. (2016). Ensemble-based hybrid machine learning approach for sentiment classification-a review. International Journal of Computer Applications, 146(6), 31-36.
Bhuiyan, M. A. M.; Sahi, R.K.; Islam, M.R.; Mahmud, S. (2021) Machine Learning Techniques Applied to Predict Tropospheric Ozone in a Semi-Arid Climate Region. Mathematics, 9, 2901. https://doi.org/10.3390/math9222901.
Boran Adw, Iyengar R. (2010) Systems approaches to polypharmacology and drug discovery. Current Opinion in Drug Discovery & Development. 13(3):297–309.
Chen H., Engkvist O., Wang Y., Olivecrona M., & Blaschke T. (2018). The rise of deep learning in drug discovery. Drug Discovery Today, 23(6):1241-1250.
Chen, L., Li, J., and Chang, M. (2020). Cancer diagnosis and disease gene identification via statistical machine learning. Current Bioinformatics, 15(9), 956-962.
Chowdhury, U. N., Islam, M. B., Ahmad, S., & Moni, M. A. (2020). Network-based identification of genetic factors in ageing, lifestyle and type 2 diabetes that influence the progression of Alzheimer’s disease. Informatics in Medicine Unlocked, 19, 100309.
Dalen J. E. (2009). Aspirin for prevention of myocardial infarction and stroke: is the right dose 81 or 160 mg/day? Journal of the American College of Cardiology, 53(21), 2010. https://doi.org/10.1016/j.jacc.2008.11.063.
Dash, S., & Abraham, A. (2018). Kernel-based chaotic firefly algorithm for diagnosing Parkinson’s disease. In International Conference on Hybrid Intelligent Systems (pp. 176-188). Springer, Cham.
Dash, S., Abraham, A., Luhach, A. K., Mizera-Pietraszko, J., & Rodrigues, J. J. (2020). Hybrid chaotic firefly decision-making model for Parkinson’s disease diagnosis. International Journal of Distributed Sensor Networks, 16(1), 1550147719895210.
Dash, S., Ahmad, M., & Iqbal, T. (2021). Mobile cloud computing: a green perspective. In Intelligent Systems. vol.185, pp:523-533, Springer, Singapore. http://doi.org/10.1007/978-981-33-6081-5-46.

Dash, S., Thulasiram, R., & Thulasiraman, P. (2017, December). An enhanced chaos-based firefly model for Parkinson’s disease diagnosis and classification. In 2017 International Conference on Information Technology (ICIT) (pp. 159-164). IEEE.
Dash, S., Thulasiram, R., & Thulasiraman, P. (2019). Modified firefly algorithm with chaos theory for feature selection: A predictive model for medical data. International Journal of Swarm Intelligence Research (IJSIR), 10(2), 1-20.
Fadare, O. A., Omisore, N. O., Adegbite, O. B., Awofisayo, O. A., Ogundolie, F. A., Adesanwo, J. K., & Obafemi, C. A. (2021). Structure based design, stability study and synthesis of the dinitrophenylhydrazone derivative of the oxidation product of lanosterol as a potential P. falciparum transketolase inhibitor and in-vivo antimalarial study. In Silico Pharmacology, 9(1), 1-16.
García, S., Fernández, A., Luengo, J., & Herrera, F. (2009). A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Computing, 13(10), 959-977.
Hanif W, Afzal M A, Ansar S, Saleem M, Ikram A, Afzal S, Khan S A F, Larra S A, Noor H. (2019) Artificial intelligence in bioinformatics. Biomedical Letters; 5(2):114-119.
Haoyu Y., Zheng A., Haotian Z. and Yawen H. (2018) Application of Machine Learning Methods in Bioinformatics. 6th International Conference on Computer-Aided Design, Manufacturing, Modeling and Simulation. 040015. 1-4. https://doi.org/10.1063/1.5039089.
Havugimana, P. C., Hu, P., & Emili, A. (2017). Protein complexes, big data, machine learning and integrative proteomics: lessons learned over a decade of systematic analysis of protein interaction networks. Expert review of proteomics, 14(10), 845-855.
Hossain, M. A., Islam, S. M. S., Quinn, J. M., Huq, F., & Moni, M. A. (2019). Machine learning and bioinformatics models to identify gene expression patterns of ovarian cancer associated with disease progression and mortality. Journal of biomedical informatics, 100, 103313.
Jayanthi K, and Mahesh C. (2019) Need of Machine Learning In Bioinformatics. International Journal of Innovative Technology and Exploring Engineering. Volume 8, Issue 11, 2608-2611. doi: 10.35940/ijitee.K1903.0981119.
Knox, C., Law, V., Jewison, T., Liu, P., Ly, S., Frolkis, A., Pon, A., Banco, K., Mak, C., Neveu, V., Djoumbou, Y., Eisner, R., Guo, A. C., and Wishart, D. S. (2011). DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic acids research, 39(Database issue), D1035–D1041. https://doi.org/10.1093/nar/gkq1126.
Libbrecht, M. W., & Noble, W. S. (2015). Machine learning applications in genetics and genomics. Nature Reviews Genetics, 16(6), 321-332.
Macesic, N., Polubriaginof, F., & Tatonetti, N.P. (2017). Machine learning: novel bioinformatics approaches for combating antimicrobial resistance. Current opinion in infectious diseases, 30(6), 511-517.
Napolitano, F., Zhao, Y., Moreira, V. M., Tagliaferri, R., Kere, J., D’Amato, M., & Greco, D. (2013). Drug repositioning: A machine-learning approach through data integration. Journal of Cheminformatics, 5(1), 30.

Naresh, E., Vijaya Kumar, B. P., & Shankar, S. P. (2020). Impact of machine learning in bioinformatics research. In Statistical modelling and machine learning principles for bioinformatics techniques, tools, and applications (pp. 41-62). Springer, Singapore.
Ogundolie, F. A. (2015). Characterization of a Purified β–Amylase from Black Marble Vine (Dioclea reflexa) Seeds (Doctoral dissertation, Federal University of Technology, Akure).
Ogundolie, F. A. (2021). Cloning of α-AMYLASE and Pullulanase Genes of Bacillus licheniformis-FAO. CP7 from cocoa (Theobroma cacao L.) Pods and Biochemical Characterization of the Expressed Enzymes (Doctoral dissertation, Federal University of Technology, Akure).
Ogundolie, F. A., Ayodeji, A. O., Olajuyigbe, F. M., Kolawole, A. O., & Ajele, J. O. (2022). Biochemical Insights into the functionality of a novel thermostable β-amylase from Dioclea reflexa. Biocatalysis and Agricultural Biotechnology, 42, 102361.
Olugbemi T. Olaniyan, Charles O. Adetunji, Mayowa J. Adeniyi, Daniel Ingo Hefft. (2022a). Machine Learning Techniques for High-Performance Computing for IoT Applications in Healthcare. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics. doi: 10.1201/9780367548445-20. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445.
Olugbemi T. Olaniyan, Charles O. Adetunji, Mayowa J. Adeniyi, Daniel Ingo Hefft. (2022b). Computational Intelligence in IoT Healthcare. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics. doi: 10.1201/9780367548445-19. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445.
Omotayo Opemipo Oyedara, Folasade Muibat Adeyemi, Charles Oluwaseun Adetunji, Temidayo Oluyomi Elufisan. (2022). Repositioning Antiviral Drugs as a Rapid and Cost-Effective Approach to Discover Treatment against SARS-CoV-2 Infection. doi: 10.1201/9781003178903-10. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903.
Orlenko, A., Kofink, D., Lyytikäinen, L. P., Nikus, K., Mishra, P., Kuukasjärvi, P., Karhunen, P.J., Kähönen, M., Laurikka, J. O., Lehtimäki, T. and Asselbergs, F.W., (2020). Model selection for metabolomics: predicting diagnosis of coronary artery disease using automated machine learning. Bioinformatics, 36(6), 1772-1778.
Patrono, C., García Rodríguez, L. A., Landolfi, R., & Baigent, C. (2005). Low-dose aspirin for the prevention of atherothrombosis. The New England journal of medicine, 353(22), 2373–2383. https://doi.org/10.1056/NEJMra052717.
Pedro Larrañaga, Borja Calvo, Roberto Santana, Concha Bielza, Josu Galdiano, Inaki Inza, Jose A. Lozano, Ruben Armañanzas, Guzman Santafe, Aritz Perez and Victor Robles (2005). Machine learning in bioinformatics. Briefings in Bioinformatics. 7(1) 86-112. doi:10.1093/bib/bbk007.
Pengyi Yang, Yee Hwa Yang, Bing B. Zhou, and Albert Y. Zomaya (2010) A review of ensemble methods in bioinformatics: Including stability of feature selection and ensemble feature selection methods. Current Bioinformatics, 5, (4):296-308.

Rahman, A. U., Dash, S., & Luhach, A. K. (2021). Dynamic MODCOD and power allocation in DVB-S2: a hybrid intelligent approach. Telecommunication Systems, 76(1), 49-61.
Rahman, A., Sultan, K., Dash, S., & Khan, M. A. (2018). Management of resource usage in mobile cloud computing. Int. J. Pure Appl. Math., 119(16), 255-261.
Rana HK, Akhtar M, Islam MB, Ahmed MB, Lió P, Huq F, Quinn JM, Moni MA. (2020) Machine learning and bioinformatics models to identify pathways that mediate influences of welding fumes on cancer progression. Scientific reports. 10(1):1-5.
Reddy, A. S., & Zhang, S. (2013). Polypharmacology: drug discovery for the future. Expert review of clinical pharmacology, 6(1), 41–47. https://doi.org/10.1586/ecp.12.74.
Saeed, F., Al-Sarem, M., Al-Mohaimeed, M., Emara, A., Boulila, W., Alasli, M., & Ghabban, F. (2022). Enhancing Parkinson’s Disease Prediction Using Machine Learning and Feature Selection Methods. Computers, Materials and Continua, 71(3), 5639-5658.
Sahu, B., Dash, S., Mohanty, S. N., & Rout, S. K. (2018). Ensemble comparative study for the diagnosis of breast cancer datasets. International Journal of Engineering & Technology, 7(4.15), 281-285.
Serra, A., Galdi, P., & Tagliaferri, R. (2018). Machine learning for bioinformatics and neuroimaging. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(5), e1248. doi:10.1002/widm.1248.
Siedhoff, N. E., Schwaneberg, U., & Davari, M. D. (2020). Machine learning-assisted enzyme engineering. Methods in enzymology, 643, 281-315.
Singh, N., Malik, S., Gupta, A., & Srivastava, K. R. (2021). Revolutionizing enzyme engineering through artificial intelligence and machine learning. Emerging Topics in Life Sciences, 5(1), 113-125.
Siraj-Ud Doulah. (2019). Application of Machine Learning Algorithms in Bioinformatics. Bioinform. Proteom. Opn Acc J, 3(1): 000127. 1-11.
Sleigh SH, Barton CL. Repurposing strategies for therapeutics. Pharm Med. 2010; 24:151–159. doi: 10.1007/BF03256811.
Tan, M. S., Cheah, P. L., Chin, A. V., Looi, L. M., and Chang, S. W. (2021). A review on omics-based biomarkers discovery for Alzheimer’s disease from the bioinformatics perspectives: Statistical approach vs machine learning approach. Computers in biology and medicine, 139, 104947.
Vanella, R., Kovacevic, G., Doffini, V., de Santaella, J. F., & Nash, M. A. (2022). High-throughput screening, next generation sequencing and machine learning: advanced methods in enzyme engineering. Chemical Communications, 58(15), 2455-2467.
Zhang, M., Su, Q., Lu, Y., Zhao, M., & Niu, B. (2017). Application of machine learning approaches for protein-protein interactions prediction. Medicinal Chemistry, 13(6), 506-514.
Zhang, Yi, (2019) “Novel Applications of Machine Learning in Bioinformatics” (2019). Theses and Dissertations--Computer Science. 83. https://uknowledge.uky.edu/cs_etds/83.

Chapter 8

Molecular Biomarkers as Health and Disease Predictors

Charles Oluwaseun Adetunji1,*, Frank Abimbola Ogundolie2, Olugbemi Tope Olaniyan3, Omosigho Omoruyi Pius1, Kehinde Kazeem Kanmodi4 and Lawrence Achilles Nnyanzi4

1Applied Microbiology, Biotechnology and Nanotechnology Laboratory, Department of Microbiology, Edo University Iyamho, Auchi, Edo State, Nigeria
2Department of Biotechnology, Baze University Abuja, Nigeria
3Laboratory for Reproductive Biology and Developmental Programming, Department of Physiology, Edo University Iyamho, Nigeria
4School of Health and Life Sciences, Teesside University, Middlesbrough, United Kingdom

Abstract

In clinical medicine, the use of biomarkers has advanced in many areas, including diagnosis, genome studies, molecular biology, and the treatment of diverse infections. Biomarkers are generated during pathogenic processes, normal physiological activity, therapeutic intervention, and pharmacological reactions. This chapter therefore provides relevant and detailed information on molecular biomarkers as health and disease predictors.

* Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9 © 2023 Nova Science Publishers, Inc.

Keywords: molecular biomarkers, health and disease predictors, pathogenic processes, normal physiological activity, therapeutic intervention, and pharmacological reaction

Introduction

Numerous indicators have been recognized as potential markers of biological processes, including pharmacological and pathogenic responses; these indicators are known as biomarkers (Tampa et al., 2018). They include normal physiological biomarkers whose values fall within the ranges seen in healthy subjects. Several challenges can cause stress in humans. A marker of such stress indicates that the person is largely not in physiological comfort and that numerous energy-consuming mechanisms are operating within the body to maintain homeostasis (Marco-Ramell et al., 2016). These stress markers are also called biomarkers. A biomarker is therefore a feature that can be measured and assessed as an indicator of pathological and physiological processes and of pharmacological responses to medical interventions (Naylor, 2003). Experts have suggested that an effective biomarker should be specific, particularly for certain illnesses. It should distinguish between various physiological conditions, exhibit sensitivity, provide prompt diagnosis, deliver essential results, and be able to differentiate between genders and ethnicities (Sahu et al., 2011). Biomarkers can be made from nanobiomaterials with biosensors (Adetunji et al., 2022a). These biomarkers also assist in investigating disease and in tracking progression, outcomes, and regression after an intervention. It is important to list the factors that can be measured in body fluids or externally. Additionally, core body temperature, pulse rate, and respiration rate are reliable indicators of social, psychological, and environmental factors (Carboni, 2013).

The use of diagnostic enzymes across a range of industries has revealed the importance of this type of protein. Enzymes are commonly used for biotechnological, biomedical, or industrial applications (Ogundolie, 2015; Ayodeji et al., 2017; Ogundolie, 2021; Ogundolie et al., 2022; Adetunji et al., 2023a, b), including the chemical, pharmaceutical, food, bakery, and brewery industries (Ogundolie, 2021; Ogundolie et al., 2022). Today they have found significant application in clinical diagnosis as biomarkers, which can either help to assess the level of microbes or cells in response to different toxicants (Gonçalves et al., 2021) or serve as diagnostic targets, such as transketolase for malaria (Fadare et al., 2021).

Specific Authors That Have Worked on Molecular Biomarkers as Health and Disease Predictors

Sahab et al. (2017) described different classes of biomarkers, including screening, antecedent, diagnostic, staging, and prognostic biomarkers. Ronald (2009) reported that public health genomic and molecular biomarkers are used to diagnose certain infections. In the delivery of health care, predicting multi-morbidity is very important for designing biomarkers.

Zacharakis et al. (2018) reported that the incidence of chronic liver diseases such as hepatocellular carcinoma has increased drastically in recent years, causing deaths globally; hence the need to develop biomarkers that enhance diagnosis. The authors noted that a biomarker must be specific and sensitive to improve surveillance and early detection, thereby leading more rapidly to a therapeutic outcome. Alpha-fetoprotein is the molecular signature for understanding the physiological processes in hepatocellular carcinoma pathogenesis. Bridget and Claire (2009) noted that the increased use of prostate-specific antigen in diagnosing prostate diseases has provided rapid results at an early stage. However, prostate-specific antigen has limitations as a biomarker for diagnosing prostate cancer because of its non-specificity, although it can be associated with indicators of prostate volume. The authors reported that specific biomarkers are needed to diagnose prostate cancer rapidly and to guide clinical management, using high-throughput technologies to analyze specific proteins or genes in urine or blood. Thomas and Pierre (2016) highlighted that computed tomography is one of the latest technological advances for the rapid screening, diagnosis, and detection of lung cancer. The authors noted that novel molecular biomarkers will provide useful data in lung cancer research, thereby improving therapeutic outcomes.

Prabir et al. (2013) analyzed the use of molecular biomarkers in clinical medicine for managing diverse disease conditions. Through high-throughput technologies, biomarkers such as kallikreins and circulating microRNAs are used in diagnosis, therapeutic management, and characterization in clinical medicine. The molecular understanding of diseases involves the identification of biomarkers in evidence-based medicine. The application of bioinformatics, through dry-lab work and computational biology, to the development of various forms of biomarkers has been a major boost to the pharmaceutical and medical sectors (Behera et al., 2016; Dash et al., 2017; Dash and Abraham, 2018; Rahman et al., 2018; Sahu et al., 2018; Dash et al., 2019; Dash et al., 2020; Rahman et al., 2021; Dash et al., 2021). Harry and William (2017) provided detailed information on the use of biomarkers in clinical medicine, which must first undergo validation, characterization, and evaluation.

Recent years have seen rapid development in the use of biomarkers to diagnose different brain diseases. Recently established blood tests for tau and amyloid in neurodegenerative diseases have been studied. Burke (2016) analyzed the exponential increase in the use of biomarkers in clinical medicine over more than 20 years, noting that the use of biomarkers for diagnosis is an integral part of decision making between patient and physician and of patient care. In the central nervous system, biomarkers can reveal biochemical changes, genetic traits, functional features, and structural alterations. The authors noted that advances in biomarker research will be useful in the prevention, treatment, and diagnosis of stroke, Alzheimer’s disease, motor neuron disease, Huntington’s disease, and Parkinson’s disease. Branca et al. (2001) reported that knowledge of biomarkers will assist in understanding diet-related diseases. The authors suggested that predicting biomarkers in line with dietary strategies will minimize the risk of disease development.

Crystal et al. (2020) showed that cardiovascular diseases such as hypertension, stroke, atherosclerosis, and heart failure are the leading cause of death globally. It is therefore important to discover novel biomarkers that will help in early detection, assessment of risk factors, and understanding of the disease condition. The authors described some protein and miRNA biomarker candidates currently being investigated for their role in the pathogenesis of cardiovascular disease and remodeling, which could assist in developing therapeutic strategies for its management. Fischer et al. (2014) used high-throughput profiling to identify biomarkers that enhance risk prediction and clarify the physiological mechanisms involved in several diseases. The authors noted that biomarkers are associated with cardiovascular diseases, non-vascular diseases, and cancer, and thus can improve the prediction of these diseases.

Dhama et al. (2019) reported that homeostasis of the internal environment is affected by several factors, including stress. Stress is known to generate oxidative free radicals that disrupt normal physiological regulation, resulting in illness. Several biomarkers are indicative of a pathological state in an organism, such as thermal stress markers, heat shock proteins, acute phase proteins, oxidative stress markers, innate immune markers, and chemical secretions in urine, blood, and saliva. These biomarkers play a critical role in the diagnosis, prediction, treatment, and prognosis of conditions such as cardiovascular, hepatic, central nervous system, and nephrological disorders and metabolic dysfunction.

Delphine et al. (2017) reported that radical surgery and neoadjuvant chemoradiation are the treatment options available for rectal cancer. They noted that the response to treatment varies from patient to patient, so there is a need for an effective approach to selecting rectal cancer patients according to their response to treatment. Radiological and clinicopathological features are less effective, owing to their limited specificity and sensitivity, than molecular biomarkers, which have the potential to adequately predict the response to radical surgery and neoadjuvant chemoradiation. Molecular biomarkers can identify the mechanism of action in a tumor and validate samples. Jennifer et al. (2018) reported that improving risk stratification in the management of cardiovascular diseases requires the discovery of novel biomarkers that predict the underlying biological pathways.
The authors identified numerous proteins, including insulin-like growth factor 1, cystatin-C, insulin-like growth factor binding protein 1, insulin-like growth factor binding protein 2, leptin, adipsin, soluble receptor for advanced glycation end products, growth differentiation factor 15, C-type lectin domain family 3 member B, tetranectin, kallikrein B1, N-terminal pro-B-type natriuretic peptide, arabinogalactan protein 1, peripheral myelin protein 2, uncarboxylated matrix Gla protein, and leptin receptor, that regulate metabolic, inflammatory, and homeostatic pathways and could improve preventive strategies in cardiovascular diseases. Scott et al. (2019) reported that the availability of important tissue-based biomarkers in prostate cancer will support active identification, surveillance, and emerging treatment options.

Richard (2004) noted that understanding neurological diseases involves the application and analysis of biomarkers. These biomarkers are constituents of body fluids, cells, and tissues in health and disease conditions. Through a molecular biology approach, diagnosis, prevention, prognosis, monitoring, and therapeutic interventions are carried out in the management of neurological diseases. Papadopoulou et al. (2022) reported that the modern approach to finding a cure or treatment for infectious diseases such as COVID-19 involves identifying cellular biomarkers and molecular mediators of the immune system, which are known to provide information and a transcriptomic signature based on the patient’s genes. The authors reported that various biomarkers, such as IL-6, neutrophil-lymphocyte ratio, procalcitonin, and white blood cell counts, have been highlighted as playing a significant role in the management of COVID-19 and SARS-CoV-2 infection. Some studies have suggested that certain circulating biomarkers are predictive of lifespan and health span. These include inflammatory, glycemic, hematological, and lipid biomarkers.

Run-Feng et al. (2021) reported that Crohn’s disease and ulcerative colitis, which are inflammatory bowel diseases, can be diagnosed, treated, and prevented using biological agents such as anti-tumor necrosis factor agents, anti-integrins, Janus kinase inhibitors, and anti-interleukin (IL)-12/IL-23 agents. Peter et al. (2017) reported that molecular biomarkers can improve lung cancer detection, screening, diagnosis, treatment, and management. Tjalf et al. (2019) reported that multiple sclerosis is a heterogeneous degenerative-inflammatory disease whose presentation varies from individual to individual. Individual prediction and characterization using biomarkers will therefore reveal the molecular mechanisms behind its pathophysiology.

Conclusion

This chapter has provided detailed information on different biomarkers that could be used for detection, analysis, prognosis, diagnosis, prediction, and risk assessment in specific therapies (Adetunji et al., 2022a-m; Olaniyan et al., 2022a, b; Oyedara et al., 2022). Detailed information was also provided on relevant molecular biomarkers that could be applied as health and disease predictors.

References Adetunji CO, Ogundolie FA, Olaniyan OT, Mathew JT, Inobeme A, Titilayo O, Ghazanfar S, Ijabadeniyi OA, Ajiboye MD, Ajayi OO, Dauda WP and Adetunji JB. (2022a).

Molecular Biomarkers as Health and Disease Predictors

229

Nanobiomaterials for Food Packaging Sensor Applications. In Bio- and Nano-sensing Technologies for Food Processing and Packaging, (pp. 167-180). Royal Society of Chemistry.
Adetunji CO, Nwankwo W, Olayinka AS, Olugbemi OT, Akram M, Laila U, Olugbenga MS, Oshinjo AM, Adetunji JB, Okotie GE and Esiobu ND. (2022b). Machine Learning and Behaviour Modification for COVID-19. DOI: 10.1201/9781003178903-17. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 17. eBook ISBN 9781003178903.
Adetunji CO, Olaniyan OT, Adeyomoye O, Dare A, Adeniyi MJ, Alex E, Rebezov M, Garipova L and Shariati MA. (2022c). eHealth, mHealth, and Telemedicine for COVID-19 Pandemic. In: Pani SK, Dash S, dos Santos WP, Chan Bukhari SA, Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_10.
Adetunji CO, Olaniyan OT, Adeyomoye O, Dare A, Adeniyi MJ, Alex E, Rebezov M, Petukhova E and Shariati MA. (2022d). Machine Learning Approaches for COVID-19 Pandemic. In: Pani SK, Dash S, dos Santos WP, Chan Bukhari SA, Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_8.
Adetunji CO, Olaniyan OT, Adeyomoye O, Dare A, Adeniyi MJ, Alex E, Rebezov M, Isabekova O and Shariati MA. (2022e). Smart Sensing for COVID-19 Pandemic. In: Pani SK, Dash S, dos Santos WP, Chan Bukhari SA, Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_9.
Adetunji CO, Inobeme A, Tadso J, Olaniyan OT, Abimbola OF, Shahnawaz M and Anani O. (2022f). Potential of Plastic Waste in Enhancing the level of Pathogenicity of diverse Pathogens in the Marine Biota. In Impact of Plastic Waste on the Marine Biota, (pp. 301-312). Springer, Singapore.
Adetunji CO, Olaniyan OT, Adeyomoye O, Dare A, Adeniyi MJ, Alex E, Rebezov M, Petukhova E and Shariati MA. (2022g). Internet of Health Things (IoHT) for COVID-19. In: Pani SK, Dash S, dos Santos WP, Chan Bukhari SA, Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_5.
Adetunji CO, Olaniyan OT, Adeyomoye O, Dare A, Adeniyi MJ, Alex E, Rebezov M, Koriagina N and Shariati MA. (2022h). Diverse Techniques Applied for Effective Diagnosis of COVID-19. In: Pani SK, Dash S, dos Santos WP, Chan Bukhari SA, Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_3.
Adetunji CO, Abimbola OF, Singh KR, Olaniyan OT, Bodunrinde RE, Inobeme A, Mathew JT, Singh J and Singh RP. (2022i). Microbe Performance and Dynamics in Activated Sludge Digestion. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 99-112). CRC Press. eBook ISBN 9781003354147. https://doi.org/10.1201/9781003354147.
Adetunji CO, Samuel MO, Adetunji JB and Oluranti OI. (2022j). Corn Silk and Health Benefits. DOI: 10.1201/9781003178903-11. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903.
Adetunji CO, Ogundolie FA, Ajiboye MD, Mathew JT, Inobeme A, Dauda WP and Adetunji JB. (2022k). Nano-engineered Sensors for Food Processing. In Bio- and Nano-sensing Technologies for Food Processing and Packaging, (pp. 151-166). Royal Society of Chemistry. DOI: 10.1039/9781839167966-00151.
Adetunji CO, Mathew JT, Inobeme A, Olaniyan OT, Singh KRB, Abimbola OF, Nayak V, Singh J and Singh RP. (2022l). Microbial and Plant Cell Biosensors for Environmental Monitoring. In: Singh RP, Ukhurebor KE, Singh J, Adetunji CO, Singh KR. (eds) Nanobiosensors for Environmental Monitoring. Springer, Cham. https://doi.org/10.1007/978-3-031-16106-3_9.
Adetunji CO, Nwankwo W, Olayinka AS, Olugbemi OT, Akram M, Laila U, Samuel MO, Oshinjo AM, Adetunji JB, Okotie GE and Esiobu ND. (2022m). Computational Intelligence Techniques for Combating COVID-19. DOI: 10.1201/9781003178903-16. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics, (pp. 251-269). CRC Press.
Adetunji CO, Bodunrinde RE, Inobeme A, Singh KR, Mathew JT, Olaniyan OT, Abimbola OF, Singh J, Nayak V and Singh RP. (2022n). Microbial Community Analysis of Contaminated Soils. In Microbial Community Studies in Industrial Wastewater Treatment, (pp. 83-97). CRC Press. https://doi.org/10.1201/9781003354147.
Adetunji CO, Inobeme A, Singh KR, Bodunrinde RE, Mathew JT, Olaniyan OT, Abimbola OF, Singh J, Nayak V and Singh RP. (2022o). Genomic Analysis of Heavy Metal-Resistant Genes in Wastewater Treatment Plants. In Microbial Community Studies in Industrial Wastewater Treatment, (pp. 113-126). CRC Press.
Adetunji CO, Ogundolie FA, Mathew JT, Inobeme A, Titilayo O, Olaniyan OT, Ijabadeniyi OA, Ajiboye MD, Ajayi OO, Dauda W and Ghazanfar S. (2023a). Graphene-based nanomaterials for targeted drug delivery and tissue engineering. Novel Platforms for Drug Delivery Applications, pp. 277-288. https://doi.org/10.1016/B978-0-323-91376-8.00014-8.
Adetunji CO, Ogundolie FA, Mathew JT, Inobeme A, Titilayo O, Olaniyan OT, Ghazanfar S, Ijabadeniyi OA, Ajiboye MD, Ajayi OO and Dauda W. (2023b). Nanotube platforms for effective drug delivery applications. Novel Platforms for Drug Delivery Applications, pp. 317-332. https://doi.org/10.1016/B978-0-323-91376-8.00005-7.
Atwater T and Massion PP. (2016). Biomarkers of risk to develop lung cancer in the new screening era. Ann Transl Med, 4(8):158. doi: 10.21037/atm.2016.03.46.
Ayodeji AO, Ogundolie FA, Bamidele OS, Kolawole AO and Ajele JO. (2017). Raw starch degrading, acidic-thermostable glucoamylase from Aspergillus fumigatus CFU-01: Purification and characterization for biotechnological application. J Microbiol Biotechnol, 6, 90-100.


Behera RN, Roy M and Dash S. (2016). Ensemble-based hybrid machine learning approach for sentiment classification-a review. International Journal of Computer Applications, 146(6), 31-36.
Branca F, Hanley AB, Pool-Zobel B and Verhagen H. (2001). Biomarkers in disease and health. British Journal of Nutrition, 85, Suppl. 1, S55-S92.
Bickers B and Aukim-Hastie C. (2009). New Molecular Biomarkers for the Prognosis and Management of Prostate Cancer – The Post PSA Era. Anticancer Research, 29: 3289-3298.
Burke HB. (2016). Predicting Clinical Outcomes Using Molecular Biomarkers. Biomarkers in Cancer, 2016; 8: 89-99. doi: 10.4137/BIC.S33380.
Burke HB and Grizzle WE. (2017). Clinical Validation of Molecular Biomarkers in Translational Medicine. Chapter 21. In: Biomarkers in Cancer Screening and Early Detection, First Edition. Edited by Sudhir Srivastava. John Wiley & Sons, Inc.
Carboni L. (2013). Peripheral biomarkers in animal models of major depressive disorder. Dis Markers, 35, 33-41. doi: 10.1155/2013/284543.
Dash S and Abraham A. (2018). Kernel-based chaotic firefly algorithm for diagnosing Parkinson’s disease. In International Conference on Hybrid Intelligent Systems, (pp. 176-188). Springer, Cham.
Dash S, Abraham A, Luhach AK, Mizera-Pietraszko J and Rodrigues JJ. (2020). Hybrid chaotic firefly decision-making model for Parkinson’s disease diagnosis. International Journal of Distributed Sensor Networks, 16(1), 1550147719895210.
Dash S, Ahmad M and Iqbal T. (2021). Mobile cloud computing: A green perspective. In Intelligent Systems, vol. 185, pp. 523-533, Springer, Singapore. http://doi.org/.1007/978-981-33-6081-5-46.
Dash S, Thulasiram R and Thulasiraman P. (2017, December). An enhanced chaos-based firefly model for Parkinson's disease diagnosis and classification. In 2017 International Conference on Information Technology (ICIT), (pp. 159-164). IEEE.
Dash S, Thulasiram R and Thulasiraman P. (2019). Modified firefly algorithm with chaos theory for feature selection: A predictive model for medical data. International Journal of Swarm Intelligence Research (IJSIR), 10(2), 1-20.
Dayde D, Tanaka I, Jain R, Tai MC and Taguchi A. (2017). Predictive and Prognostic Molecular Biomarkers for Response to Neoadjuvant Chemoradiation in Rectal Cancer. International Journal of Molecular Sciences, 18, 573; 1-20. doi: 10.3390/ijms18030573.
Dhama K, Latheef SK, Dadar M, Samad HA, Munjal A, Khandia R, Karthik K, Tiwari R, Yatoo MI, Bhatt P, Chakraborty S, Singh KP, Iqbal HMN, Chaicumpa W and Joshi SK. (2019). Biomarkers in Stress Related Diseases/Disorders: Diagnostic, Prognostic, and Therapeutic Values. Front Mol Biosci, 6:91. doi: 10.3389/fmolb.2019.00091.
Eggener SE, Rumble RB, Armstrong AJ, Morgan TM, Tony Crispino T, Cornford P, van der Kwast T, Grignon DJ, Rai AJ, Agarwal N, Klein EA, Den RB and Beltran H. (2019). Molecular Biomarkers in Localized Prostate Cancer: ASCO Guideline. J Clin Oncol, 38:1474-1494.
Fadare OA, Omisore NO, Adegbite OB, Awofisayo OA, Ogundolie FA, Adesanwo JK and Obafemi CA. (2021). Structure based design, stability study and synthesis of the dinitrophenylhydrazone derivative of the oxidation product of lanosterol as a potential P. falciparum transketolase inhibitor and in-vivo antimalarial study. In Silico Pharmacology, 9(1), 1-16.
Fischer K, Kettunen J, Wurtz P, Haller T, Havulinna AS, Kangas AJ, Soininen P, Esko T, Tammesoo M-L, Magi R, Smit S, Palotie A, Ripatti S, Salomaa V, Ala-Korpela M, Perola M, Metspalu A. (2014). Biomarker Profiling by Nuclear Magnetic Resonance Spectroscopy for the Prediction of All-Cause Mortality: An Observational Study of 17,345 Persons. PLoS Med, 11(2): e1001606. doi: 10.1371/journal.pmed.1001606.
Gonçalves AM, Rocha CP, Marques JC and Gonçalves FJ. (2021). Enzymes as useful biomarkers to assess the response of freshwater communities to pesticide exposure–A review. Ecological Indicators, 122, 107303.
Ghantous CM, Kamareddine L, Farhat R, Zouein FA, Mondello S, Kobeissy F and Zeidan A. (2020). Advances in Cardiovascular Biomarker Discovery. Biomedicines, 2020, 8, 552; 1-19. doi: 10.3390/biomedicines8120552.
Ho JE, Lyass A, Courchesne P, Chen G, Liu C, Yin X, Hwang S-J, Massaro JM, Larson MG, Levy D. (2018). Protein Biomarkers of Cardiovascular Disease and Mortality in the Community. J Am Heart Assoc, 7: e008108. DOI: 10.1161/JAHA.117.008108.
Marco-Ramell A, de Almeida AM, Cristobal S, Rodrigues P, Roncada P and Bassols A. (2016). Proteomics and the search for welfare and stress biomarkers in animal production in the one-health context. Mol Biosyst, 12, 2024-2035. doi: 10.1039/c5mb00788g.
Mazzone PJ, Sears CR, Arenberg DA, Gaga M, Gould MK, Massion PP, Nair VS, Powell CA, Silvestri GA, Vachani A and Wiener RS. (2017). Evaluating Molecular Biomarkers for the Early Detection of Lung Cancer: When Is a Biomarker Ready for Clinical Use? An Official American Thoracic Society Policy Statement, Volume 196, Number 7. 15-29. DOI: 10.1164/rccm.201708-1678ST.
Mayeux R. (2004). Biomarkers: Potential Uses and Limitations. The Journal of the American Society for Experimental NeuroTherapeutics, Vol. 1, 182-188.
Naylor S. (2003). Biomarkers: Current perspectives and future prospects. Expert Rev Mol Diagn, 3, 525-529. doi: 10.1586/14737159.3.5.525.
Ogundolie FA. (2015). Characterization of a Purified β–Amylase from Black Marble Vine (Dioclea reflexa) Seeds (Doctoral dissertation, Federal University of Technology, Akure).
Ogundolie FA. (2021). Cloning of α-Amylase and Pullulanase Genes of Bacillus licheniformis-FAO.CP7 from cocoa (Theobroma cacao L.) Pods and Biochemical Characterization of the Expressed Enzymes (Doctoral dissertation, Federal University of Technology, Akure).
Ogundolie FA, Ayodeji AO, Olajuyigbe FM, Kolawole AO and Ajele JO. (2022). Biochemical Insights into the functionality of a novel thermostable β-amylase from Dioclea reflexa. Biocatalysis and Agricultural Biotechnology, 42, 102361.
Olaniyan OT, Adetunji CO, Adeniyi MJ, Hefft DI. (2022a). Machine Learning Techniques for High-Performance Computing for IoT Applications in Healthcare. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics. DOI: 10.1201/9780367548445-20. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445.


Olaniyan OT, Adetunji CO, Adeniyi MJ, Hefft DI. (2022b). Computational Intelligence in IoT Health care. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics. DOI: 10.1201/9780367548445-19. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445.
Oyedara OO, Adeyemi FM, Adetunji CO, Elufisan TO. (2022). Repositioning Antiviral Drugs as a Rapid and Cost-Effective Approach to Discover Treatment against SARS-CoV-2 Infection. DOI: 10.1201/9781003178903-10. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903.
Papadopoulou G, Manoloudi E, Repousi N, Skoura L, Hurst T, Karamitros T. (2022). Molecular and Clinical Prognostic Biomarkers of COVID-19 Severity and Persistence. Pathogens, 11, 311. https://doi.org/10.3390/pathogens11030311.
Prabir KM, Shivani Soni, Reams RR, Verri T, Mandal A and Mishra S. (2013). Molecular Biomarkers: Tools of Medicine. Hindawi Publishing Corporation BioMed Research International, Volume 2013, Article ID 595496, 2 pages. http://dx.doi.org/10.1155/2013/595496.
Rahman AU, Dash S and Luhach AK. (2021). Dynamic MODCOD and power allocation in DVB-S2: A hybrid intelligent approach. Telecommunication Systems, 76(1), 49-61.
Rahman A, Sultan K, Dash S and Khan MA. (2018). Management of resource usage in mobile cloud computing. Int J Pure Appl Math, 119(16), 255-261.
Sahab ZJ, Semaan SM and Sang Q-X. (2017). Methodology and Applications of Disease Biomarker Identification in Human Serum. Biomark Insights, 2017; 2: 117727190700200.
Sahu P, Pinkalwar N, Dubey RD, Paroha S, Chatterjee S and Chatterjee T. (2011). Biomarkers: An emerging tool for diagnosis of a disease and drug development. Asian J Res Pharm Sci, 1, 9-16.
Sahu B, Dash S, Mohanty SN and Rout SK. (2018). Ensemble comparative study for the diagnosis of breast cancer datasets. International Journal of Engineering and Technology, 7(4.15), 281-285.
Tampa M, Sarbu MI, Mitran MI, Mitran CI, Matei C and Georgescu SR. (2018). The pathophysiological mechanisms and the quest for biomarkers in psoriasis, a stress-related skin disease. Dis Markers, 2018: 5823684. doi: 10.1155/2018/5823684.
Ziemssen T, Akgün K and Brück W. (2019). Molecular biomarkers in multiple sclerosis. Journal of Neuroinflammation (2019) 16:272. https://doi.org/10.1186/s12974-019-1674-2.
Zacharakis G, Aleid A and Aldossari KK. (2018). New and old biomarkers of hepatocellular carcinoma. Hepatoma Res, 2018; 4: 65. http://dx.doi.org/10.20517/2394-5079.2018.76.
Zhang R-F, Liu S, Wang Y-W, Li J. (2021). Potential molecular biomarkers used to predict the response to biological therapies in ulcerative colitis. Chinese Medical Journal, 2021; 134(9). 1058-1060. DOI: 10.1097/CM9.0000000000001390.
Zimmern RL. (2009). Testing challenges: Evaluation of novel diagnostics and molecular biomarkers. Clinical Medicine, Vol 9 No 1. 68-73.

Chapter 9

Systems Biology Applications and Bioinformatics

Charles Oluwaseun Adetunji1,*, Frank Abimbola Ogundolie2, Olugbemi Tope Olaniyan3, Omosigho Omoruyi Pius1, Kehinde Kazeem Kanmodi4 and Lawrence Achilles Nnyanzi4

1Applied Microbiology, Biotechnology and Nanotechnology Laboratory, Department of Microbiology, Edo University Iyamho, Auchi, Edo State, Nigeria
2Department of Biotechnology, Baze University Abuja, Nigeria
3Laboratory for Reproductive Biology and Developmental Programming, Department of Physiology, Edo University Iyamho, Nigeria
4School of Health and Life Sciences, Teesside University, Middlesbrough, United Kingdom

Abstract

Systems biology is the point at which computer science, engineering, and biology converge. It involves the computational study of the interactions within complex cells and between cells and their immediate surroundings. Applied with the aid of computational tools, systems biology has transformed several fields: it has improved agricultural outputs, deepened the understanding of pharmacology, and advanced medicine. This has led to timely and less tedious early diagnosis of various diseases, targeted medicine, enhanced natural immunity, drug design and development, and the removal of contaminants or toxins from the environment through bioremediation.

Keywords: biomedicine, systems biology, bioinformatics, bioremediation, contaminants

* Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics. Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9. © 2023 Nova Science Publishers, Inc.

Introduction

Systems biology (SB) is widely regarded as a discipline that seeks to understand the complex interactions within biological systems and between those systems and their environments. This involves understanding how the micro- and macromolecules present in cells, chemicals, and external factors interact to give rise to the emergent products generally referred to as phenotypes. Understanding how these molecules respond to environmental changes at the molecular level has been key to improving medical research. The application of mathematics to biological systems through genomics, proteomics, transcriptomics, and other omics has revolutionized this multidisciplinary field. The approach allows better understanding and prediction of how potent a vaccine will be when introduced to humans (Querec et al. 2009), and it has applications in reproductive medicine (Krawetz, 2008), personalized medicine (Auffray and Hood, 2012; Adetunji et al. 2022a-l; Olaniyan et al. 2022a, b; Oyedara et al. 2022), and cancer research (Gonzalez-Angulo et al. 2010). Systems biology can also be applied in bioremediation (de Lorenzo, 2008; Chakraborty et al. 2012; Iman et al. 2017; Jaiswal et al. 2019), agriculture (Tripathi et al. 2014; Niehl et al. 2018), and biofuel production (Rupprecht, 2009; Hädicke et al. 2011; El-Dalatony et al. 2020), among others.

Systems Biology in Medicine

In the medical and pharmaceutical industries, systems biology has over the years given pharmacology, pharmacokinetics, and pharmacognosy a significant boost, with applications ranging from drug design and drug discovery (Butcher et al. 2004; Cho et al. 2006) to drug metabolism (Bugrim et al. 2004) and personalized medicine (Chen and Snyder, 2012). The time-consuming processes of these fields have been shortened and made more effective by systems biology approaches. The application of SB in disease diagnosis has transformed the medical world with the emergence of timely and less tedious early diagnosis of various diseases, targeted medicine, enhanced natural or inherited immunity (Gardy et al. 2009), molecular epidemiology, and precision medicine (Nicholson, 2006). The pathways of several disease conditions, such as cardiovascular disease (Wheelock et al. 2009), autism (Randolph-Gips, 2011), cancer (Hornberg et al. 2006; Laubenbacher et al. 2009; Yagi et al. 2011), and refractory epilepsy (Naimo et al. 2019), have been investigated using various SB approaches.

The application of SB in drug design, discovery, development, and delivery has been reviewed by several authors. Keskin et al. (2007) reviewed the use of SB for designing drugs that target multiple proteins; Rodriguez et al. (2010) reviewed its use in developing cardiac medications and assessing their toxicity; and Pei et al. (2014) reviewed the development of new structure-based drugs. Xie et al. (2011) explored the use of SB for drug repositioning and for studying the effects of drugs that bind to target proteins other than their primary targets.
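The idea that a drug may bind proteins beyond its primary target, and that this can suggest repositioning candidates, can be illustrated with a small drug-target network. The sketch below is only an illustration: the drug names, protein names, and the similarity threshold are invented for the example, and Jaccard overlap of target sets is a simple stand-in chosen here, not the method used in the cited studies.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical drug-target annotations, invented purely for illustration.
DRUG_TARGETS = {
    "drug_A": {"P1", "P2"},
    "drug_B": {"P2", "P3"},
    "drug_C": {"P4"},
    "drug_D": {"P1", "P2", "P3"},
}

def target_index(drug_targets):
    """Invert the mapping: protein -> set of drugs annotated as binding it."""
    index = defaultdict(set)
    for drug, targets in drug_targets.items():
        for protein in targets:
            index[protein].add(drug)
    return dict(index)

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_drug_pairs(drug_targets, threshold=0.3):
    """Return drug pairs whose target sets overlap at or above a threshold."""
    pairs = []
    for d1, d2 in combinations(sorted(drug_targets), 2):
        score = jaccard(drug_targets[d1], drug_targets[d2])
        if score >= threshold:
            pairs.append((d1, d2, round(score, 3)))
    # Highest overlap first; ties keep alphabetical order (stable sort).
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)

if __name__ == "__main__":
    for d1, d2, score in similar_drug_pairs(DRUG_TARGETS):
        print(f"{d1} ~ {d2}: target overlap {score}")
```

Pairs with a high overlap share most of their annotated targets, which is one crude signal that an indication of one drug might be worth examining for the other. Real analyses would draw the annotations from curated databases and weigh binding affinity, not just set membership.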

Agriculture

Agriculture has benefited from the application of the systems biology approach across several of its sectors. Several scientists have reported the successful use of this approach in producing better crops and animals for breeding. The major abiotic stress faced by agricultural plants is soil salinity. Abrol et al. (1988) and Wicke et al. (2011) reported that about 10% of the world’s landmass and about 20% of its arable land are affected by soil salinity. To improve food security, there is a need to understand how plants interact with saline soil, including their physiology and biochemistry, and hence to find better solutions. Systems biology has been a good response to this need, as salt-tolerant plants and farm produce are now available. Borsani et al. (2003), Cramer et al. (2011), Jogaiah et al. (2013), Wang et al. (2013), Mohanta et al. (2017), and Dangi et al. (2018) reported the use of this approach to generate salt tolerance in various plant varieties. Ilangumaran and Smith (2017) and Rodriguez et al. (2019) each used a systems biology approach to understand the roles the microbiome plays in plant growth. Plant defense systems have also been studied to understand how the systems biology approach can help plants improve their defenses against pathogen attack. Burow et al. (2010), Kliebenstein (2012), Newton et al. (2012), Barah and Bones (2015), Oates et al. (2016), and Escandón et al. (2021) all used different biological systems to boost plant immunity and thus improve plant defense systems.

Bioremediation

Environmentally friendly and cost-efficient removal of toxic chemicals from the environment depends on understanding, at the molecular level, the metabolic interactions that take place within microbes and between microbes and their environment. With the aid of computational biology, this understanding has improved the processes by which plants (phytoremediation) and microbes (bioremediation) absorb and degrade or destroy toxic chemicals and pollutants in our habitats. These improvements have led to better and faster decontamination of toxic materials in the environment (Adetunji et al. 2022m-q). Chakraborty et al. (2012) reviewed the application of SB, using different computational and bioinformatics tools, to investigate the complex interactions at the molecular, subcellular, cellular, and community levels through which microbes degrade toxic wastes and chemicals in the environment. Jaiswal et al. (2019) reviewed the current state of gene editing and SB tools for removing from the environment the complex chemicals that result from pesticide application. Yadav et al. (2019) reported the application of the SB approach and functional genomics for removing both organic and inorganic pollutants.


Current Techniques Involved in Systems Biology Application and Bioinformatics

Likić et al. (2010) reported that quantitative measurements of cellular constituents at the protein, mRNA, and metabolite levels, together with in vivo metabolic response rates, can be incorporated into computational and mathematical models. The underlying data are derived through molecular biology techniques such as proteomics, genomics, transcriptomics, and metabolomics, and the resulting models give a better understanding of specific biochemical systems. The Systems Biology Markup Language (SBML) is also gaining significant attention. SBML is a machine-readable, XML-based format for encoding models of physiological processes, which supports the exchange of high-fidelity biological models between software tools. Systems biology models include signaling networks, metabolic pathways, and genome-scale networks, which can be conceptualized at the molecular level using sophisticated models.

Yudong et al. (2013) reported that advances in next-generation technologies and computational biology have driven developments in biomedicine, physiology, and drug design. Computational algorithms are used to study complex interactions such as protein-DNA, protein-RNA, and protein-protein interactions, and to validate biomarkers for signal transduction mechanisms using a multiplex gene expression profiling platform. The authors gave a framework for analyzing drug discovery and diagnosing disease mechanisms. Computational approaches in bioinformatics can also be applied to study the adverse effects of drugs, chemical-chemical interactions, and chemical-protein interactions. Bioinformatics and systems biology can today be used to improve the properties of enzymes and biocatalysts for biotechnology processes in various industries (Ogundolie, 2015; Ayodeji et al. 2017; Ogundolie, 2021; Ogundolie et al. 2022; Adetunji et al. 2023a, b).

Mridul et al. (2013) showed that in airway diseases such as asthma, cancer, and COPD, the respiratory mucosa displays a high level of inflammatory activity that remains under-explored in many situations. The authors used a multiplex gene expression profiling platform to analyze the innate cytokine pathways and the complex phenotype arising from epithelial-mesenchymal transition. They suggested that computational approaches to systems biology provide a coordinated and rapid mechanism, thus reducing coupling intervals and providing an intensive understanding of complex phenotypes.

Jun et al. (2013) revealed that computational biology methods such as dynamical modelling and network analysis have been used in systems biology to understand the pathogenesis of diseases. The application of bioinformatics tools in systems biology has brought important advances in understanding human disease networks, investigating disease mechanisms, predicting treatment response, predicting disease-linked genes, building drug-target networks, investigating adverse drug effects, predicting drug-target interactions, drug repositioning, predicting drug combinations, and drug discovery. Computational modelling in systems biology describes biomolecular processes such as intermolecular association, catalysis and covalent modification, and intracellular localization. In systems pharmacology, an approach that facilitates the understanding of drug action, computational algorithms provide a comprehensive view of the molecular interactions of drugs. Lei et al. (2013) reported the role of computational techniques in analyzing the adverse effects of drugs by combining chemical-chemical and chemical-protein interactions in pharmacological research. Fadare et al. (2021) reported the use of bioinformatics resources such as MODELLER, molecular dynamics (MD) simulation, and the Protein Data Bank (PDB) for drug discovery.

Ballereau et al. (2013) revealed that recent advances in omics technologies such as genomics, transcriptomics, epigenomics, proteomics, lipidomics, and metabolomics have created opportunities for data analysis, including clustering, prediction analysis, feature selection, pathway analysis, and text mining.
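As a concrete illustration of the dynamical modelling mentioned above, the following minimal sketch simulates intermolecular association followed by catalysis for the textbook scheme E + S <-> ES -> E + P using mass-action kinetics. It is not taken from any of the cited studies: the rate constants, initial concentrations, and function names are assumptions chosen for the example, and a hand-written fourth-order Runge-Kutta integrator is used only to keep the code free of external dependencies.

```python
# Minimal sketch: mass-action kinetics for E + S <-> ES -> E + P.
# All rate constants and initial concentrations are illustrative
# assumptions, not values from the chapter or its references.

def derivatives(state, k_on, k_off, k_cat):
    """Return d/dt of (E, S, ES, P) under mass-action kinetics."""
    e, s, es, p = state
    bind = k_on * e * s      # intermolecular association
    unbind = k_off * es      # dissociation of the complex
    convert = k_cat * es     # catalytic step releasing product
    return (
        -bind + unbind + convert,  # free enzyme
        -bind + unbind,            # substrate
        bind - unbind - convert,   # enzyme-substrate complex
        convert,                   # product
    )

def rk4_step(state, dt, *params):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, factor):
        return tuple(b + factor * k for b, k in zip(base, slope))

    k1 = derivatives(state, *params)
    k2 = derivatives(shifted(state, k1, dt / 2), *params)
    k3 = derivatives(shifted(state, k2, dt / 2), *params)
    k4 = derivatives(shifted(state, k3, dt), *params)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def simulate(initial, t_end, dt, k_on=1.0, k_off=0.5, k_cat=0.3):
    """Integrate from t = 0 to t_end and return the list of visited states."""
    steps = int(round(t_end / dt))
    trajectory = [tuple(initial)]
    for _ in range(steps):
        trajectory.append(rk4_step(trajectory[-1], dt, k_on, k_off, k_cat))
    return trajectory

if __name__ == "__main__":
    traj = simulate(initial=(1.0, 10.0, 0.0, 0.0), t_end=50.0, dt=0.01)
    e, s, es, p = traj[-1]
    print(f"E={e:.3f} S={s:.3f} ES={es:.3f} P={p:.3f}")
```

Two quantities are conserved by construction, total enzyme (E + ES) and total substrate-derived material (S + ES + P), which gives a quick sanity check on any such model. In practice a model like this would usually be written in SBML and run with a dedicated simulator instead of a hand-coded integrator.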
Systems biology modelling tasks such as de novo genome assembly, identification of co-expressed or differentially expressed genes at the protein or transcript level, genome annotation, and the inference of protein-protein interaction networks are now integrated with bioinformatics tools that can unravel the molecular pathways underlying physiological processes. Ki-Bong (2010) reported that the massive generation of biological data has become a challenge for computer analysis. The incorporation of bioinformatics techniques provides improved methods for systems biology. Computational skills relevant to bioinformatics include Linux or UNIX, Perl, HTML, Python, database management, data storage, data representation, data and pattern mining, biological data interoperability, statistics, machine learning, probability, visualization, optimization, numerical methods, and differential and integral calculus. Metabolites and enzymes involved in signal transduction, gene regulation, and the metabolic functions of cellular processes can now be analyzed effectively using big data with bioinformatics tools (Ibiam and Ekwe, 2012).

Feilim et al. (2010) reported that systems biology uses next-generation approaches such as in silico modelling and computational techniques to analyze complex disease-host-therapeutic interactions. The authors revealed that applying systems biology can improve the understanding of gene delivery and the prediction of therapeutic design, biomarkers, and adverse effects. Recent advances in systems biology and in computational approaches to drug discovery, drug screening, and validation have resulted in the physiological characterization of disease mechanisms. Carr et al. (2015) described various emerging systems biology techniques and computational tools deployed for the comprehensive analysis of big biological datasets. Screening simulation and in silico mutagenesis are computational approaches designed for large-scale mutant modelling. Mwololo et al. (2010) reported that the interpretation of large-scale biological data can be managed effectively using bioinformatics tools. Functional genomics techniques, microarrays, and transcriptome profiling can map DNA sequences, generate gene annotations, and compare cellular mRNA levels.
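The in silico mutagenesis mentioned above can be sketched, in its simplest form, as enumerating every single-nucleotide substitution of a sequence and scoring each mutant against the wild type. The example below is a hedged illustration only: the function names are hypothetical, and GC content is used as a toy score standing in for whatever predictive model a real screen would apply.

```python
# Sketch: enumerate every single-nucleotide substitution of a DNA sequence
# and rank the mutants by a score. The scoring function is a toy placeholder.

NUCLEOTIDES = "ACGT"

def single_point_mutants(sequence):
    """Yield (position, reference, alternative, mutant_sequence) tuples."""
    sequence = sequence.upper()
    for pos, ref in enumerate(sequence):
        if ref not in NUCLEOTIDES:
            raise ValueError(f"unexpected symbol {ref!r} at position {pos}")
        for alt in NUCLEOTIDES:
            if alt != ref:
                yield pos, ref, alt, sequence[:pos] + alt + sequence[pos + 1:]

def gc_content(sequence):
    """Toy score: fraction of G or C bases (stands in for a real model)."""
    return sum(base in "GC" for base in sequence) / len(sequence)

def screen(sequence, score=gc_content):
    """Score every mutant and rank by change relative to the wild type."""
    baseline = score(sequence.upper())
    results = [
        (f"{ref}{pos + 1}{alt}", score(mutant) - baseline)
        for pos, ref, alt, mutant in single_point_mutants(sequence)
    ]
    # Largest increase first; ties keep enumeration order (stable sort).
    return sorted(results, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for label, delta in screen("ACGT")[:3]:
        print(label, f"{delta:+.2f}")
```

A sequence of length n yields 3n single-point mutants, so exhaustive screening scales linearly for single substitutions but grows combinatorially once double or triple mutants are considered, which is why large-scale mutant modelling relies on computational screening.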

Conclusion

In this chapter, we have reviewed the use of systems biology and bioinformatics in the development and management of several industries, including the use of bioinformatics tools to improve agricultural outputs and to enable better and more precise diagnoses in the health sector. Understanding cellular interactions through the omics approach has been valuable to the evolving field of systems biology.

References

Abrol, I. P., Yadav, J. S. P. and Massoud, F. I. (1988). Salt-affected soils and their management. FAO Soils Bulletin No. 39. Rome, Food and Agriculture Organization of the United Nations. 131 pp.
Adetunji, C. O., Inobeme, A., Tadso, J., Olaniyan, O. T., Abimbola, O. F., Shahnawaz, M. and Anani, O. (2022a). Potential of Plastic Waste in Enhancing the level of Pathogenicity of diverse Pathogens in the Marine Biota. In Impact of Plastic Waste on the Marine Biota (pp. 301-312). Springer, Singapore.
Adetunji, C. O., Nwankwo, W., Olayinka, A. S., Olugbemi, O. T., Akram, M., Laila, U., Samuel, M. O., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E. and Esiobu, N. D. (2022b). Computational Intelligence Techniques for Combating COVID-19. doi: 10.1201/9781003178903-16. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics (pp. 251-269). CRC Press.
Adetunji, C. O., Nwankwo, W., Olayinka, A. S., Olugbemi, O. T., Akram, M., Laila, U., Olugbenga, M. S., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E. and Esiobu, N. D. (2022c). Machine Learning and Behaviour Modification for COVID-19. doi: 10.1201/9781003178903-17. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 17. eBook ISBN 9781003178903.
Adetunji, C. O., Ogundolie, F. A., Ajiboye, M. D., Mathew, J. T., Inobeme, A., Dauda, W. P. and Adetunji, J. B. (2022d). Nano-engineered Sensors for Food Processing. In Bio- and Nano-sensing Technologies for Food Processing and Packaging (pp. 151-166). Royal Society of Chemistry. doi: 10.1039/9781839167966-00151.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Garipova, L. and Shariati, M. A. (2022e). eHealth, mHealth, and Telemedicine for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_10.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Isabekova, O. and Shariati, M. A. (2022f). Smart Sensing for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_9.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Petukhova, E. and Shariati, M. A. (2022g). Internet of Health Things (IoHT) for COVID-19. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_5.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Koriagina, N. and Shariati, M. A. (2022h). Diverse Techniques Applied for Effective Diagnosis of COVID-19. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_3.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Petukhova, E. and Shariati, M. A. (2022i). Machine Learning Approaches for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_8.
Adetunji, C. O., Olugbemi, O. T., Akram, M., Laila, U., Samuel, M. O., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E., Esiobu, N. D., Oyedara, O. O. and Adeyemi, F. M. (2022j). Application of Computational and Bioinformatics Techniques in Drug Repurposing for Effective Development of Potential Drug Candidate for the Management of COVID-19. doi: 10.1201/9781003178903-15. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 14. eBook ISBN 9781003178903.
Adetunji, C. O., Samuel, M. O., Adetunji, J. B. and Oluranti, O. I. (2022k). Corn Silk and Health Benefits. doi: 10.1201/9781003178903-11. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903.
Adetunji, C. O., Mathew, J. T., Inobeme, A., Olaniyan, O. T., Singh, K. R. B., Abimbola, O. F., Nayak, V., Singh, J. and Singh, R. P. (2022l). Microbial and Plant Cell Biosensors for Environmental Monitoring. In: Singh, R. P., Ukhurebor, K. E., Singh, J., Adetunji, C. O., Singh, K. R. (eds) Nanobiosensors for Environmental Monitoring. Springer, Cham. https://doi.org/10.1007/978-3-031-16106-3_9.
Adetunji, C. O., Inobeme, A., Singh, K. R., Bodunrinde, R. E., Mathew, J. T., Olaniyan, O. T., Abimbola, O. F., Singh, J., Nayak, V. and Singh, R. P. (2022m). Genomic Analysis of Heavy Metal-Resistant Genes in Wastewater Treatment Plants. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 113-126). CRC Press.
Adetunji, C. O., Abimbola, O. F., Singh, K. R., Olaniyan, O. T., Bodunrinde, R. E., Inobeme, A., Mathew, J. T., Singh, J. and Singh, R. P. (2022n). Microbe Performance and Dynamics in Activated Sludge Digestion.
In Microbial Community Studies in Industrial Wastewater Treatment (pp. 99-112). CRC Press. https://doi.org/10.1201/ 9781003354147 Adetunji, C.O., Ogundolie, F.A., Ajiboye, M.D., Mathew, J.T., Inobeme, A., Titilayo, O., Olaniyan, O.T., Ijabadeniyi, O.A., Ajayi, O.O., Dauda, W.P. and Ghazanfar, S., (2022o). Bio-and Nanosensors in the Food Industry. In Bio-and Nano-sensing Technologies for Food Processing and Packaging (pp. 22-36). Royal Society of Chemistry. Adetunji, C.O., Mathew, J.T., Singh, K.R., Bodunrinde, R.E., Inobeme, A., Olaniyan, O.T., Abimbola, O.F., Singh, J., Nayak, V. and Singh, R.P., (2022p) Molecular Characterization of Multidrug-Resistant Genes in Wastewater Treatment Plants. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 127-141). CRC Press. Adetunji, C.O., Inobeme, A., Tadso, J., Olaniyan, O.T., Abimbola, O.F., Shahnawaz, M. and Anani, O., (2022q). Potential of Plastic Waste in Enhancing the level of Pathogenicity of diverse Pathogens in the Marine Biota. In Impact of Plastic Waste on the Marine Biota (pp. 301-312). Singapore: Springer Nature Singapore. https://doi.org/10.1007/978-981-16-5403-9_16

244 C. Oluwaseun Adetunji, F. Abimbola Ogundolie, O. Tope Olaniyan et al. Adetunji, C.O., Ogundolie, F.A., Mathew, J.T., Inobeme, A., Titilayo, O., Olaniyan, O.T., Ijabadeniyi, O.A., Ajiboye, M.D., Ajayi, O.O., Dauda, W. and Ghazanfar, S., (2023a). Graphene-based nanomaterials for targeted drug delivery and tissue engineering. Novel Platforms for Drug Delivery Applications, pp.277-288. https://doi.org/10.1016/ B978-0-323-91376-8.00014-8 Adetunji, C.O., Ogundolie, F.A., Mathew, J.T., Inobeme, A., Titilayo, O., Olaniyan, O.T., Ghazanfar, S., Ijabadeniyi, O.A., Ajiboye, M.D., Ajayi, O.O. and Dauda, W., (2023b). Nanotube platforms for effective drug delivery applications. Novel Platforms for Drug Delivery Applications, pp.317-332. https://doi.org/10.1016/B978-0-323-913768.00005-7 Auffray, C., and Hood, L. (2012). Systems biology and personalized medicine–the future is now. Biotechnology Journal, 7(8), 938-939. Ayodeji, A. O., Ogundolie, F. A., Bamidele, O. S., Kolawole, A. O., & Ajele, J. O. (2017). Raw starch degrading, acidic-thermostable glucoamylase from Aspergillus fumigatus CFU-01 : purification and characterization for biotechnological application. J Microbiol Biotechnol, 6, 90-100. Ballereau, S., Glaab, E., Kolodkin, A., Chaiboonchoe, A., Biryukov, M., Vlassis, N., Ahmed, H., Pellet, J., Baliga, N., Hood, L. and Schneider, R., (2013). Functional genomics, proteomics, metabolomics and bioinformatics for systems biology. Systems Biology: Integrative Biology and Simulation Tools, pp.3-41. Barah, P., and Bones, A. M. (2015). Multidimensional approaches for studying plant defence against insects: from ecology to omics and synthetic biology. Journal of experimental botany, 66(2), 479-493. Behera, R. N., Roy, M., & Dash, S. (2016). Ensemble-based hybrid machine learning approach for sentiment classification-a review. International Journal of Computer Applications, 146(6), 31-36. Borsani, O., Valpuesta, V., and Botella, M. A. (2003). 
Developing salt-tolerant plants in a new century : a molecular biology approach. Plant Cell, Tissue and Organ Culture, 73(2), 101-115. Bugrim, A., Nikolskaya, T., & Nikolsky, Y. (2004). Early prediction of drug metabolism and toxicity: a systems biology approach and modelling. Drug discovery today, 9(3), 127-135. Burow, M., Halkier, B. A., and Kliebenstein, D. J. (2010). Regulatory networks of glucosinolates shape Arabidopsis thaliana fitness. Current opinion in plant biology, 13(3), 347-352. Butcher, E. C., Berg, E. L., and Kunkel, E. J. (2004). Systems biology in drug discovery. Nature Biotechnology, 22(10), 1253-1259. Carr SM, HD Marshall, T Wareham, D Craig. (2015). The Big ORF Theory : Algorithmic, Computational, and Approximation Approaches to Open Reading Frames in Shortand Medium-Length dsDNA Sequences Emerging Trends in Computational Biology, Bioinformatics, and Systems Biology. Chapter 13. Pp 265-274. Chakraborty, R., Wu, C. H., & Hazen, T. C. (2012). Systems biology approach to bioremediation. Current Opinion in Biotechnology, 23(3), 483-490. Chen, R., and Snyder, M. (2012). Systems biology: personalized medicine for the future ? Current opinion in pharmacology, 12(5), 623-628.

Systems Biology Applications and Bioinformatics

245

Cho, C. R., Labow, M., Reinhardt, M., van Oostrum, J., and Peitsch, M. C. (2006). The application of systems biology to drug discovery. Current opinion in chemical biology, 10(4), 294-302. Cramer, G. R., Urano, K., Delrot, S., Pezzotti, M., and Shinozaki, K. (2011). Effects of abiotic stress on plants : a systems biology perspective. BMC plant biology, 11(1), 114. Dangi, A. K., Sharma, B., Khangwal, I., and Shukla, P. (2018). Combinatorial interactions of biotic and abiotic stresses in plants and their molecular mechanisms: systems biology approach. Molecular biotechnology, 60(8), 636-650. Dash, S., & Abraham, A. (2018, December). Kernel-based chaotic firefly algorithm for diagnosing Parkinson’s disease. In International Conference on Hybrid Intelligent Systems (pp. 176-188). Springer, Cham. Dash, S., Abraham, A., Luhach, A. K., Mizera-Pietraszko, J., & Rodrigues, J. J. (2020). Hybrid chaotic firefly decision-making model for Parkinson’s disease diagnosis. International Journal of Distributed Sensor Networks, 16(1), 1550147719895210. Dash, S., Ahmad, M., & Iqbal, T. (2021). Mobile cloud computing: a green perspective. In Intelligent Systems. vol. 185, pp :523-533, Springer, Singapore. http://doi.org/ 10.1007/978- 981-33-6081-5-46. Dash, S., Thulasiram, R., & Thulasiraman, P. (2017, December). An enhanced chaos-based firefly model for Parkinson’s disease diagnosis and classification. In 2017 International Conference on Information Technology (ICIT) (pp. 159-164). IEEE. Dash, S., Thulasiram, R., & Thulasiraman, P. (2019). Modified firefly algorithm with chaos theory for feature selection: A predictive model for medical data. International Journal of Swarm Intelligence Research (IJSIR), 10(2), 1-20. de Lorenzo, V. (2008). Systems biology approaches to bioremediation. Current opinion in biotechnology, 19(6), 579-589. El-Dalatony, M. M., Zheng, Y., Ji, M. K., Li, X., and Salama, E. S. (2020). 
Metabolic pathways for microalgal biohydrogen production : Current progress and future prospectives. Bioresource Technology, 318, 124253. Escandón, M., Castillejo, M. Á., Jorrín-Novo, J. V., & Rey, M. D. (2021). Molecular Research on Stress Responses in Quercus spp.: From Classical Biochemistry to Systems Biology through Omics Analysis. Forests, 12(3), 364. Fadare, O. A., Omisore, N. O., Adegbite, O. B., Awofisayo, O. A., Ogundolie, F. A., Adesanwo, J. K., & Obafemi, C. A. (2021). Structure based design, stability study and synthesis of the dinitrophenylhydrazone derivative of the oxidation product of lanosterol as a potential P. falciparum transketolase inhibitor and in-vivo antimalarial study. In Silico Pharmacology, 9(1), 1-16. Feilim Mac Gabhann, Brian H. Annex, and Aleksander S. Popel (2010) Gene Therapy from the perspective of Systems Biology. Curr Opin Mol Ther. 2010 October; 12(5) : 570– 577. Gardy, J. L., Lynn, D. J., Brinkman, F. S., & Hancock, R. E. (2009). Enabling a systems biology approach to immunology: focus on innate immunity. Trends in immunology, 30(6), 249-262.

246 C. Oluwaseun Adetunji, F. Abimbola Ogundolie, O. Tope Olaniyan et al. Gonzalez-Angulo, A. M., Hennessy, B. T., and Mills, G. B. (2010). Future of personalized medicine in oncology: a systems biology approach. Journal of clinical oncology, 28(16), 2777. Hädicke, O., Grammel, H., and Klamt, S. (2011). Metabolic network modelling of redox balancing and biohydrogen production in purple nonsulfur bacteria. BMC systems biology, 5(1), 1-18. Hornberg, J. J., Bruggeman, F. J., Westerhoff, H. V., & Lankelma, J. (2006). Cancer : a systems biology disease. Biosystems, 83(2-3), 81-90. http://dx.doi.org/10.1155/ 2013/505864. Ibiam O. F. A. and Ekwe A. (2012) Bioinformatics and scientific research. International Research Journal of Biotechnology. Vol. 3(9) pp. 141-151. Ilangumaran, G., and Smith, D. L. (2017). Plant growth-promoting rhizobacteria in amelioration of salinity stress : a systems biology perspective. Frontiers in Plant Science, 8, 1768. Iman, M., Sobati, T., Panahi, Y., and Mobasheri, M. (2017). Systems biology approach to bioremediation of nitroaromatics: constraint-based analysis of 2, 4, 6-trinitrotoluene biotransformation by Escherichia coli. Molecules, 22(8), 1242. Jaiswal, S., Singh, D. K., & Shukla, P. (2019). Gene editing and systems biology tools for pesticide bioremediation: a review. Frontiers in microbiology, 10, 87. Jaiswal, S., Singh, D. K., and Shukla, P. (2019). Gene editing and systems biology tools for pesticide bioremediation: a review. Frontiers in microbiology, 10, 87. Jogaiah, S., Govind, S. R., and Tran, L. S. P. (2013). Systems biology-based approaches toward understanding drought tolerance in food crops. Critical Reviews in Biotechnology, 33(1), 23-39. Keskin, O., Gursoy, A., Ma, B., & Nussinov, R. (2007). Towards drugs targeting multiple proteins in a systems biology approach. Current topics in medicinal chemistry, 7(10), 943-951. Ki-Bong Kim (2010) Bioinformatics: Latest Application and Interdisciplinary Field of Computer Science. Vol. 
11, No. 3, pp. 971-977, Kliebenstein, D. J. (2012). Plant defense compounds : systems approaches to metabolic analysis. Annual review of phytopathology, 50, 155-173. Krawetz S. A. (2008). SBiRM: Systems Biology in Reproductive Medicine. Systems biology in reproductive medicine, 54(1), 1–2. https://doi.org/10.1080/19396360701 883282. Laubenbacher, R., Hower, V., Jarrah, A., Torti, S. V., Shulaev, V., Mendes, P., Torti, F. M. and Akman, S., (2009). A systems biology view of cancer. Biochimica et Biophysica Acta (BBA)-Reviews on Cancer, 1796(2), 129-139. Lei Chen, Tao Huang, Jian Zhang, Ming-Yue Zheng, Kai-Yan Feng, Yu-Dong Cai, and Kuo-Chen Chou (2013) Predicting Drugs Side Effects Based on Chemical-Chemical Interactions and Protein-Chemical Interactions. Hindawi Publishing Corporation BioMed Research International Volume 2013, Article ID 485034, 8 pages. http://dx.doi.org/10.1155/2013/485034. Likić, V. A., McConville, M. J., Lithgow, T., and Bacic, A. (2010). Systems biology: the next frontier for bioinformatics. Advances in bioinformatics, 2010. Article ID 268925, 10 pages. doi :10.1155/2010/268925. Article ID 268925, 10 pages.

Systems Biology Applications and Bioinformatics

247

Mohanta, T. K., Bashir, T., Hashem, A., and Abd_Allah, E. F. (2017). Systems biology approach in plant abiotic stresses. Plant physiology and biochemistry, 121, 58-73. Mridul Kalita, Bing Tian, Boning Gao, Sanjeev Choudhary, Thomas G. Wood, Joseph R. Carmical, Istvan Boldogh, Sankar Mitra, John D. Minna, and Allan R. Brasier (2013) Systems Approaches to Modeling Chronic Mucosal Inflammation. Hindawi Publishing Corporation BioMed Research International Volume 2013, Article ID 505864, 17 pages. Mwololo, J. K., Munyua, J. K., Muturi, P. W.and Munyiri, S. W. (2010) An overview of advances in bioinformatics and its application in functional genomics. Journal of Animal & Plant Sciences. Vol. 6, Issue 3 : 645- 652. Naimo, G. D., Guarnaccia, M., Sprovieri, T., Ungaro, C., Conforti, F. L., Andò, S., and Cavallaro, S. (2019). A systems biology approach for personalized medicine in refractory epilepsy. International journal of molecular sciences, 20(15), 3717. Newton, A. C., Torrance, L., Holden, N., Toth, I. K., Cooke, D. E., Blok, V., and Gilroy, E. M. (2012). Climate change and defense against pathogens in plants. Advances in applied microbiology, 81, 89-132. Nicholson, J. K. (2006). Global systems biology, personalized medicine and molecular epidemiology. Molecular systems biology, 2(1), 52. Niehl, A., Soininen, M., Poranen, M. M., & Heinlein, M. (2018). Synthetic biology approach for plant protection using ds RNA. Plant biotechnology journal, 16(9), 16791687. Oates, C. N., Denby, K. J., Myburg, A. A., Slippers, B., and Naidoo, S. (2016). Insect gallers and their plant hosts : from omics data to systems biology. International journal of molecular sciences, 17(11), 1891. Ogundolie, F. A. (2015). Characterization of a Purified β–Amylase from Black Marble Vine (Dioclea reflexa) Seeds (Doctoral dissertation, Federal University of Technology, Akure). Ogundolie, F. A. (2021). Cloning of α-AMYLASE and Pullulanase Genes of Bacillus licheniformis-FAO. 
CP7 from cocoa (Theobroma cacao L.) Pods and Biochemical Characterization of the Expressed Enzymes (Doctoral dissertation, Federal University of Technology, Akure). Ogundolie, F. A., Ayodeji, A. O., Olajuyigbe, F. M., Kolawole, A. O., & Ajele, J. O. (2022). Biochemical Insights into the functionality of a novel thermostable β-amylase from Dioclea reflexa. Biocatalysis and Agricultural Biotechnology, 42, 102361. Olaniyan Olugbemi T., Adetunji Charles O., Adeniyi Mayowa J., Hefft Daniel Ingo. 2022a. Machine Learning Techniques for High-Performance Computing for IoT Applications in Healthcare. In book : Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics doi: 10.1201/9780367548445-20 Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445. Olaniyan Olugbemi T., Adetunji Charles O., Adeniyi Mayowa J., Hefft Daniel Ingo. In. Computational Intelligence in IoT Healthcare. 2022 b. In book : Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics doi: 10.1201/9780367548445-19. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445.

248 C. Oluwaseun Adetunji, F. Abimbola Ogundolie, O. Tope Olaniyan et al. Omotayo Opemipo Oyedara, Folasade Muibat Adeyemi, Charles Oluwaseun Adetunji, Temidayo Oluyomi Elufisan. 2022. Repositioning Antiviral Drugs as a Rapid and Cost-Effective Approach to Discover Treatment against SARS-CoV-2 Infection. doi: 10.1201/9781003178903-10. In book : Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903. Pei, J., Yin, N., Ma, X., & Lai, L. (2014). Systems biology brings new dimensions for structure-based drug design. Journal of the American Chemical Society, 136(33), 11556-11565. Querec, T. D., Akondy, R. S., Lee, E. K., Cao, W., Nakaya, H. I., Teuwen, D., Pirani, A., Gernert, K., Deng, J., Marzolf, B. and Kennedy, K., (2009). Systems biology approach predicts immunogenicity of the yellow fever vaccine in humans. Nature immunology, 10(1), pp. 116-125. Rahman, A. U., Dash, S., & Luhach, A. K. (2021). Dynamic MODCOD and power allocation in DVB-S2 : a hybrid intelligent approach. Telecommunication Systems, 76(1), 49-61. Rahman, A., Sultan, K., Dash, S., & Khan, M. A. (2018). Management of resource usage in mobile cloud computing. Int J Pure Appl Math, 119(16), 255-261. Randolph-Gips, M. (2011, July). Autism: A systems biology disease. In 2011 IEEE First International Conference on Healthcare Informatics, Imaging and Systems Biology (pp. 359-366). IEEE. Reinhard Schneider, Rudi Balling and Charles Auffray (2013) Functional Genomics, Proteomics, Metabolomics and Bioinformatics for Systems Biology. Chapter 1. Systems Biology, doi: 10.1007/978-94-007-6803-1_1. Rodriguez, B., Burrage, K., Gavaghan, D., Grau, V., Kohl, P., & Noble, D. (2010). The systems biology approach to drug development: application to toxicity assessment of cardiac drugs. Clinical Pharmacology & Therapeutics, 88(1), 130-134. Rodriguez, P. A., Rothballer, M., Chowdhury, S. 
P., Nussbaumer, T., Gutjahr, C., and Falter-Braun, P. (2019). Systems biology of plant-microbiome interactions. Molecular plant, 12(6), 804-821. Rupprecht, J. (2009). From systems biology to fuel—Chlamydomonas reinhardtii as a model for a systems biology approach to improve biohydrogen production. Journal of Biotechnology, 142(1), 10-20. Sahu, B., Dash, S., Mohanty, S. N., & Rout, S. K. (2018). Ensemble comparative study for the diagnosis of breast cancer datasets. International Journal of Engineering & Technology, 7(4.15), 281-285. Tripathi, P., Rabara, R. C., and Rushton, P. J. (2014). A systems biology perspective on the role of WRKY transcription factors in drought responses in plants. Planta, 239(2), 255-266. Wang, J., Chen, L., Wang, Y., Zhang, J., Liang, Y., and Xu, D. (2013). A computational systems biology study for understanding salt tolerance mechanism in rice. PLoS One, 8(6), e64929. Wheelock, C. E., Wheelock, Å. M., Kawashima, S., Diez, D., Kanehisa, M., van Erk, M., Kleemann, R., Haeggström, J. Z. and Goto, S., (2009). Systems biology approaches

Systems Biology Applications and Bioinformatics

249

and pathway tools for investigating cardiovascular disease. Molecular BioSystems, 5(6), 588-602. Wicke, B., Smeets, E., Dornburg, V., Vashev, B., Gaiser, T., Turkenburg, W. and Faaij, A. (2011). The global technical and economic potential of bioenergy from salt-affected soils. Energy & Environmental Science, 4(8) : 2669. https://doi.org/10.1039/c1ee 01029h. Xie, L., Xie, L., & Bourne, P. E. (2011). Structure-based systems biology for analyzing offtarget binding. Current opinion in structural biology, 21(2), 189-199. Yadav, S., Bhardwaj, Y., & Singh, A. (2019). Functional genomics and system biology approaches in bioremediation of soil and water from organic and inorganic pollutants. In Microbial Genomics in Sustainable Agroecosystems (pp. 1-20). Springer, Singapore. Yagi, H., Tan, W., Dillenburg-Pilla, P., Armando, S., Amornphimoltham, P., Simaan, M., Weigert, R., Molinolo, A. A., Bouvier, M. and Gutkind, J. S., (2011). A synthetic biology approach reveals a CXCR4-G13-Rho signaling axis driving transendothelial migration of metastatic breast cancer cells. Science signaling, 4(191) 60-ra60. Yudong C., Tao H., Lei C., and Bin N. (2013) Application of Systems Biology and Bioinformatics Methods in Biochemistry and Biomedicine. Hindawi Publishing Corporation BioMed Research International Volume 2013, Article ID 651968, 2 pages. http://dx.doi.org/10.1155/2013/651968.

Chapter 10

Genome Data Resources and Tools for Sequence Analysis

Sandesh Behera1,*, Tikshana Yadav2,*, Surendra Pratap Singh2,† and Hrudayanath Thatoi1,‡

1Department of Biotechnology, Maharaja Sriram Chandra Bhanja Deo University, Takatpur, Baripada, Odisha, India
2Plant Molecular Biology Laboratory, Department of Botany, Dayanand Anglo-Vedic (PG) College, Chhatrapati Shahu Ji Maharaj University, Kanpur, India

Abstract

Bioinformatics is the study of omics data from organisms. It involves creating statistical analysis software, techniques, and tools that analyze and interpret data to extract biological knowledge. Many databases, such as those of the National Center for Biotechnology Information (NCBI), the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ), provide access to biological and genetic data from members of the International Nucleotide Sequence Database Collaboration (INSDC). These databases include data ranging from raw reads to sequence alignments and assemblies to functional annotation and are enriched with information about samples and experiments. The study of an organism’s entire genetic makeup is referred to as genomics. The introduction of high-throughput sequencing techniques in this genomic era has resulted in a massive amount of genomic data being generated on a regular basis. The effective use of tools and databases aids in the analysis of genetic data and diversity. Genetic Diversity, Next-Generation Sequencing (NGS) Genetic Markers, Gene Mapping, Genotyping, Genome-Wide Interaction, Genomic Selection (GS), Multiparent Advanced Generation Inter-Cross (MAGIC), and other techniques are grouped into different categories. These databases and tools are commonly used in genomics research.

* These authors contributed equally to the article.
† Corresponding Author’s Email: [email protected].
‡ Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9 © 2023 Nova Science Publishers, Inc.

Keywords: bioinformatics tools, databases, genomics, high-throughput sequencing, next-generation sequencing

Introduction to Bioinformatics

The National Center for Biotechnology Information (NCBI) defines bioinformatics as “the technical discipline where biological science, computational analysis, and data science merge into a discipline.” Bioinformatics is the study of statistical techniques and procedures, as well as analytical computational software, for the collection, storage, analysis, and display of biological data. The discipline covers three activities: the invention of statistical techniques and new algorithms for assessing correlations among broad biological datasets; the application of these tools and techniques to the analysis and interpretation of huge biological datasets; and the construction of databases for effective storage and management of large datasets, with quick search, retrieval, and analysis of the essential information. The need for new tools and approaches to deal with the massive volumes of data being created prompted the development of bioinformatics.

Role of Genomics

The identification of DNA as the genetic material, the determination of DNA structure, the deciphering of the genetic code, the growth of genetic engineering, and the establishment of DNA sequencing technology all helped to pave the way for multiple genome projects. These technologies have changed the study of virtually all living organisms, and as a result large and complex genomic datasets have accumulated in public databases. These datasets take biological and biomedical research to a whole new level. Interconnected advances in genetics, comparative genomics, and bioinformatics have made many research tools available, which allow the functioning of organisms in diverse fields to be studied at a previously unexplored level of molecular detail. As more genomes are sequenced, the elucidation of genome function is expanding. De novo and reference-based genome assembly are two fundamental objectives of next-generation sequence analysis. The assembly process is intricate, and producing high-quality assemblies requires a thorough understanding of the available methodologies and their characteristics. Only a small fraction of the genome encodes genes, and these genes are surrounded by repetitive DNA sequences that are difficult to assemble. The ability to generate larger amounts of paired-read sequence data, in combination with bioinformatics tools, provides the foundation for sequencing large and complex genomes.

Tools for Genomics Research

High-throughput technologies generate unprecedented amounts of sequence data, and bioinformatics tools are needed to make use of them. This requires computer programs that can acquire, analyze, classify, and store huge amounts of data and provide access to the stored data. In response, appropriate computer programs were written and suitable data management and storage systems were implemented. Different programs are used for the acquisition, detection, and analysis of the data, and a large number of tools have been designed for objectives such as data storage and management and association and pattern analysis. Genome sequencing became practical with next-generation technologies that produce ample data quickly and at a reasonable cost. The data must be interpreted in order to comprehend their biological meaning, which is a significant task. Innumerable studies have been carried out to understand the information in a DNA sequence, yet gene functions, gene products, and their interactions remain to be determined. Various tools for genome analysis are as follows:

FastQC

FastQC is a tool used to check that high-throughput sequence data are of good quality. It provides a set of analyses that give a quick impression of the data. The information is extracted from BAM, SAM, or FastQ files; the tool highlights problematic areas in the data and provides graphs and tables as a summary. High-throughput sequencers produce billions of sequences at a time, so it is important to check for biases in the data and to ensure that the raw data are good enough to proceed with further analysis.


Figure 1. Results of a FastQC analysis. (a) Basic statistics: the data were sequenced on an Illumina platform, the total number of sequences is 33083619, no sequences are flagged as poor quality, the sequence length is 100 bp, and the GC content is 44%. (b) Per-base sequence quality: the range of quality scores at each position across the reads in the FastQ file; the bars lie in the green range, so the quality scores are good. (c) Per-tile sequence quality (Illumina libraries only): the quality score from each tile across all bases on a scale from blue to red; green and blue indicate above-average quality, while orange and red indicate poor quality. The warning (exclamation mark on a yellow circle) indicates that a tile has a mean Phred score more than 2 below the mean for that base across all tiles. (d) Per-sequence quality scores, which show whether a subset of sequences has universally low quality: the mean sequence quality is 33, and very few sequences are of poor quality. (e) Per-base sequence content: the proportion of each base (A, T, G, C) at each position; the failure (X on a red circle) indicates that at positions 1 to 13 the difference between A and T, and between G and C, is greater than 20%. (f) Per-sequence GC content: the distribution of GC content across the whole length of each sequence, compared with a normal distribution; the central peak corresponds to the overall GC content of the data, which is 44%. (g) Sequence length distribution: high-throughput sequencers generate fragments of uniform length, and the graph shows the distribution of fragment sizes; every sequence in the file is 100 bp long. (h) Per-base N content: the percentage of base calls at each position for which N was called; there is no N content. (i) Overrepresented sequences: the table lists every sequence that makes up more than 0.1% of the total, which raises a warning. (j) K-mer content: k-mers with uneven coverage along the length of the reads, which can reveal sources of bias in the library such as adapter sequence at the read ends; the plot depicts the cumulative percentage of the library.


FastQC provides a quality control report that can detect problems originating in the sequencer or in the starting library. It can be run in a standalone mode for immediate examination of a small number of FastQ files, or in a non-interactive mode for systematic processing of a large number of files. It consists of a series of analysis modules. The outcomes of the analysis are presented as pass/fail grades, which are based on what is expected of a normal library, that is, a sample that is diverse and random. The summary evaluation should be used as a starting point for learning about the library. The tool can be downloaded from https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (Andrews et al., 2010).
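The kind of summary reported in the basic statistics panel can be illustrated with a short Python sketch. This is not FastQC's implementation; the function names `read_fastq` and `summarize`, the Phred+33 quality encoding, and the two toy reads are assumptions made only for illustration.

```python
# Illustrative sketch only; this is not how FastQC itself is implemented.
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality string) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def summarize(handle, phred_offset=33):
    """Return the read count, GC percentage, and mean Phred score of a FASTQ."""
    n_reads = n_bases = gc = qual_sum = 0
    for _, seq, qual in read_fastq(handle):
        n_reads += 1
        n_bases += len(seq)
        gc += sum(seq.upper().count(base) for base in "GC")
        # Assumed encoding: Phred score = ASCII code minus the offset (33).
        qual_sum += sum(ord(char) - phred_offset for char in qual)
    return {
        "reads": n_reads,
        "gc_percent": 100.0 * gc / n_bases if n_bases else 0.0,
        "mean_quality": qual_sum / n_bases if n_bases else 0.0,
    }


# Two hypothetical reads: 'I' encodes Phred 40 and '5' encodes Phred 20.
example = StringIO("@r1\nGGCC\n+\nIIII\n@r2\nATAT\n+\n5555\n")
stats = summarize(example)
print(stats)
```

For real data the same function could be given an open file handle instead of the in-memory example. FastQC additionally reports these quantities per position and per tile, which this sketch does not attempt.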

GeneWise

The tool is available at https://www.ebi.ac.uk/Tools/psa/genewise/. GeneWise compares a protein sequence to a genomic DNA sequence, allowing for introns and frameshift errors, and so helps to check for errors when a DNA sequence does not match the expected protein sequence. Figure 2 shows the GeneWise output for the ETR protein and DNA sequence. The tool can be accessed through an online web tool, an API, an OpenAPI interface, and the Common Workflow Language. Input sequences can be in GCG, FASTA, GenBank, PIR, NBRF, PHYLIP, or UniProtKB/Swiss-Prot format. In the underlying model, the regions outside the homology match are treated as a gross approximation, a tied intron model is used, and a probabilistic model is applied on either side of the homology segment (Birney et al., 2004).

Figure 2. GeneWise output file of a protein-coding gene in Arabidopsis thaliana.


NCBI Prokaryotic Genomes Automatic Annotation Pipeline

The NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAP) is used to annotate the genomes of bacteria and archaea. It combines ab initio and homology-based gene prediction algorithms. Structural and functional annotation relies on protein family models: a collection of Hidden Markov Model-based and BLAST-based protein families, together with Conserved Domain Database architectures. To predict protein-coding and RNA genes, as well as other functional elements, the pipeline combines alignment-based approaches with methods that predict directly from sequence. When enough comparable data are available, it relies more on sequence similarity; in the absence of external evidence, it relies more on statistical prediction. It also provides a procedure for creating and clarifying prokaryotic taxonomic annotation. The tool is available as a standalone package at https://github.com/ncbi/pgap (Tatusova et al., 2016).

GenSAS

GenSAS is an online tool available at https://www.gensas.org/. Users upload genome sequences and choose from a variety of tools for tasks such as repeat masking and the prediction of gene models and other structural features, as well as from multiple functional annotation tools, to build a pipeline for whole-genome structural and functional annotation of eukaryotic and prokaryotic genomes. The tool does not annotate contigs that are shorter than the average gene size. It guides the user through the annotation process with a user-friendly interface and performs structural and functional annotation of whole-genome assemblies or single DNA sequences. It uses Apollo for the manual curation of gene models (Humann et al., 2019).

Ori-Finder

Ori-Finder is an online tool available at http://tubic.org/Ori-Finder2/public/index.php/index. In archaeal genomes, the locations of a large proportion of replication origins remain unknown. Ori-Finder uses an integrated method to predict replication origins in archaeal genomes by examining base composition asymmetry, the distribution of ORB elements, and the occurrence of genes frequently found close to replication origins. Determining replication origins is also essential for freshly sequenced bacterial genomes. For these, the prediction is based on an integrated approach that includes gene identification, Z-curve analysis of base composition asymmetry, the distribution of DnaA boxes, the occurrence of genes that are commonly adjacent to oriC, and phylogenetic relationships. The web interface is written as CGI Perl scripts and runs on an Apache server (Gao et al., 2008).
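The Z-curve analysis of base composition asymmetry mentioned above can be illustrated with a minimal Python sketch. It uses the standard Z-curve definition (cumulative purine/pyrimidine, amino/keto, and weak/strong hydrogen-bond differences); the function name z_curve is hypothetical, and this is not the Ori-Finder implementation, which combines the curve with several other lines of evidence.

```python
# Contribution of each base to the three Z-curve components:
#   x: purine (A, G) vs pyrimidine (C, T)
#   y: amino (A, C) vs keto (G, T)
#   z: weak H-bond (A, T) vs strong H-bond (G, C)
STEP = {
    "A": (1, 1, 1),
    "G": (1, -1, -1),
    "C": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def z_curve(seq):
    """Return the cumulative (x, y, z) Z-curve coordinates along seq."""
    x = y = z = 0
    points = []
    for base in seq.upper():
        dx, dy, dz = STEP.get(base, (0, 0, 0))  # ambiguous bases add nothing
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    for point in z_curve("AGCT"):
        print(point)
```

In practice, extremes of components derived from such a curve are inspected as candidate origin regions; any threshold or smoothing step would be an additional assumption beyond this sketch.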

P2RP

The P2RP web server, available at http://www.p2rp.org/, provides a platform for regulatory protein prediction. It identifies and annotates transcription factors (TFs) and two-component system (TCS) proteins within a sequence of interest and provides detailed annotation of each regulatory protein gene, including classification, sequence features, and functional domains. When genomic DNA sequences are used as input, putative genes are translated to constitute a proteome for UP prediction, and the predicted proteins are examined for the presence of DNA-binding domains. The tool is interactive, user-friendly, and convenient, and it helps to increase the consistency of annotation and reannotation of regulatory proteins in published genomes (Barakat et al., 2013).

KAAS

The KAAS web server, available at https://www.genome.jp/tools/kaas/, provides functional gene annotation through BLAST or GHOST comparisons against the manually curated KEGG GENES database. The result consists of KO (KEGG Orthology) assignments and automatically constructed KEGG pathways. Internally, KAAS is used to annotate query genes for the MAPLE function evaluator. A template data set is defined as one or more species, and accuracy improves when species closely related to the query are included in the data set. KO assignments are made on the basis of bi-directional or single-directional best hits, using the most up-to-date hit data and Smith-Waterman scores. KAAS thus implements an automated approach for assigning K numbers to genes in a genome. When compared with the manually curated KEGG GENES database, the algorithm, which uses sequence similarity, bi-directional best hits, and certain heuristics, achieves a high level of accuracy (Moriya et al., 2007).
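The bi-directional best hit idea can be sketched in a few lines of Python. This is only an illustration of the concept under simplifying assumptions: the similarity scores are made-up numbers, the function names are hypothetical, and the real KAAS procedure adds further heuristics that are not reproduced here.

```python
def best_hits(scores):
    """Map each sequence to its highest-scoring partner.

    scores: dict of {sequence_id: {partner_id: similarity_score}}
    """
    return {seq: max(hits, key=hits.get) for seq, hits in scores.items() if hits}


def bidirectional_best_hits(forward, reverse):
    """Keep query/reference pairs that are each other's best hit.

    forward: query -> reference scores; reverse: reference -> query scores.
    """
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {q: ref for q, ref in fwd.items() if rev.get(ref) == q}


if __name__ == "__main__":
    # Hypothetical scores between two query genes and two reference genes.
    forward = {"q1": {"K1": 200, "K2": 150}, "q2": {"K1": 180}}
    reverse = {"K1": {"q1": 200, "q2": 180}, "K2": {"q1": 150}}
    print(bidirectional_best_hits(forward, reverse))
```

Here q2 also hits K1, but K1's best hit is q1, so only the q1/K1 pair is retained. A single-directional best hit would keep both.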

SimpleSynteny

SimpleSynteny is an online graphical tool, available at https://www.dveltri.com/simplesynteny/, that produces images for comparative genome analysis. Biologists who use comparative genomic techniques to examine the evolution of organisms are frequently tasked with defining syntenic links among orthologous gene clusters. SimpleSynteny is a focused web-based tool for directly investigating the collinearity of local genomic neighborhoods across several species. It supports targeted analysis of selected contigs and can be used, for example, to compare secondary metabolite genes or rearrangements in mating genes across taxa. It uses an iterative method to explore and edit genomes, allowing users to change the way multiple contigs are organized. Additional tools help to determine whether contigs in a genome assembly include the gene targets and to optimize and elucidate circular genomes (Veltri et al., 2016).

Figure 3. Phylogenetic analysis with MEGA: genome-wide interspecies phylogeny of genes in the Musaceae family.

Genome Data Resources and Tools for Sequence Analysis

MEGA11

MEGA11 is a comprehensive tool for building phylogenetic trees of species, pathogens, and gene families. It is designed to work with larger data sets and compresses large data efficiently on the basis of common site configurations. Using multispecies sequence alignments, a Bayesian approach is used to estimate the neutral evolutionary probability of alleles in a species. The tool contains distance-based and maximum likelihood methods. It includes methods for selecting the best-fit substitution model, reconstructing phylogenies, predicting ancestral sequences, estimating evolutionary distances and divergence times, testing for selection, and diagnosing disease mutations. MEGA11 supports the range of computer systems now used by molecular evolution and phylogenetics researchers: natively developed graphical user interface and command-line versions are available at www.megasoftware.net for Microsoft Windows, macOS, and Linux (Tamura et al., 2021).

DNAPlotter

DNAPlotter is an interactive Java application that generates circular and linear representations of genomes. It is used to produce images of circular and linear DNA maps that display regions and gene features in their genomic context. It uses the Artemis libraries to provide a user-friendly way of loading data from relational databases as well as from sequence files. It can be run as a standalone application or launched from Artemis, and it reads the common sequence formats GFF, EMBL, and GenBank. Tracks can indicate the location of genes for which orthologs have been computed, and GC content and GC skew can also be plotted. The software is freely available for macOS, UNIX, and Windows at http://www.sanger.ac.uk/Software/Artemis/circular/ (Carver et al., 2009).
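The GC content and GC skew tracks mentioned above are simple windowed statistics. The following Python sketch computes both over non-overlapping windows, assuming the usual definition of GC skew as (G - C) / (G + C); the function name and the window size are illustrative choices and do not describe DNAPlotter's internal code.

```python
def gc_content_and_skew(seq, window):
    """Return (start, gc_content, gc_skew) for each full non-overlapping window."""
    results = []
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        gc_content = (g + c) / window
        # Skew is undefined when the window has no G or C; report 0.0 here.
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        results.append((start, gc_content, gc_skew))
    return results


if __name__ == "__main__":
    for row in gc_content_and_skew("GGGCAAAA", window=4):
        print(row)
```

Plotting the skew values against window position gives the kind of track that is typically drawn as an inner ring on a circular genome map.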

SNPServer

SNPServer is a configurable, real-time web tool for finding SNPs (single nucleotide polymorphisms) within DNA sequence data. It provides a web interface to the BLAST, CAP3, and AutoSNP tools: BLAST finds related EST sequences, CAP3 clusters and aligns these sequences, and the alignments are fed into the AutoSNP discovery software, which recognizes SNPs as well as insertion and deletion polymorphisms. SNPServer and AutoSNP use redundancy to distinguish between candidate SNPs and sequence errors. Two measures of confidence are derived for each candidate SNP: polymorphism redundancy at the SNP locus and cosegregation of the candidate SNP with other SNPs in the alignment (Savage et al., 2005).

The results of this SNP discovery procedure, together with the source of the EST data and the annotation, are stored in autoSNPdb. This database contains freely available SNP data on rice, barley, and Brassica spp. autoSNPdb can be used to identify SNPs and insertions/deletions in specific genes or in genes associated with specific attributes, as well as between specific pairs or groups of plant types. A user-friendly graphical user interface makes it simple to visualize the SNPs in the database. QualitySNPng uses a haplotype-based technique to find and visualize SNPs in NGS data without the need for a completely sequenced reference genome.
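The redundancy criterion can be sketched as follows. This minimal Python example assumes a set of already aligned, equal-length sequences and reports a column as a candidate SNP only when at least two different bases are each supported by a minimum number of reads; the function name and the default threshold of 2 are assumptions made for illustration and are not taken from the AutoSNP source code.

```python
from collections import Counter


def candidate_snps(aligned, min_redundancy=2):
    """Return (position, {base: count}) for columns with two or more
    alleles that each occur in at least min_redundancy sequences."""
    snps = []
    for pos in range(len(aligned[0])):
        # Ignore gaps and ambiguous characters in this simple sketch.
        column = [seq[pos] for seq in aligned if seq[pos] in "ACGT"]
        counts = Counter(column)
        supported = {base: n for base, n in counts.items() if n >= min_redundancy}
        if len(supported) >= 2:
            snps.append((pos, supported))
    return snps


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTA", "ATGTA", "ATGTC"]
    print(candidate_snps(reads))
```

In this toy alignment, position 1 (C/T, two reads each) is reported, while the single C at position 4 is treated as a likely sequencing error because only one read supports it.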

SNP2CAPS

SNP-based marker tests offer the ability to address a wide range of important biological issues, but most SNP genotyping assays need costly and specialized equipment. With the introduction of numerous such assays in recent years, a demand has emerged for a robust yet cost-effective test that can be carried out using traditional gel-based techniques. Cleaved amplified polymorphic sequences (CAPS) have been shown to be reliable and cost-effective markers for assessing SNPs and insertion/deletion polymorphisms in laboratories without sophisticated equipment. The basic principle of a CAPS assay is the PCR amplification of an SNP site followed by its detection with a suitable restriction endonuclease whose recognition sequence has been changed or introduced by the SNP. Selecting appropriate restriction enzymes can be a complex and time-consuming procedure if done manually. SNP2CAPS is a program that makes it easier to convert SNPs into CAPS markers. Following a simple algorithm, multiply-aligned sequences are screened for restriction sites, and a selection pipeline then eliminates CAPS candidates for which putative alternative restriction sites are identified. SNP2CAPS has both a command-line and a graphical user interface. SNP2CAPS is available free of charge from http://pgrc.ipk-gatersleben.de/snp2caps/ (Wu et al., 2004).
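The core of the CAPS principle, a restriction site that is present in one allele and absent in the other, can be shown with a small Python sketch. The three enzymes and their recognition sequences are well known, but the short enzyme table, the function name, and the example alleles are illustrative assumptions; SNP2CAPS itself screens a much larger enzyme set and applies the additional selection step described above.

```python
# A deliberately tiny enzyme table; real screens use full enzyme databases.
ENZYMES = {
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "BamHI": "GGATCC",
}


def caps_candidates(allele_a, allele_b, enzymes=ENZYMES):
    """Return enzymes whose number of recognition sites differs between alleles."""
    a = allele_a.upper()
    b = allele_b.upper()
    return [name for name, site in enzymes.items() if a.count(site) != b.count(site)]


if __name__ == "__main__":
    # Hypothetical amplicons that differ by a single A/G SNP.
    allele_1 = "TTGAATTCAA"
    allele_2 = "TTGAGTTCAA"
    print(caps_candidates(allele_1, allele_2))
```

Because the listed recognition sequences are palindromic, scanning one strand is sufficient in this sketch; enzymes with non-palindromic sites would also require a search on the reverse complement.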

TASSEL

Association analyses that take advantage of a genome's natural diversity to map traits at extremely high resolution are becoming essential. Researchers must, however, deal with the confounding effects of population and family structure. TASSEL (Trait Analysis by aSSociation, Evolution, and Linkage) is a program that controls for population and family structure using general linear and mixed linear models. TASSEL implements both the General Linear Model (GLM) and the Mixed Linear Model (MLM) methods. To reduce the likelihood of false associations, the GLM technique employs a structured association analysis based on a Q matrix. The Q matrix is a representation of population structure that may be calculated using the STRUCTURE software or a principal components analysis (PCA) approach. The MLM technique incorporates both the kinship (K) matrix and the Q matrix into its model to limit the chance of detecting false-positive associations. The K matrix, which describes the average relatedness between pairs of individuals, can be estimated from pedigree information or from genotyping data for a large number of unlinked markers across the whole genome of the organism. TASSEL can process data from plant, animal, and human populations. It enables linkage disequilibrium to be estimated and these estimates to be displayed graphically for interpretation. Other capabilities include InDel analysis, diversity analysis, PCA, and missing data imputation. For data extraction and visualisation, the software includes a sequence alignment viewer, neighbor-joining cladogram construction, and several data graphing methods. The TASSEL user manual and other materials are freely accessible via the TASSEL website https://www.maizegenetics.net/tassel (Bradbury et al., 2007).
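As a small illustration of what a linkage disequilibrium estimate is, the Python sketch below computes the widely used r-squared statistic for two biallelic loci from phased haplotypes coded as 0/1. The function name and input encoding are assumptions for this example; the sketch does not reproduce TASSEL's own estimators or its handling of missing data.

```python
def ld_r_squared(haplotypes):
    """Compute r^2 between two biallelic loci.

    haplotypes: list of (allele_at_locus_A, allele_at_locus_B), each 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


if __name__ == "__main__":
    linked = [(0, 0), (0, 0), (1, 1), (1, 1)]
    unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(ld_r_squared(linked), ld_r_squared(unlinked))
```

Values close to 1 indicate that the two loci are almost always inherited together, while values near 0 indicate independence; pairwise values of this kind are what LD plots typically display.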

STRUCTURE

Pritchard et al. (2000) developed STRUCTURE, a freely available software tool for analyzing population structure using multilocus genotype data. Within a single population, the STRUCTURE program can detect the presence of two or more homogeneous groups. It uses a model-based Bayesian algorithm to cluster individuals that have been genotyped for several unlinked markers. SSRs, SNPs, and AFLPs are all examples of genetic markers that can be used. The program tries to find out how many homogeneous groups are most likely to exist in a given population. It has been used to find genetic patterns in sampled populations, to assign individuals to distinct groups within the sample, and for population admixture and hybridization analysis, among many other things. The majority of research shows that STRUCTURE efficiently assigns individuals to their populations of origin, especially when the population comprises two to four well-differentiated homogeneous groups. Executables are available for Windows, Linux, and Mac. The computational component of the software is written in C, and it includes a Java front end with a variety of handy features (Porras-Hurtado et al., 2013). The STRUCTURE program is free to download and use at https://web.stanford.edu/group/pritchardlab/structure.html.

ClustalW

In bioinformatics, multiple sequence alignment is possibly the most useful investigative tool. It helps in the prediction of protein structure and function and serves as the foundation for phylogenetic research. Clustal programmes are of two types: ClustalW (command-line user interface) and ClustalX (graphical user interface). Because it is straightforward to use, ClustalW is the most extensively used multiple sequence alignment tool. It uses a progressive alignment method in which pairs of sequences are first compared for similarity. Each aligned pair of similar sequences is then treated as a single sequence, and the resulting sequences are in turn compared and aligned in pairs. This procedure continues until all sequences are aligned. Clustal programs support a variety of input and output formats, including Clustal, FASTA, and PHYLIP; however, the FASTA format is the most convenient for ClustalW input sequences (Thompson et al., 1994). The software is freely accessible at https://www.genome.jp/tools-bin/clustalw.

Bioinformatics Databases

A biological database is a vast, well-organized collection of persistent data that is usually accompanied by software that allows users to edit, query, and retrieve data from the system. It is a systematic collection of massive amounts of knowledge about a particular topic, such as nucleotide sequences or protein sequences. The software applications used to organize, search, and retrieve such data are called database management systems (DBMS). Relational and object-oriented database management systems are the most often used. MySQL is a full-featured, open-source relational database management system that is commonly deployed in a three-tier design comprising user interface, application logic, and data storage levels. The primary focus of biological databases is the storage and administration of DNA and protein sequence data. The primary databases for nucleotide sequences are GenBank, EMBL (European Molecular Biology Laboratory), DDBJ (DNA Data Bank of Japan), and GSDB (Genome Sequence Database), whereas the primary databases for protein sequences are Swiss-Prot, TrEMBL (Translation of EMBL nucleotide sequence database), PIR (Protein Information Resource), and MIPS (Martinsried Institute for Protein Sequences). Several specialized databases dedicated to particular organisms are also available (Table 1).

Figure 4. Bioinformatics database.

Table 1. Other tools for big data analysis

Tool | URL | Reference

Genome Assembly
SOAPdenovo | https://sourceforge.net/projects/soapdenovotrans/ | Xie et al., 2014
Velvet | https://www.ebi.ac.uk/~zerbino/velvet | Zerbino and Birney, 2008
GenomeQC | https://genomeqc.maizegdb.org/ | Manchanda et al., 2020
ABySS | http://www.bcgsc.ca/platform/bioinfo/software/abyss | Simpson et al., 2009
Allpaths-LG | http://www.broadinstitute.org/science/programs/genomebiology/crd | Gnerre et al., 2011
AutoSeqMan | https://github.com/Sun-Yanbo/autoSeqMan | Jin et al., 2018
BioNanoAnalyst | https://github.com/AppliedBioinformatics/BioNanoAnalyst | Yuan et al., 2017
CANU | https://github.com/marbl/canu | Koren et al., 2017
FALCON | https://github.com/PacificBiosciences/falcon | Chin et al., 2016
GAM-NGS | https://github.com/vice87/gam-ngs | Vicedomini et al., 2013
GenSeed | http://www.coccidia.icb.usp.br/genseed/ | Sobreira et al., 2008
Kermit | https://github.com/rikuu/kermit | Walve et al., 2019
QUAST | http://bioinf.spbau.ru/quast | Gurevich et al., 2013

Mapping to Reference
Bowtie | http://bowtie-bio.sourceforge.net/index.shtml | Langmead et al., 2009
Hisat | http://www.ccb.jhu.edu/software/hisat/index.shtml | Kim et al., 2015
BLAST | https://blast.ncbi.nlm.nih.gov/Blast.cgi | Altschul et al., 1990
BFAST | https://sourceforge.net/projects/bfast/ | Homer et al., 2009
BLAT | http://genome.ucsc.edu/ | Kent et al., 2002
GenomeMapper | https://1001genomes.org/ | Schneeberger et al., 2009
GMaP | http://www.gmaptool.eu/en | Wu et al., 2005

Molecular Marker
AutoSNP | http://acpfg.imb.uq.edu.au | Duran et al., 2009
MISA | http://pgrc.ipk-gatersleben.de/misa/ | Beier et al., 2017
PolyBayes | http://bioinformatics.bc.edu/marthlab/PolyBayes | Marth et al., 1999
PolyPhred | http://droog.mbt.washington.edu/ | Nickerson et al., 1997

Transcriptome
TopHat | http://tophat.cbcb.umd.edu/ | Trapnell et al., 2012
StringTie | http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.0.3.tar.gz | Pertea et al., 2015
CuffLinks | http://cole-trapnell-lab.github.io/cufflinks/announcements/cufflinks-github/ | Trapnell et al., 2012
IsoformEx | http://bioinformatics.wistar.upenn.edu/isoformex | Kim et al., 2011
RNA-MATE | http://grimmond.imb.uq.edu.au/RNA-MATE/ | Cloonan et al., 2009
STAR | https://code.google.com/archive/p/rna-star/ | Dobin et al., 2013
EdgeR | http://bioconductor.org/packages/release/bioc/html/edgeR.html | McCarthy et al., 2012
NOISeq | https://bioinfo.cipf.es/noiseq/doku.php | Tarazona et al., 2015

Free Source
ArrayExpressHTS | http://www.ebi.ac.uk/tools/rcloud | Goncalves et al., 2011
Chipster | http://chipster.csc.fi/ | Kallio et al., 2011
DEWE | https://www.sing-group.org/dewe | López-Fernández et al., 2019
easyRNASeq | http://bioconductor.org/packages/release/bioc/html/easyRNASeq.html | Delhomme et al., 2012

Visualization
Artemis | http://www.sanger.ac.uk/Software/Artemis | Rutherford et al., 2000
BamView | http://bamview.sourceforge.net/ | Carver et al., 2013
Browser Genome | http://www.browsergenome.org/ | Schmid-Burgk et al., 2015
Integrative Genomics Viewer | http://www.broadinstitute.org/igv | Thorvaldsdóttir et al., 2013
MapView | http://evolution.sysu.edu.cn/mapview/ | Bao et al., 2009
Bandage | http://rrwick.github.io/Bandage | Wick et al., 2015

GenBank

Since 1992, GenBank (http://www.ncbi.nih.gov/Genbank) has been the principal genomic DNA sequence database maintained at NCBI in the United States. It is a freely accessible, annotated library of all publicly available nucleotide sequences and their protein translations. GenBank sequences are divided into two categories: organismal and functional. Nucleotide sequences from specific species are stored in the organismal category, whereas sequences of particular types, such as expressed sequence tags (ESTs) and high-throughput genomic sequences (HTGs), are stored in the functional category, irrespective of the source organism. The PIR database stores the translations of nucleotide sequences recorded in GenBank. The GenBank database is linked to the DDBJ and EMBL databases, and sequence information is exchanged among these three databases (Benson et al., 2000).

Phytozome

The goal of the Phytozome database is to enable genomics-based research on plant evolution and to make it easier to apply functional genomics data from model plants to crop plant development. It provides comparative data on genomes and gene families, as well as analysis tools. The Phytozome database can be accessed at http://www.phytozome.net/. It contains detailed information on the evolution of each gene's nucleotide sequence and structure, as well as on the evolutionary history and genomic structure of plant gene families. Various tools for searching, identifying, and evaluating gene families are available on the Phytozome web portal. It contains data on the genomic context of plant genes, gene homologues and paralogues, RNA transcripts from specific genes, alternatively spliced RNA transcripts and the resulting peptide sequences, and gene family activities (Lee et al., 2012).

EMBL

The EMBL Nucleotide Sequence Database is maintained by the European Bioinformatics Institute (EBI), UK, which is an outstation of the EMBL (European Molecular Biology Laboratory), Germany. The database is developed within the International Nucleotide Sequence Database Collaboration (INSDC) and can be found at http://www.ebi.ac.uk/embl. It is the principal European source of nucleotide sequences and is part of a global partnership with DDBJ (Japan) and GenBank (USA). To keep the databases synchronized, data are exchanged among the participating institutes regularly. Sequence similarity search tools such as FASTA and BLAST are also available. Ensembl (a genome annotation database) and Genome Reviews (curated whole-genome sequences kept in the EMBL Nucleotide Sequence Database) are two more genomic databases maintained at EBI (Kanz et al., 2005).

Swiss-Prot

Swiss-Prot (http://www.ebi.ac.uk/) is a curated protein sequence database that aims to provide a high level of annotation, such as descriptions of a protein's function, domain structure, post-translational modifications, and variants, with minimal redundancy and a high degree of integration with other databases. Each protein sequence entry includes both core data and annotation. The core data consist of sequence and taxonomy data together with citation information. The annotation includes function, domains and sites, secondary and quaternary structures, post-translational modifications, and other information. The Swiss-Prot format is quite similar to the EMBL nucleotide sequence database format. TrEMBL (Translation from EMBL) is a Swiss-Prot supplement that provides unannotated protein sequences (Bairoch et al., 1996).

UniProtKB

The UniProt Knowledgebase (UniProtKB) is a central repository of accurate, consistent, and annotated functional information on proteins. Its primary goal is to gather the available information on protein sequence and function and make it available to all users. UniProtKB is composed of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Swiss-Prot is a non-redundant, manually annotated protein sequence database that combines experimental results, computed features, and scientific conclusions, whereas UniProtKB/TrEMBL provides high-quality data that have been computationally processed and enriched with automated annotation and classification. Alongside UniProtKB, the UniProt resource includes the UniProt Archive (UniParc), the UniProt Reference Clusters (UniRef), and the UniProt Metagenomic and Environmental Sequences (UniMES). UniProtKB is accessed through the website https://www.uniprot.org/, whose homepage gives a brief overview of the resource. The website includes tools for querying, data analysis, documentation, and database identifier mapping, among other things, as well as the BLAST tool for comparing sequences and the ClustalW tool for aligning multiple sequences.

Gramene

Gramene is a bioinformatics website for searching, visualizing, and comparing plant genomes and biological pathways. It is a curated, open-source, integrated data resource for comparative functional genomics in crops and model plant species. The database started as a resource for the rice community and a repository of comparative grass mapping studies. A collection of shared anchor genetic markers was used to create comparative maps of several grass species, including rice, wheat, barley, maize, and others. It includes information on genes, metabolic pathways, proteins, genetic diversity, QTLs, and other relevant topics. The information in the database is linked to genetic, physical, bin, and other maps and to genome browsers. It also uses gene trees to predict orthologous and paralogous relationships and uses genetic analysis to validate homology. Gramene describes genes, proteins, phenotypes, and alleles using ontologies, which allows information to be shared with other databases. Gramene offers several web services, including a Distributed Annotation Server and essential tools such as BLAST. All of Gramene's databases and software are available for free download (https://www.gramene.org/).

GrainGenes

GrainGenes (https://wheat.pw.usda.gov/GG3/) is a database of genetic and genomic data on wheat, barley, oat, and rye. It includes curated information on genetic and physical maps, as well as probes used to generate the maps, oligogenic, ESTs, EST-derived simple sequence repeats, and QTLs.

MaizeGDB

MaizeGDB (Maize Genetics and Genomics Database) is a federally funded, community-oriented, long-term informatics service for researchers interested in the agricultural plant and model organism Zea mays. MaizeGDB (http://www.maizegdb.org) is the primary source of data on maize genetics and genomics for the maize research community. It contains information on maize genome sequencing. At PlantGDB, maize sequences from GenBank are downloaded, curated, processed, and assembled into contigs. MaizeGDB provides DNA sequences, genetic research data, QTL experiments, gene products, and relevant literature references (Woodhouse et al., 2021).

Multiple Databases and Tools as Sources

Several bioinformatics resources provide access to a variety of databases, together with database search and data analysis tools. Bioinformatics is advancing at an incredible rate, and software tools that lead their domains today may become less popular or outdated tomorrow. Against this background, integrated bioinformatics resources can often be quite helpful.

NCBI

The National Center for Biotechnology Information (NCBI) is a division of the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH). It is based in Maryland, USA. It is the principal source of biotechnology and biomedicine information and promotes research and health by making biomedical and genetic data accessible to users. The NCBI provides several databases covering nucleotide sequences, health, genomes, genes, organisms, proteins, chemicals, pathways, and relevant literature. GenBank and PubMed (a bibliographic database for biomedical literature) are two of NCBI's major databases. Among the other databases are Gene, Genome, Epigenomics, Gene Expression Omnibus, Structure, RefSeq, Database of Short Genetic Variation (dbSNP), and Taxonomy. The NCBI also offers several database search and data analysis tools, including Entrez, BLAST, Genomes Browser, Cn3D, CDTree, Open Reading Frame Finder, Genetic Codes, SNP Database Specialized Search Tools, and Taxonomy Browser. The NCBI resources include PubMed, PubMed Central, NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, Splign, UniGene, ProtEST, Epigenomics, Genome and related tools, Model Maker, Trace Archive, BioProject, Retroviral Genotyping Tools, Gene Expression Omnibus, Online Mendelian Inheritance in Animals, the Conserved Domain Architecture Retrieval Tool, Protein Clusters, Entrez Programming Utilities, Gene, COBALT, RefSeq, HomoloGene, dbMHC, dbSNP, dbVar, the Genetic Testing Registry, the Map Viewer, Evidence Viewer, Sequence Read Archive, BioSample, the HIV-1/Human Protein Interaction Database, Probe, the Molecular Modeling Database, the Conserved Domain Database, BioSystems, and PubChem (NCBI Resource Coordinators, 2012).

Table 2. A collection of databases related to genomic resources

Database | Type of data stored
Entrez | Sequence and structure of DNA and proteins, gene, genome, genetic variation, and gene expression
PubMed | References, citations, and abstracts on life sciences and biomedical topics
Gene | A record of reported correlations between human variation and observed health state, accompanied by supporting evidence
NCBI Taxonomy | Organisms that have at least one protein or nucleotide sequence in the genetic databases
Assembly | Assembled genome structure, assembly names and other metadata, statistical reports, and genomic sequence data
BioCollections | Metadata for culture collections, museums, herbaria, and natural history collections
BioProject | Collection of biological data related to a single initiative, including diverse data types generated for a single project
BioSample | Biological source materials used in the experiment
ClinVar | Genomic variation and its relationship to human health
Conserved Domains | Annotation of functional units in proteins
dbGaP | Interaction of genotype and phenotype in humans
dbVar | Human genomic structural variation: insertions, deletions, duplications, inversions, mobile elements, translocations, and complex variants
Genome | Organized data on genomes: sequences, maps, chromosomes, assemblies, and annotations
GEO Datasets | Gene Expression Omnibus repository: locate experiments of interest
GEO Profiles | Search for profiles of interest based on gene annotation or pre-computed profile attributes
GTR | Voluntary submission of genetic test information
HomoloGene | Homology groups from complete gene sets of a wide range of eukaryotic species
Identical Protein Groups | Single entry for each protein translation, including coding regions
Nucleotide | Nucleotide sequences from several sources
Protein | Protein sequences from several sources
Protein Clusters | Protein sequences derived from annotations of whole genomes, organelles, and plasmids
PubChem BioAssay | Bioactivity screens of chemical substances
dbSNP | Human single nucleotide variations, microsatellites, and small-scale insertions and deletions
SRA | Raw sequencing data and alignment information
Structure | Three-dimensional structures
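The Entrez Programming Utilities listed among the NCBI resources expose these databases over HTTP. The following Python sketch only builds a query URL for the public ESearch endpoint and does not contact the server; the helper name esearch_url, the example search term, and the retmax value are illustrative assumptions, and NCBI's usage guidelines (rate limits, API keys) should be consulted before sending real requests.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that would return matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical query: maize rbcL records in the Nucleotide database.
    print(esearch_url("nucleotide", "Zea mays[Organism] AND rbcL[Gene]"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the returned identifiers can then be passed to other E-utilities to download the records themselves.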

KEGG

KEGG (Kyoto Encyclopedia of Genes and Genomes) (https://www.genome.jp/kegg/) is a database resource for understanding high-level functions and utilities of biological systems, including cells, organisms, and ecosystems, from molecular-level data, particularly large-scale molecular datasets generated by genome sequencing and other high-throughput experimental methods. It also includes KEGG mapping tools, which help researchers understand cellular and organism-level functions by analyzing genome sequences and other molecular data. Its goal is to derive higher-order biological processes and their significance to cells and organisms from genetic information. KEGG comprises several databases, namely PATHWAY, BRITE, MODULE, ORTHOLOGY, GENES, GENOME, COMPOUND, GLYCAN, REACTION, ENZYME, NETWORK, DISEASE, DRUG, and MEDICUS (Kanehisa et al., 2021).

Table 3. Type of data stored in KEGG databases

KEGG database | Type of data stored
PATHWAY | Collection of manually drawn pathway maps that represent molecular interaction, reaction, and relationship networks
BRITE | A group of hierarchical classification systems that capture the functional hierarchies of numerous biological objects
MODULE | Gene sets and reaction sets defined as functional units
ORTHOLOGY | Molecular functions represented in terms of functional orthologs
GENES | Genes and proteins derived from complete genomes of cellular organisms and viruses using publicly accessible resources
GENOME | Collection of KEGG organisms, which are the organisms with complete genome sequences
COMPOUND | Small molecules, biopolymers, and other chemical compounds that are important to biological systems
GLYCAN | Collection of glycan structures that have been determined experimentally
REACTION | Collection of all reactions found in the KEGG metabolic pathway maps
ENZYME | Implementation of the Enzyme Nomenclature
NETWORK | Knowledge of diseases and drugs captured in terms of perturbed molecular networks
DISEASE | Diseases viewed as perturbed states of the molecular network system
DRUG | Complete drug information database based on the chemical structure and chemical components of active ingredients
MEDICUS | Health-related information resources aimed at bringing in the genomic revolution

Conclusion

In the era of computational biology and genomics, knowledge of biological databases and tools greatly benefits researchers, academicians, and scientists. Familiarity with the command line and with scripting and programming languages (Bash, Perl, Python, Java) makes it easier to work with the genomics tools discussed here. It is necessary to understand the input each tool requires, the format in which it is expected, and the optimum parameters for the desired output. Most tools used in genomics require a Linux OS, which provides dependencies that are unavailable on Windows. Generating quality data and figures requires choosing tools according to the requirements and the type of analysis to be performed. A genomic study calls for genome-wide analysis, whole-genome investigation, identification of molecular markers, visualization of synteny, chromosomal localization, and phylogenetic trees, together with sound knowledge of databases. Work in genomics generates a huge amount of data, and databases are needed to store it. Many databases are available from which information can be retrieved on biological sequences, molecular markers, protein structures, chemical information, organism information, biological pathways, gene ontologies, literature, diseases, and much more. In the future, a database of mutations and their evolution may be needed, because researchers find it difficult to identify the types of mutation that occur during analysis, and these remain ignored in most cases.

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J., (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), pp. 403-410.
Andrews, S., (2010). FastQC: a quality control tool for high throughput sequence data.
Bairoch, A. and Apweiler, R., (1996). The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucleic acids research, 24(1), pp. 21-25.
Bao, H., Guo, H., Wang, J., Zhou, R., Lu, X. and Shi, S., (2009). MapView: visualization of short reads alignment on a desktop computer. Bioinformatics, 25(12), pp. 1554-1555.
Barakat, M., Ortet, P. and Whitworth, D. E., (2013). P2RP: a web-based framework for the identification and analysis of regulatory proteins in prokaryotic genomes. BMC genomics, 14(1), pp. 1-6.
Beier, S., Thiel, T., Münch, T., Scholz, U. and Mascher, M., (2017). MISA-web: a web server for microsatellite prediction. Bioinformatics, 33(16), pp. 2583-2585.
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A. and Wheeler, D. L., (2000). GenBank. Nucleic acids research, 28(1), pp. 15-18.
Birney, E., Clamp, M. and Durbin, R., (2004). GeneWise and genomewise. Genome research, 14(5), pp. 988-995.
Bradbury, P. J., Zhang, Z., Kroon, D. E., Casstevens, T. M., Ramdoss, Y. and Buckler, E. S., (2007). TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics, 23(19), pp. 2633-2635.
Carver, T., Harris, S. R., Otto, T. D., Berriman, M., Parkhill, J. and McQuillan, J. A., (2013). BamView: visualizing and interpretation of next-generation sequencing read alignments. Briefings in bioinformatics, 14(2), pp. 203-212.
Carver, T., Thomson, N., Bleasby, A., Berriman, M. and Parkhill, J., (2009). DNAPlotter: circular and linear interactive genome visualization. Bioinformatics, 25(1), pp. 119-120.
Chin, C. S., Peluso, P., Sedlazeck, F. J., Nattestad, M., Concepcion, G. T., Clum, A., Dunn, C., O’Malley, R., Figueroa-Balderas, R., Morales-Cruz, A. and Cramer, G. R., (2016). Phased diploid genome assembly with single-molecule real-time sequencing. Nature methods, 13(12), pp. 1050-1054.
Cloonan, N., Xu, Q., Faulkner, G. J., Taylor, D. F., Tang, D. T., Kolle, G. and Grimmond, S. M., (2009). RNA-MATE: a recursive mapping strategy for high-throughput RNA-sequencing data. Bioinformatics, 25(19), pp. 2615-2616.
Delhomme, N., Padioleau, I., Furlong, E. E. and Steinmetz, L. M., (2012). easyRNASeq: a bioconductor package for processing RNA-Seq data. Bioinformatics, 28(19), pp. 2532-2533.
Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M. and Gingeras, T. R., (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), pp. 15-21.
Duran, C., Appleby, N., Clark, T., Wood, D., Imelfort, M., Batley, J. and Edwards, D., (2009). AutoSNPdb: an annotated single nucleotide polymorphism database for crop plants. Nucleic Acids Research, 37(suppl_1), pp. D951-D953.
Gao, F. and Zhang, C. T., (2008). Ori-Finder: a web-based system for finding oriCs in unannotated bacterial genomes. BMC bioinformatics, 9(1), pp. 1-6.
Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F. J., Burton, J. N., Walker, B. J., Sharpe, T., Hall, G., Shea, T. P., Sykes, S. and Berlin, A. M., (2011). High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108(4), pp. 1513-1518.
Goncalves, A., Tikhonov, A., Brazma, A. and Kapushesky, M., (2011). A pipeline for RNA-seq data processing and quality assessment. Bioinformatics, 27(6), pp. 867-869.
Gurevich, A., Saveliev, V., Vyahhi, N. and Tesler, G., (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), pp. 1072-1075.
Homer, N., Merriman, B. and Nelson, S. F., (2009). BFAST: an alignment tool for large scale genome resequencing. PloS one, 4(11), p. e7767.
Humann, J. L., Lee, T., Ficklin, S. and Main, D., (2019). Structural and functional annotation of eukaryotic genomes with GenSAS. In Gene prediction (pp. 29-51). Humana, New York, NY.
Jin, J. Q. and Sun, Y. B., (2018). AutoSeqMan: batch assembly of contigs for Sanger sequences. Zoological Research, 39(2), p. 123.
Kallio, M. A., Tuimala, J. T., Hupponen, T., Klemelä, P., Gentile, M., Scheinin, I., Koski, M., Käki, J. and Korpelainen, E. I., (2011). Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC genomics, 12(1), pp. 1-14.
Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M. and Tanabe, M., (2021). KEGG: integrating viruses and cellular organisms. Nucleic acids research, 49(D1), pp. D545-D551.
Kanz, C., Aldebert, P., Althorpe, N., Baker, W., Baldwin, A., Bates, K., Browne, P., van den Broek, A., Castro, M., Cochrane, G. and Duggan, K., (2005). The EMBL nucleotide sequence database. Nucleic acids research, 33(suppl_1), pp. D29-D33.
Kent, W. J., (2002). BLAT—the BLAST-like alignment tool. Genome research, 12(4), pp. 656-664.
Kim, D., Langmead, B. and Salzberg, S. L., (2015). HISAT: a fast spliced aligner with low memory requirements. Nature methods, 12(4), pp. 357-360.
Kim, H., Bi, Y., Pal, S., Gupta, R. and Davuluri, R. V., (2011). IsoformEx: isoform level gene expression estimation using weighted non-negative least squares from mRNA-Seq data. BMC bioinformatics, 12(1), pp. 1-9.
Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H. and Phillippy, A. M., (2017). Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research, 27(5), pp. 722-736.
Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L., (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), pp. 1-10.
Lee, H. C., Lai, K., Lorenc, M. T., Imelfort, M., Duran, C. and Edwards, D., (2012). Bioinformatics tools and databases for analysis of next-generation sequence data. Briefings in functional genomics, 11(1), pp. 12-24.
López-Fernández, H., Blanco-Míguez, A., Fdez-Riverola, F., Sánchez, B. and Lourenço, A., (2019). DEWE: A novel tool for executing differential expression RNA-Seq workflows in biomedical research. Computers in biology and medicine, 107, pp. 197-205.
Manchanda, N., Portwood, J. L., Woodhouse, M. R., Seetharam, A. S., Lawrence-Dill, C. J., Andorf, C. M. and Hufford, M. B., (2020). GenomeQC: a quality assessment tool for genome assemblies and gene structure annotations. BMC genomics, 21(1), pp. 1-9.
Marth, G. T., Korf, I., Yandell, M. D., Yeh, R. T., Gu, Z., Zakeri, H., Stitziel, N. O., Hillier, L., Kwok, P. Y. and Gish, W. R., (1999). A general approach to single-nucleotide polymorphism discovery. Nature genetics, 23(4), pp. 452-456.
McCarthy, D. J., Chen, Y. and Smyth, G. K., (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic acids research, 40(10), pp. 4288-4297.
Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A. C. and Kanehisa, M., (2007). KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic acids research, 35(suppl_2), pp. W182-W185.
Nickerson, D. A., Tobe, V. O. and Taylor, S. L., (1997). PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic acids research, 25(14), pp. 2745-2751.
Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T. C., Mendell, J. T. and Salzberg, S. L., (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature biotechnology, 33(3), pp. 290-295.
Porras-Hurtado, L., Ruiz, Y., Santos, C., Phillips, C., Carracedo, Á. and Lareu, M. V., (2013). An overview of STRUCTURE: applications, parameter settings, and supporting software. Frontiers in genetics, 4, p. 98.
Pritchard, J. K., Stephens, M. and Donnelly, P., (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), pp. 945-959.
Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M. A. and Barrell, B., (2000). Artemis: sequence visualization and annotation. Bioinformatics, 16(10), pp. 944-945.
Savage, D., Batley, J., Erwin, T., Logan, E., Love, C. G., Lim, G. A., Mongin, E., Barker, G., Spangenberg, G. C. and Edwards, D., (2005). SNPServer: a real-time SNP discovery tool. Nucleic Acids Research, 33(suppl_2), pp. W493-W495.
Schmid-Burgk, J. L. and Hornung, V., (2015). BrowserGenome.org: web-based RNA-seq data analysis and visualization. Nature Methods, 12(11), p. 1001.
Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O. and Weigel, D., (2009). Simultaneous alignment of short reads against multiple genomes. Genome biology, 10(9), pp. 1-12.
Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. and Birol, I., (2009). ABySS: a parallel assembler for short read sequence data. Genome research, 19(6), pp. 1117-1123.
Sobreira, T. J. and Gruber, A., (2008). Sequence-specific reconstruction from fragmentary databases using seed sequences: implementation and validation on SAGE, proteome and generic sequencing data. Bioinformatics, 24(15), pp. 1676-1680.
Tamura, K., Stecher, G. and Kumar, S., (2021). MEGA11: molecular evolutionary genetics analysis version 11. Molecular biology and evolution, 38(7), pp. 3022-3027.
Tarazona, S., Furió-Tarí, P., Turrà, D., Pietro, A. D., Nueda, M. J., Ferrer, A. and Conesa, A., (2015). Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic acids research, 43(21), pp. e140-e140.
Tatusova, T., DiCuccio, M., Badretdin, A., Chetvernin, V., Nawrocki, E. P., Zaslavsky, L., Lomsadze, A., Pruitt, K. D., Borodovsky, M. and Ostell, J., (2016). NCBI prokaryotic genome annotation pipeline. Nucleic acids research, 44(14), pp. 6614-6624.
Thompson, J. D., Higgins, D. G. and Gibson, T. J., (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research, 22(22), pp. 4673-4680.
Thorvaldsdóttir, H., Robinson, J. T. and Mesirov, J. P., (2013). Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in bioinformatics, 14(2), pp. 178-192.
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., Salzberg, S. L., Rinn, J. L. and Pachter, L., (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols, 7(3), pp. 562-578.
Veltri, D., Wight, M. M. and Crouch, J. A., (2016). SimpleSynteny: a web-based tool for visualization of microsynteny across multiple species. Nucleic acids research, 44(W1), pp. W41-W45.
Vicedomini, R., Vezzi, F., Scalabrin, S., Arvestad, L. and Policriti, A., (2013). GAM-NGS: genomic assemblies merger for next generation sequencing. BMC bioinformatics, 14(7), pp. 1-18.
Walve, R., Rastas, P. and Salmela, L., (2019). Kermit: linkage map guided long read assembly. Algorithms for Molecular Biology, 14(1), pp. 1-10.
Wick, R. R., Schultz, M. B., Zobel, J. and Holt, K. E., (2015). Bandage: interactive visualization of de novo genome assemblies. Bioinformatics, 31(20), pp. 3350-3352.
Woodhouse, M. R., Cannon, E. K., Portwood, J. L., Harper, L. C., Gardiner, J. M., Schaeffer, M. L. and Andorf, C. M., (2021). A pan-genomic approach to genome databases using maize as a model system. BMC plant biology, 21(1), pp. 1-10.
Wu, T. D. and Watanabe, C. K., (2005). GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, 21(9), pp. 1859-1875.
Xie, Y., Wu, G., Tang, J., Luo, R., Patterson, J., Liu, S., Huang, W., He, G., Gu, S., Li, S. and Zhou, X., (2014). SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics, 30(12), pp. 1660-1666.
Yuan, Y., Bayer, P. E., Scheben, A., Chan, C. K. K. and Edwards, D., (2017). BioNanoAnalyst: a visualisation tool to assess genome assembly quality using BioNano data. BMC bioinformatics, 18(1), pp. 1-9.
Zerbino, D. R. and Birney, E., (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research, 18(5), pp. 821-829.

Chapter 11

Bioinformatics Tools for Biomarker Discovery

Charles Oluwaseun Adetunji1,*, Frank Abimbola Ogundolie2, Olugbemi Tope Olaniyan3, Omosigho Omoruyi Pius1, Kehinde Kazeem Kanmodi4, and Lawrence Achilles Nnyanzi4

1Applied Microbiology, Biotechnology and Nanotechnology Laboratory, Department of Microbiology, Edo University Iyamho, Auchi, Edo State, Nigeria
2Department of Biotechnology, Baze University Abuja, Nigeria
3Laboratory for Reproductive Biology and Developmental Programming, Department of Physiology, Edo University Iyamho, Nigeria
4School of Health and Life Sciences, Teesside University, Middlesbrough, United Kingdom

Abstract

Biological markers that are used to analyze and measure human health conditions are referred to as biomarkers. These markers are often used along with bio-computational tools to predict the normal or abnormal state of human health and to support the effective management of diseases. Biomarkers play an essential role in medical intervention. Thus, the use of proteomics techniques, bioinformatics, and machine learning to characterize and identify essential proteins in health and disease has been explored for biomarker discovery and treatment interventions. The discovery of these biological markers has aided disease diagnosis and early detection. Biomarker discovery entails identifying candidate biomarkers that, before certification, must undergo biomarker validation and clinical validation.

Keywords: biomarker discovery, bioinformatics tools, next-generation sequencing, RNA/DNA-sequencing, mRNA profiling, clinical validation

Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9 © 2023 Nova Science Publishers, Inc.

Introduction

Early detection and diagnosis are central to the effective and timely management of diseases and ailments (Adetunji et al. 2022i-n). Biological-computational tools for sequence analysis, data extraction, interpretation, prediction, and comparison of nucleic acid data play critical roles in achieving this (Adetunji et al. 2022a-h; Olaniyan et al. 2022a, b; Oyedara et al. 2022). Bioinformatics-supported approaches such as epigenetics, genomics, proteomics, and metabolomics are integral to the discovery of the biological markers used in clinical diagnosis, which are referred to as biomarkers. Biomarkers are biological indicators that objectively describe the biological state of an organism. These markers provide information on the abnormal or normal state of the organism under analysis (Goossens et al. 2015). Today they support better and more efficient diagnosis, evaluation, and disease management. In clinical medicine, they are useful for preventive medicine, diagnosis, therapeutics, prognosis, and drug discovery (Chen et al. 2011). The usefulness of a biomarker depends on the accurate assessment of biological molecules, including proteins, peptides, DNA/RNA, and metabolites of reactions in the organism analyzed (Goossens et al. 2015).

The use of biomarkers in clinical research is significant for understanding patients’ medical conditions, enabling early and accurate diagnosis, understanding the mechanisms of diseases at the molecular level, and increasing the effectiveness of treatments. It also helps in identifying possible new drug targets (Manzanares et al. 2021). Several enzymes with biotechnological applications, such as amylases, glucoamylases, proteases, lipases, and transketolase (Ogundolie, 2015; Ayodeji et al. 2017; Fadare et al. 2021; Ogundolie, 2021; Ogundolie et al. 2022; Adetunji et al. 2023a, b), have been studied over the years as potential biomarkers for conditions such as acute kidney injury (Awdishu et al. 2019), cerebral malaria (Nortey et al. 2022), stress (Takai et al. 2007; Guglielminotti et al. 2012), metabolic disorders (Ko et al. 2020; Wang et al. 2021), cystic fibrosis (Almeslet et al. 2022), the differentiation of intestinal Behcet’s disease from Crohn’s disease (Park et al. 2021), and the detection of thrombin levels in patients (Sun et al. 2015).

In disease management, biomarkers can provide insight into the entire spectrum of the disease under investigation, from the earliest symptoms to the terminal stages (Mayeux, 2004). Biomarker discovery is the route by which new biomarkers are developed. The use of biomarkers for disease management has been well documented. The process of discovering new biomarkers aims to improve on existing ones by using basic information to design new concepts or more personalized targets. With advances in genomic research, single-cell next-generation sequencing, and high-throughput sequencing technologies, which include mRNA profiling techniques such as RNA-sequencing (RNA-seq) and cDNA microarrays (Sheng et al. 2020; Ou et al. 2021), ribosome sequencing (Ribo-seq), methylation sequencing (Methyl-seq) (Churko et al. 2013), and chromatin immunoprecipitation sequencing (ChIP-seq) (Churko et al. 2013), together with large-scale data gathering, analysis, and design, biomarker validation and clinical validation are now easier, quicker, more cost-effective, and more precise.

Typical Examples of Bioinformatics Tools That Could Be Applied in the Discovery of Biomarkers

Hasan et al. (2022) reported that, through bioinformatics and systems biology, they could analyze blood-based biomarkers against SARS-CoV-2 and identify drug targets in COVID-19 patients. The authors used peripheral blood mononuclear cell transcripts to characterize SARS-CoV-2 progression at the gene expression level, with a view to treatment and drug development. Their study analyzed one microarray dataset and two RNA-Seq transcriptomic datasets and identified 102 significant genes. Further analysis of the physiological pathways associated with these genes revealed gene-disease relationships, gene ontology terms, protein hubs, and signaling networks in the pathophysiology of COVID-19.
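
The text does not say how the 102 genes were selected. One common approach in multi-dataset studies, assumed here only for illustration, is to keep the genes that are called significant in every dataset. A minimal Python sketch follows; the function name and the toy gene lists are hypothetical and are not data from Hasan et al. (2022).

```python
def shared_significant_genes(*gene_sets):
    """Return, in sorted order, the genes present in every input collection."""
    if not gene_sets:
        return []
    common = set(gene_sets[0])
    for genes in gene_sets[1:]:
        common &= set(genes)  # keep only genes also found in this dataset
    return sorted(common)


# Hypothetical per-dataset lists of significant genes (illustrative only).
microarray = {"TPX2", "CCNB1", "TOP2A", "GAPDH"}
rnaseq_a = {"TPX2", "CCNB1", "KIF11", "TOP2A"}
rnaseq_b = {"TPX2", "TOP2A", "BUB1B", "CCNB1"}

print(shared_significant_genes(microarray, rnaseq_a, rnaseq_b))
# prints ['CCNB1', 'TOP2A', 'TPX2']
```

With real data, each set would hold the genes that pass a chosen significance threshold in one dataset, and the intersection would give the shared candidates for pathway and network analysis.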

Hasan et al. (2022) reported several blood-based protein hubs, such as TPX2, NCAPG, DLGAP5, CCNB1, HJURP, KIF11, AURKB, TTK, TOP2A, and BUB1B, with great therapeutic potential in COVID-19 patients.

Paolo et al. (2013) reported that in environmental epidemiology, omics techniques serve as valuable tools for understanding and interpreting concepts and models such as omics-based biomarkers. The authors noted that complex hazardous environmental pollutants are major causative agents of disease through their interactions with genes, proteins, and other biomolecules. As such, they must be identified appropriately, and their mechanisms of action unraveled, using next-generation high-throughput omics techniques like metabolomics, adductomics, transcriptomics, proteomics, and epigenomics (Behera et al. 2016; Dash and Abraham, 2018; Dash et al. 2019; Rahman et al. 2021). In the body, biomarkers span a wide range of biomolecules, such as small-molecule metabolites, covalent complexes of chemical molecules with proteins and DNA, and proteins produced by downstream gene expression. The progress in understanding human diseases that followed human genome projects, together with technological advances in functional proteomics, next-generation high-throughput technologies, microarrays, and genomics, is needed to identify complex biomarkers in disease mechanisms such as cancer, neurodegenerative diseases, and metabolic dysfunctions. Recently, bioinformatics and spectrometry have been used to characterize different biomarkers through protein expression profiling of biological samples like urine, tears, saliva, sweat, cerebrospinal fluid, serum, and nipple aspirate fluid.

Denny-Gouldson (2007) highlighted the emergence of possible solutions in drug discovery through the use of technologies and data intelligence tools for validating and identifying biomarkers. The author noted that drug discovery is expanding because of the strong interest of pharmaceutical companies in automation and in technological developments like genomics, metabolomics, and proteomics (Dash et al. 2017; Rahman et al. 2018; Sahu et al. 2018; Dash et al. 2020; Dash et al. 2021).

Craig et al. (2013) described the use of several omics technologies to identify and discover novel biomarkers in patients with inflammatory lung diseases. The authors reported that proteins, genes, metabolites, and lipids in the urine, blood, and lungs are quantified with primary techniques like metabolomics, lipidomics, transcriptomics, and proteomics. Identifying biomarkers through these techniques will support clinical interpretation, diagnosis, surveillance, and therapy of the disease through a specific molecular pathway.

Ahmed et al. (2004) reported that using proteomics for clinical purposes like biomarker discovery has been very challenging. The authors noted that biological fluids are the main target for biomarker-based prognosis, diagnosis, and therapy because tissues release proteins into them in both healthy and diseased states. Some of these proteins may therefore be disease-linked biomarkers that proteomic techniques can detect and use to generate diagnostic information. Several body fluids, such as cerebrospinal fluid, pleural fluid, bile, urine, blood, and saliva, are reported to contain biomarkers (Adetunji et al. 2022l; Adetunji et al. 2022m).

Priyanka et al. (2007) noted that the use of proteomics to monitor protein expression in tumor tissue is attracting significant attention because of its tremendous potential for discovering biomarkers that can be used in the diagnosis of cancer. The authors described several proteomic techniques and tools, such as two-dimensional polyacrylamide gel electrophoresis (2D-PAGE), surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-ToF-MS), 2D-DIGE, protein arrays, iTRAQ, multidimensional protein identification technology (MudPIT), and isotope-coded affinity tags (ICAT). These techniques are suitable for different samples like cell lysates, serum, cell secretomes, plasma, nipple aspirate fluid, and tumor tissue, and they are used to analyze the molecular basis of cancer pathophysiology through the characterization and validation of disease-linked proteins. Recent advances in imaging, electrophoresis, protein array-based methods, protein labelling, bioinformatics, genomics, and spectrometry have provided great opportunities for biomarker discovery. Biomarkers play an essential role in medical intervention. Thus, the use of proteomics techniques, bioinformatics, and machine learning to characterize and identify vital proteins in health and disease has been explored for biomarker discovery and treatment interventions. In machine learning, algorithms and computational techniques for large databases of biological samples have many applications in the biomedical sciences.

Beyzanur et al. (2015) reported that developments in computer software and hardware facilitate the distribution, storage, and analysis of big data from biological samples. The authors noted that these computer platforms, statistical techniques, and algorithms are integral to bioinformatics analysis and are used to solve complex challenges in large biological databases. Today, transcriptomics, genomics, and proteomics are used for large-scale, high-throughput analysis of biological samples and are combined with bioinformatics techniques for biomarker design, development, and discovery. These techniques are used for gene expression analysis and for biomarker identification, validation, and quantification.

Kenneth and Laura (2012) highlighted that in biomarker assessment, discovery, quantification, validation, and standardization, various bioinformatics tools and molecular techniques provide critical support and aid market acceptance in clinical practice. The authors noted that, over the last few years, bioinformatics techniques have been used for biomarker implementation and development. In biomarker development, mathematical tools, algorithms, statistical packages, and informatics are used to establish a strong association between biomarkers and clinical endpoints or features. A primary way to understand disease conditions is to study the biomarkers involved, which supports evidence-based diagnosis and the assessment of the efficacy and safety of interventions for better outcomes. Studies have shown that, as the population has increased, the prevalence of infectious and chronic diseases has risen tremendously. The discovery of novel biomarkers through genomic research can accelerate improvements in health care and its costs, but some barriers can hinder biomarker development, research, discovery, and commercialization. Today, biomarker discovery integrates genomics and bioinformatics tools to develop biomarker-based diagnostic methods within the practice of personalized medicine. Biomarkers facilitate the understanding of diseases and the processes involved, treatment response, and disease variation in patients.

Francisco et al. (2009) reported that using computational models to discover cardiovascular biomarkers has translated biological knowledge into significant breakthroughs in clinical practice. The authors noted that omics technology is a predictive platform that integrates all data for screening, prognosis, and diagnosis. Diagnostic biomarkers for cardiovascular diseases include troponin T and I for myocardial infarction and brain natriuretic peptides for heart failure. In cancer cells and in normal cells such as those of the vasculature, fibroblasts, and the immune system, differentiation and action processes result in complex cellular signalling systems. In silico approaches are capable of profiling and computationally extracting the cell types involved in cell-to-cell interaction. The development of single-cell omics technology offers tremendous potential for high-resolution profiling of immune cells in patients with accurate prediction levels, thus improving the understanding of therapy-resistant and therapy-responsive phenotypes and enhancing the discovery of immune-modulatory agents.
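
The association between a biomarker and a clinical endpoint mentioned in this section can be quantified in many ways. The text does not name a statistic; for a binary endpoint, one widely used summary, assumed here for illustration, is the area under the ROC curve (AUC), which equals the probability that a randomly chosen case has a higher marker value than a randomly chosen control. The sketch below uses made-up marker values, not measurements from any cited study.

```python
def roc_auc(values, labels):
    """AUC as the fraction of case/control pairs ranked correctly (ties count 0.5)."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_value in cases:
        for control_value in controls:
            if case_value > control_value:
                wins += 1.0
            elif case_value == control_value:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical marker concentrations; label 1 = disease, 0 = healthy.
marker = [0.9, 0.8, 0.4, 0.35, 0.2, 0.1]
status = [1, 1, 0, 1, 0, 0]

print(round(roc_auc(marker, status), 3))  # prints 0.889
```

An AUC near 0.5 indicates a marker that does not separate the two groups, while a value near 1.0 indicates strong separation. The pairwise loop is quadratic in sample size, which is acceptable for a sketch but not for large cohorts.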

Avisek et al. (2013) reported several opportunities and challenges in discovering cancer biomarkers. New biomarker discovery strategies that use high-throughput technologies like transcriptomics, proteomics, and genomics to identify particular populations in drug discovery are fundamental to future modeling and biomarker prediction. Yan et al. (2016) showed that cancer biomarker discovery using biological networks and omics data has the potential to revolutionize cancer prediction, diagnosis, and the understanding of molecular mechanisms.

Francisco et al. (2012) reported cardiovascular biomarker discoveries, such as TNF alpha, plasma vitamin E concentrations, osteoprotegerin, IL-6, LDL cholesterol, antiphosphorylcholine IgM, total antioxidants, and lipoprotein-linked phospholipase A2 profiles, made with systems-based approaches that offer promising clinical use. Cardiovascular diseases like heart failure can be monitored using a brain natriuretic peptide biomarker, and C-reactive protein can be used to evaluate coronary artery disease. Multidimensional complex interactions, such as protein-protein interactions, are implicated in myocardial infarction and inflammation.

Phan et al. (2009) noted that recent developments in biomarker design and discovery, nanotechnology, and bio-computing have generated great opportunities in medicine for disease diagnosis, detection, and treatment, using molecular techniques to evaluate disease progression, development, and clinical outcome. The authors explained that biomarkers reflect a physiological status, process, condition, or event in health, therapy, and disease. Bioinformatics tools are developed to identify proteins, genes, and miRNAs in health and disease. Different assays like imaging, immunohistochemistry, methylation profiling, miRNA profiling, and protein and gene expression are available to identify biomarkers from biological fluids like saliva, urine, serum, nipple aspirate, pancreatic juice, and pleural lavage.

Glaab et al. (2021) recently reviewed the use of machine learning to analyze omics data for biomarker discovery in patients. In their study, laboratory-developed tests were used to characterize biomarkers. The authors used multivariate machine learning approaches to identify multifactorial signatures in disease-linked cellular processes. The machine learning techniques covered include transfer learning, semi-supervised learning, distance metric learning, structured machine learning, meta-learning, generative models, and multi-view learning. The data processing techniques include new dimension reduction approaches, data augmentation techniques, and outlier removal methods. The authors described the model validation methods as bolstered or bootstrapping cross-validation (CV) and uncertainty quantification.
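
Bootstrapping and uncertainty quantification, named above as validation ideas, can be illustrated with a percentile bootstrap interval for a classifier's accuracy. This is a generic sketch rather than the procedure of Glaab et al. (2021); the function name, the default settings, and the toy correctness flags are assumptions made for illustration.

```python
import random


def bootstrap_accuracy_interval(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct` holds one 0/1 flag per test sample (1 = predicted correctly).
    """
    n = len(correct)
    if n == 0:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    stats = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(correct) / n, stats[lower_index], stats[upper_index]


# Hypothetical outcome: 16 of 20 held-out samples classified correctly.
flags = [1] * 16 + [0] * 4
accuracy, lower, upper = bootstrap_accuracy_interval(flags)
print(f"accuracy={accuracy:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

A wide interval signals that the reported accuracy rests on too few samples to be trusted, which is a frequent situation in omics studies with small cohorts.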

Xie et al. (2020) demonstrated that web-server bioinformatics tools are very important in biomarker discovery and development. The authors noted that cancer has become a global public health burden. The modern approach to the identification, diagnosis, and treatment of cancer involves isolating biomarkers that can guide therapy. Cancer transcriptomics data generated with high-throughput microarray and sequencing technologies offer great opportunities for biomarker discovery and validation. Xie et al. (2020) described bioinformatics web servers such as KM plotter, Gene Expression Profiling Interactive Analysis (GEPIA), Tumor Immune Estimation Resource (TIMER), and Oncomine as available tools for analyzing cancer transcriptomic datasets, although these tools have some minor limitations, such as reliance on a single data source and a difficult registration process. These challenges can be overcome by using the web server OSluca for lung cancer. Terkelsen et al. (2020) noted that the use of high-throughput next-generation sequencing and omics technologies for cancer dataset analysis in the biomedical field has grown extensively. The authors presented the CAncer bioMarker Prediction Pipeline (CAMPP), which performs k-means clustering, elastic-net regression, differential abundance and expression analysis, co-expression and correlation network analyses, protein-protein and miRNA-gene interaction network analyses, and survival analysis on quantitative biological data. Gerald et al. (2017) reported that in precision medicine, modern bioinformatics methods are directed toward generating big data in biomedical fields. The authors revealed that high-throughput analytical platforms such as microarrays are integrated into clinical analysis, diagnostics, molecular physiology, pathology, and imaging for biomarker design, discovery, validation, and implementation in health and disease. Biomarkers play an important role in medicine and the health care system. Intermediate metabolites are used to study a particular organ and to identify potential new biomarkers for a specific diagnosis. There are different types of biomarkers: predictive, diagnostic, therapeutic, and prognostic. Microarray gene expression data and machine learning strategies are used for the validation, screening, monitoring, prediction, and identification of genes as advanced diagnostic indicators. Chloé-Agathe Azencott (2020) reported that the genomic differences between individuals are responsible for variation in response, treatment, and
predisposition to the disease condition. Advances in gene expression studies, through technical platforms and algorithms, are used to measure intermediate metabolites, small proteins, transcripts, and other biological molecules. Through bioinformatics, large datasets derived from genes, proteins, and metabolites are integrated for computational modeling with wide applications. Yousef et al. (2014) revealed that novel biomarker discovery, analysis, prediction, and validation involve computational biology methods, such as machine learning techniques, for measuring progress in therapeutic and physiological interventions. Several neurological disorders and neurodegenerative diseases, which are difficult to diagnose and understand because of their complex pathophysiology, are now being studied using comprehensive computational tools to analyze cerebrospinal fluid biomarkers and proteins such as Aβ42. Cerebrospinal fluid is known to transport metabolites, proteolytic fragments, neurotransmitters, and cellular products. It also contains serum albumin, immunoglobulin, haptoglobin, and transferrin isoforms that can be analyzed using bioinformatics tools for biomarker discovery. Anandaram (2017) reported that the application of bioinformatics, including unsupervised clustering techniques in machine learning, visualization tools, microarray data analysis, and supervised analysis, together with nanotechnology, to biomarker discovery in personalized medicine for the detection, diagnosis, and treatment of cancer in the genomic era is receiving wide interest among biomedical scientists. Novel opportunities are available through the detection of candidate biomarkers in early-stage cancer development using two-dimensional polyacrylamide gel electrophoresis, a low-throughput proteomics approach. Zhenhua Li et al. (2019) revealed that bioinformatics analysis of data from techniques such as the mRNA microarray provides valuable insight into the mechanisms of lung cancer pathogenesis through the identification of specific biomarkers. Jae-Wook Oh et al. (2020) studied diabetic nephropathy, where techniques such as two-dimensional gel electrophoresis, two-dimensional differential gel electrophoresis, and mass spectrometry are combined with bioinformatics tools for novel biomarker discovery. Dysfunctional glucose metabolism is the hallmark of diabetes mellitus, causing increased blood glucose levels; omics approaches and bioinformatics can be used to predict biomarkers that affect vascular calcification and to improve other therapeutic processes in diabetes mellitus. Proteomics techniques are used in large-scale protein interaction studies to gain a better understanding of disease pathophysiology and the potential
discovery of novel biomarkers to aid drug development. The proteins are located in different body fluids with specific disease applicability. Mona et al. (2020) reported that lung cancer is estimated to be the leading cause of cancer death in the United States, hence the need to use computational biology to identify biomarker genes for lung cancer that could inform treatment strategies. Similarly, Xiaoyu et al. (2021) revealed that breast cancer-related death in women is becoming a serious issue globally, so there is a need to accelerate the identification of molecular targets for biomarker discovery and therapy through computational methods such as the Kaplan–Meier tool and the Decision Trunk Classifier applied to mRNA microarray datasets. Bent et al. (2020) showed the importance of digital biomarker discovery, using an open-source software package to develop digital biomarkers from wearable and mobile health data. The digital health system is expanding rapidly to improve health care outcomes. Digital biomarkers can be used to diagnose several chronic diseases through computational approaches such as the Digital Biomarker Discovery Pipeline. Recently, surface-enhanced laser desorption and ionization has been used for biomarker discovery in proteomics. The computational power of bioinformatics combined with clinical data has great potential for biomarker discovery in personalized medicine. Semih Dalkılıç (2020) showed that novel biomarkers in gastric cancer can be discovered by applying bioinformatics tools to transcriptomics data. That report identified gastric cancer genes such as GAST, GKN2, GIF, GKN1, HRASLS2, SCGB2A1, SFRP2, CHI3L1, EGR1, COL8A1, INHBA, NEAT1, CXCL8, and MYL9 as candidate biomarker targets.
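
The Kaplan–Meier tool mentioned above rests on the product-limit estimator of the survival function. The following is a minimal Python sketch of that estimator, assuming hypothetical follow-up times and event indicators; it is not the analysis performed by Xiaoyu et al. (2021), and the function name and toy data are illustrative only. At each time with at least one observed event, the running survival probability is multiplied by (1 - events / number at risk); censored samples leave the risk set without lowering the curve.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate of survival.

    durations: follow-up time for each patient.
    events: 1 if the event (e.g. death) was observed, 0 if the patient was censored.
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events observed exactly at time t
        leaving = 0    # samples leaving the risk set at time t (events + censored)
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            survival *= 1.0 - observed / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


if __name__ == "__main__":
    # Hypothetical groups split by expression of a candidate biomarker gene.
    high = kaplan_meier([5, 8, 12, 12, 15, 20], [1, 0, 1, 1, 0, 1])
    low = kaplan_meier([10, 18, 22, 25, 30, 30], [0, 1, 0, 1, 0, 0])
    for name, curve in (("high expression", high), ("low expression", low)):
        print(name, [(t, round(s, 3)) for t, s in curve])
```

Comparing the two curves is the basic idea behind survival-based biomarker screening; a full analysis would also test whether the difference is statistically significant, for example with a log-rank test.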

Conclusion

The application of biological markers in predicting diseases and in the early detection of disease has improved the landscape of medicine. New biomarkers are constantly being discovered to improve on existing ones. Today, biomarker discovery integrates genomics and bioinformatics tools to develop biomarker-based diagnostic methods within the practice of personalized medicine. These biological markers have tremendously improved the diagnosis and management of diseases.

References

Adetunji, C. O., Nwankwo, W., Olayinka, A. S., Olugbemi, O. T., Akram, M., Laila, U., Samuel, M. O., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E. and Esiobu, N. D. (2022a). Computational Intelligence Techniques for Combating COVID-19. doi: 10.1201/9781003178903-16. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics (pp. 251-269). CRC Press.
Adetunji, C. O., Nwankwo, W., Olayinka, A. S., Olugbemi, O. T., Akram, M., Laila, U., Olugbenga, M. S., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E. and Esiobu, N. D. (2022b). Machine Learning and Behaviour Modification for COVID-19. doi: 10.1201/9781003178903-17. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. 1st Edition. First Published 2022. Imprint CRC Press. Pages 17. eBook ISBN 9781003178903.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Garipova, L. and Shariati, M. A. (2022c). eHealth, mHealth, and Telemedicine for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_10.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Petukhova, E. and Shariati, M. A. (2022d). Machine Learning Approaches for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_8.
Adetunji, C. O., Bodunrinde, R. E., Inobeme, A., Singh, K. R., Mathew, J. T., Olaniyan, O. T., Abimbola, O. F., Singh, J., Nayak, V. and Singh, R. P. (2022e). Microbial Community Analysis of Contaminated Soils. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 83-97). CRC Press. https://doi.org/10.1201/9781003354147.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Isabekova, O. and Shariati, M. A. (2022e). Smart Sensing for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_9.
Adetunji, C. O., Inobeme, A., Singh, K. R., Bodunrinde, R. E., Mathew, J. T., Olaniyan, O. T., Abimbola, O. F., Singh, J., Nayak, V. and Singh, R. P. (2022f). Genomic Analysis of Heavy Metal-Resistant Genes in Wastewater Treatment Plants. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 113-126). CRC Press.
Adetunji, C. O., Inobeme, A., Tadso, J., Olaniyan, O. T., Abimbola, O. F., Shahnawaz, M., & Anani, O. (2022g). Potential of Plastic Waste in Enhancing the level of Pathogenicity of diverse Pathogens in the Marine Biota. In Impact of Plastic Waste on the Marine Biota (pp. 301-312). Springer, Singapore.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Petukhova, E. and Shariati, M. A. (2022h). Internet of Health Things (IoHT) for COVID-19. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_5.
Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Koriagina, N. and Shariati, M. A. (2022i). Diverse Techniques Applied for Effective Diagnosis of COVID-19. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_3.
Adetunji, C. O., Olugbemi, O. T., Akram, M., Laila, U., Samuel, M. O., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E., Esiobu, N. D., Oyedara, O. O. and Adeyemi, F. M. (2022j). Application of Computational and Bioinformatics Techniques in Drug Repurposing for Effective Development of Potential Drug Candidate for the Management of COVID-19. doi: 10.1201/9781003178903-15. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. 1st Edition. First Published 2022. Imprint CRC Press. Pages 14. eBook ISBN 9781003178903.
Adetunji, C. O., Samuel, M. O., Adetunji, J. B. and Oluranti, O. I. (2022k). Corn Silk and Health Benefits. doi: 10.1201/9781003178903-11. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903.
Adetunji, C. O., Ogundolie, F. A., Ajiboye, M. D., Mathew, J. T., Inobeme, A., Dauda, W. P., & Adetunji, J. B. (2022l). Nano-engineered Sensors for Food Processing. In Bio- and Nano-sensing Technologies for Food Processing and Packaging (pp. 151-166). Royal Society of Chemistry. doi: 10.1039/9781839167966-00151.
Adetunji, C. O., Abimbola, O. F., Singh, K. R., Olaniyan, O. T., Bodunrinde, R. E., Inobeme, A., Mathew, J. T., Singh, J. and Singh, R. P. (2022m). Microbe Performance and Dynamics in Activated Sludge Digestion. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 99-112). CRC Press. https://doi.org/10.1201/9781003354147.
Adetunji, Oluwaseun C., Mathew, J. T., Inobeme, A., Olaniyan, O. T., RB Singh, K., Abimbola, O. F., Nayak, V., Singh, J. & Singh, R. P. (2022n). Microbial and Plant Cell Biosensors for Environmental Monitoring. In Nanobiosensors for Environmental Monitoring (pp. 175-190). Springer, Cham. https://doi.org/10.1007/978-3-031-16106-3_9.
Adetunji, C. O., Ogundolie, F. A., Mathew, J. T., Inobeme, A., Titilayo, O., Olaniyan, O. T., Ijabadeniyi, O. A., Ajiboye, M. D., Ajayi, O. O., Dauda, W. and Ghazanfar, S. (2023a). Graphene-based nanomaterials for targeted drug delivery and tissue engineering. Novel Platforms for Drug Delivery Applications, pp. 277-288. https://doi.org/10.1016/B978-0-323-91376-8.00014-8.
Adetunji, C. O., Ogundolie, F. A., Mathew, J. T., Inobeme, A., Titilayo, O., Olaniyan, O. T., Ghazanfar, S., Ijabadeniyi, O. A., Ajiboye, M. D., Ajayi, O. O. and Dauda, W. (2023b). Nanotube platforms for effective drug delivery applications. Novel Platforms for Drug Delivery Applications, pp. 317-332. https://doi.org/10.1016/B978-0-323-91376-8.00005-7.
Ahmed, N., Barker, G., Oliva, K. T., Hoffmann, P., Riley, C., Reeve, S., Smith, A. I., Kemp, B. E., Quinn, M. A., Rice, G. E. (2004). Proteomic-based identification of haptoglobin-1 precursor as a novel circulating biomarker of ovarian cancer. Br J Cancer, 91, 129-140.
Almeslet, A., Alnamlah, S., Alanzan, L., Aldriwesh, R., & AlWehaiby, S. E. (2022). Role of Salivary Biomarkers in Cystic Fibrosis: A Systematic Review. BioMed Research International, 2022.
Anandaram Harishchander (2017). A review on application of biomarkers in the field of bioinformatics & nanotechnology for individualized cancer treatment. MOJ Proteomics & Bioinformatics, 5(6), 179-184. doi: 10.15406/mojpb.2017.05.00179.
Avisek Deyati, Erfan Younesi, Martin Hofmann-Apitius and Natalia Novac (2013). Challenges and Opportunities for oncology biomarker discovery. Drug Discovery Today, 18: 614-624. http://dx.doi.org/10.1016/j.drudis.2012.12.011.
Awdishu, L., Tsunoda, S., Pearlman, M., Kokoy-Mondragon, C., Ghassemian, M., Naviaux, R. K., Patton, H. M., Mehta, R. L., Vijay, B. and RamachandraRao, S. P. (2019). Identification of maltase glucoamylase as a biomarker of acute kidney injury in patients with cirrhosis. Critical Care Research and Practice, 2019.
Ayodeji, A. O., Ogundolie, F. A., Bamidele, O. S., Kolawole, A. O., & Ajele, J. O. (2017). Raw starch degrading, acidic-thermostable glucoamylase from Aspergillus fumigatus CFU-01: purification and characterization for biotechnological application. J Microbiol Biotechnol, 6, 90-100.
Behera, R. N., Roy, M., & Dash, S. (2016). Ensemble-based hybrid machine learning approach for sentiment classification-a review. International Journal of Computer Applications, 146(6), 31-36.
Bent B., Wang K., Grzesiak E., Jiang C., Qi Y., Jiang Y., Cho P., Zingler K., Ogbeide F. I., Zhao A., Runge R., Sim I., and Dunn J. (2020). The digital biomarker discovery pipeline: An open-source software platform for the development of digital biomarkers using mHealth and wearables data. Journal of Clinical and Translational Science 5: e19, 1-8. doi: 10.1017/cts.2020.511.
Beyzanur Y., Eyyup U., Ramazan Y., Esra G. and Mehmet G. (2015). Bioinformatics in Breast Cancer Research. INTECH. Chapter 7. 175-185. http://dx.doi.org/10.5772/59519.
Chen, X. H., Huang, S., & Kerr, D. (2011). Biomarkers in clinical medicine. IARC Scientific Publications, 163, 303-322.
Chloé-Agathe Azencott (2020). Machine learning tools for biomarker discovery. Machine Learning [stat.ML]. Sorbonne Université UPMC, 2020. tel-02354924v2.
Churko, J. M., Mantalas, G. L., Snyder, M. P., and Wu, J. C. (2013). Overview of high throughput sequencing technologies to elucidate molecular pathways in cardiovascular diseases. Circulation Research, 112(12), 1613-1623.
Craig E. W., Victoria M. G., David B., Ben N., Joost B., Paul J. S., Stuart S., Dominic B., Arnaldo D., Ildiko H., Amphun C., Hassan A., Stephane B., Christos R., Kian F. C., Paolo M., Stephen J. F., Ian M. A., Anthony D. P., Sven-Erik D., Anthony R., Peter J. S., Charles A., Ratko D. (2013). Application of ’omics technologies to biomarker discovery in inflammatory lung diseases. Eur Respir J, 42: 802-825. doi: 10.1183/09031936.00078812.
Dash, S., Abraham, A., Atta-ur-Rahman (2018). Kernel-based chaotic firefly algorithm for diagnosing Parkinson’s disease. In: Madureira, A., Abraham, A., Gandhi, N., Varela, M. (eds) Advances in Intelligent Systems and Computing, 923, pp. 178-188. Springer Nature.
Dash, S., Abraham, A., Luhach, A. K., Mizera-Pietraszko, J., & Rodrigues, J. J. (2020). Hybrid chaotic firefly decision-making model for Parkinson’s disease diagnosis. International Journal of Distributed Sensor Networks, 16(1), 1550147719895210.
Dash, S., Ahmad, M., & Iqbal, T. (2021). Mobile cloud computing: a green perspective. In Intelligent Systems, vol. 185, pp. 523-533. Springer, Singapore. http://doi.org/10.1007/978-981-33-6081-5-46.
Dash, S., Thulasiram, R., & Thulasiraman, P. (2017, December). An enhanced chaos-based firefly model for Parkinson’s disease diagnosis and classification. In 2017 International Conference on Information Technology (ICIT) (pp. 159-164). IEEE Xplore. doi: 10.1109/ICIT.2017.43.
Dash, S., Thulasiram, R., & Thulasiraman, P. (2019). Modified firefly algorithm with chaos theory for feature selection: A predictive model for medical data. International Journal of Swarm Intelligence Research (IJSIR), 10(2), 1-20.
Denny-Gouldson P. (2007). ‘Tools of the trade’ for target identification, validation and biomarker discovery. Informatics, Drug Discovery World Summer. Pages 71-76.
Fadare, O. A., Omisore, N. O., Adegbite, O. B., Awofisayo, O. A., Ogundolie, F. A., Adesanwo, J. K., & Obafemi, C. A. (2021). Structure based design, stability study and synthesis of the dinitrophenylhydrazone derivative of the oxidation product of lanosterol as a potential P. falciparum transketolase inhibitor and in-vivo antimalarial study. In Silico Pharmacology, 9(1), 1-16.
Francisco A., Yvan D. and Daniel W. (2009). Computational biology for cardiovascular biomarker discovery. Briefings in Bioinformatics, 10(4): 367-377. doi: 10.1093/bib/bbp008.
Francisco J. A., Frederick E. D., Dirk L. B., Yvan D., Euan A. A., Daniel R. W. (2012). Systems-Based Approaches to Cardiovascular Biomarker Discovery. Circ Cardiovasc Genet, 5: 360-367. doi: 10.1161/CIRCGENETICS.112.962977.
Gerald L., Peter B., Philip D. D., Paul G. O., Jacqueline A. J., Manuel S., Peter W. H. and Darragh G. M. (2017). Embracing an integromic approach to tissue biomarker research in cancer: Perspectives and lessons learned. Briefings in Bioinformatics, 18(4), 634-646. doi: 10.1093/bib/bbw044.
Glaab E., Rauschenberger A., Banzi R., Chiara Gerardi, Paula Garcia, Jacques Demotes, the PERMIT Group (2021). Biomarker discovery studies for patient stratification using machine learning analysis of omics data: a scoping review. BMJ Open 2021; 11: e053674. doi: 10.1136/bmjopen-2021-053674.
Goossens, N., Nakagawa, S., Sun, X., & Hoshida, Y. (2015). Cancer biomarker discovery and validation. Translational Cancer Research, 4(3), 256-269. https://doi.org/10.3978/j.issn.2218-676X.2015.06.04.
Guglielminotti, J., Dehoux, M., Mentré, F., Bedairia, E., Montravers, P., Desmonts, J. M., & Longrois, D. (2012). Assessment of salivary amylase as a stress biomarker in pregnant patients. International Journal of Obstetric Anesthesia, 21(1), 35-39. http://dx.doi.org/10.4236/jilsa.2014.64012.
Imran H., Habibur R., Islam M. B., Zahidul I., Arju H., Mohammad A. M. (2022). Systems Biology and Bioinformatics approach to Identify blood-based signatures molecules and drug targets of a patient with COVID-19. Informatics in Medicine Unlocked 28, 100840, 1-12. https://doi.org/10.1016/j.imu.2021.100840.
Kenneth P. H. P. and Laura B. P. (2012). Bioinformatics advances for clinical biomarker development. Expert Opin. Med. Diagn., 6(1): 39-48.
Ko, J., Cho, J., & Petrov, M. S. (2020). Low serum amylase, lipase, and trypsin as biomarkers of metabolic disorders: a systematic review and meta-analysis. Diabetes Research and Clinical Practice, 159, 107974.
Manzanares, J., Sala, F., Gutiérrez, M. S. G., & Rueda, F. N. (2021). Biomarkers. Reference Module in Biomedical Sciences. https://doi.org/10.1016/B978-0-12-820472-6.00060-8.
Mayeux, R. (2004). Biomarkers: potential uses and limitations. NeuroRx, 1(2), 182-188. https://doi.org/10.1602/neurorx.1.2.182.
Mona Maharjan, Raihanul Bari Tanvir, Kamal Chowdhury, Wenrui Duan and Ananda Mohan Mondal (2020). Computational identification of biomarker genes for lung cancer considering treatment and non-treatment studies. BMC Bioinformatics 2020, 21(Suppl 9): 218. https://doi.org/10.1186/s12859-020-3524-8.
Nortey, L. N., Anning, A. S., Nakotey, G. K., Ussif, A. M., Opoku, Y. K., Osei, S. A., Aboagye, B., & Ghartey-Kwansah, G. (2022). Genetics of cerebral malaria: pathogenesis, biomarkers and emerging therapeutic interventions. Cell & Bioscience, 12(1), 1-19.
Ogundolie, F. A. (2015). Characterization of a Purified β-Amylase from Black Marble Vine (Dioclea reflexa) Seeds (Doctoral dissertation, Federal University of Technology, Akure).
Ogundolie, F. A. (2021). Cloning of α-Amylase and Pullulanase Genes of Bacillus licheniformis-FAO. CP7 from cocoa (Theobroma cacao L.) Pods and Biochemical Characterization of the Expressed Enzymes (Doctoral dissertation, Federal University of Technology, Akure).
Ogundolie, F. A., Ayodeji, A. O., Olajuyigbe, F. M., Kolawole, A. O., & Ajele, J. O. (2022). Biochemical Insights into the functionality of a novel thermostable β-amylase from Dioclea reflexa. Biocatalysis and Agricultural Biotechnology, 42, 102361.
Oh, J. W., Muthu, M., Haga, S. W., Anthonydhason, V., Paul, P., & Chun, S. (2020). Reckoning the Dearth of Bioinformatics in the Arena of Diabetic Nephropathy (DN)—Need to Improvise. Processes, 8(7), 808. doi: 10.3390/pr8070808.
Olaniyan Olugbemi T., Adetunji Charles O., Adeniyi Mayowa J., Hefft Daniel Ingo (2022a). Machine Learning Techniques for High-Performance Computing for IoT Applications in Healthcare. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics. doi: 10.1201/9780367548445-20. 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445.
Olaniyan Olugbemi T., Adetunji Charles O., Adeniyi Mayowa J., Hefft Daniel Ingo (2022b). Computational Intelligence in IoT Healthcare. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics. doi: 10.1201/9780367548445-19. 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445.
Omotayo O. O., Folasade M. A., Charles O. A., Temidayo O. E. (2022). Repositioning Antiviral Drugs as a Rapid and Cost-Effective Approach to Discover Treatment against SARS-CoV-2 Infection. doi: 10.1201/9781003178903-10. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903.
Ou, F. S., Michiels, S., Shyr, Y., Adjei, A. A., & Oberg, A. L. (2021). Biomarker discovery and validation: statistical considerations. Journal of Thoracic Oncology, 16(4), 537-545.
Paolo Vineis, Karin van Veldhoven, Marc Chadeau-Hyam, and Toby J. Athersuch (2013). Advancing the Application of Omics-Based Biomarkers in Environmental Epidemiology. Environmental and Molecular Mutagenesis 54: 461-467. doi 10.1002/em.
Park, J., Jeong, D., Chung, Y. W., Han, S., Kim, D. H., Yu, J., & Ryu, J. H. (2021). Proteomic analysis-based discovery of a novel biomarker that differentiates intestinal Behcet’s disease from Crohn’s disease. Scientific Reports, 11(1), 1-12.
Phan John H., Richard A. Moffitt, Todd H. Stokes, Jian Liu, Andrew N. Young, Shuming Nie and May D. Wang (2009). Convergence of biomarkers, bioinformatics and nanotechnology for individualized cancer treatment. Trends Biotechnol, 27(6): 350-358. doi: 10.1016/j.tibtech.2009.02.010.
Priyanka Maurya, Paula Meleady, Paul Dowling and Martin Clynes (2007). Proteomic Approaches for Serum Biomarker Discovery in Cancer. Anticancer Research 27: 1247-1256.
Rahman, A. U., Dash, S., & Luhach, A. K. (2021). Dynamic MODCOD and power allocation in DVB-S2: a hybrid intelligent approach. Telecommunication Systems, 76(1), 49-61.
Rahman, A., Sultan, K., Dash, S., & Khan, M. A. (2018). Management of resource usage in mobile cloud computing. Int J Pure Appl Math, 119(16), 255-261.
Sahu, B., Dash, S., Mohanty, S. N., & Rout, S. K. (2018). Ensemble comparative study for the diagnosis of breast cancer datasets. International Journal of Engineering & Technology, 7(4.15), 281-285.
Semih Dalkılıç (2020). Analysis of Gastric Cancer Transcriptomic Data by Bioinformatics Tools and Detection of Candidate Diagnostic Biomarker Genes. Progress in Nutrition 2020; Vol. 22, Supplement 2: e2020003. doi: 10.23751/pn.v22i2-S.10179.
Sheng, K. L., Kang, L., Pridham, K. J., Dunkenberger, L. E., Sheng, Z., & Varghese, R. T. (2020). An integrated approach to biomarker discovery reveals gene signatures highly predictive of cancer progression. Scientific Reports, 10(1), 1-15.
Sun, A. L., Jia, F. C., Zhang, Y. F., & Wang, X. N. (2015). Gold nanocluster-encapsulated glucoamylase as a biolabel for sensitive detection of thrombin with glucometer readout. Microchimica Acta, 182(5), 1169-1175.
Takai, N., Yamaguchi, M., Aragaki, T., Eto, K., Uchihashi, K., & Nishikawa, Y. (2007). Gender-specific differences in salivary biomarker responses to acute psychological stress. Annals of the New York Academy of Sciences, 1098(1), 510-515.
Terkelsen T., Krogh A., Papaleo E. (2020). CAncer bioMarker Prediction Pipeline (CAMPP)—A standardized framework for the analysis of quantitative biological data. PLoS Comput Biol 16(3): e1007665. https://doi.org/10.1371/journal.pcbi.1007665.
Wang, X., He, Q., Chen, Q., Xue, B., Wang, J., Wang, T., Liu, H., & Chen, X. (2021). Network pharmacology combined with metabolomics to study the mechanism of Shenyan Kangfu Tablets in the treatment of diabetic nephropathy. Journal of Ethnopharmacology, 270, 113817.
Xiaoyu Z., Gaoli S., Qiankun H. and Pingping Z. (2021). Screening and predicted value of potential biomarkers for breast cancer using bioinformatics analysis. Scientific Reports 11: 20799. https://doi.org/10.1038/s41598-021-00268-9.
Xie L., Wang L., Zhu W., Zhao J. and Guo X. (2020). Editorial: Bioinformatics Tools (and Web Server) for Cancer Biomarker Development. Front. Oncol. 10: 599085. doi: 10.3389/fonc.2020.599085.
Yan W., Wenjin X., Jiajia C. and Guang H. (2016). Biological Networks for Cancer Candidate Biomarkers Discovery. Cancer Informatics 15(S3): 1-7. doi: 10.4137/CIN.S39458.
Yousef, M., Najami, N., Abedallah, L. and Khalifa, W. (2014). Computational Approaches for Biomarker Discovery. Journal of Intelligent Learning Systems and Applications, 6, 153-161.
Zhenhua L., Meixiang S., Ziqiang T., Zhao L., Jian L., Fan Z. and Baoen S. (2019). Identification of key biomarkers and potential molecular mechanisms in lung cancer by bioinformatics analysis. Oncology Letters 18: 4429-4440. doi: 10.3892/ol.2019.10796.

Chapter 12

A Review on Recent Advances in Different Modelling Techniques, Algorithms, and Software for Metabolic Pathways Analysis in System Biology

Manish Paul1, Saikat Chakrabarti2 and Amrita Banerjee3,*

1 Department of Biotechnology, Maharaja Sriram Chandra Bhanja Deo University, Takatpur, Baripada, Mayurbhanj, Odisha
2 National Centre for Biotechnology Information, Indian Institute of Chemical Biology, Jadavpur, Kolkata, West Bengal
3 Department of Biotechnology, Oriental Institute of Science and Technology, Rangamati, Midnapore, West Bengal, India

* Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9 © 2023 Nova Science Publishers, Inc.

Abstract

Research in in silico modeling and simulation of complex biological systems has become more critical in recent years. In a natural context, systems comprise many interacting elements such as proteins, genes, and metabolites. Mathematical modeling has typically been used to study these systems. However, this modeling has limitations in structural understanding and behavioural study. Extensive computational modeling is an approach that allows the modelling and simulation of these systems. To date, many computational modeling tools are available to study and verify biological pathways. In the past few years, there has been a considerable effort in the computer science community to develop computational languages and software tools for modeling and analyzing biochemical systems. In addition, computer simulation aids traditional biological research by allowing biologists to run efficient simulations based on the data obtained in wet experiments and to generate new hypotheses, which can later be verified in additional wet experiments. Computer simulations can run experiments in which several species are monitored at the same time, to explore various conditions and, in some cases, to observe the behaviour of the system in more detail than experimental techniques allow. A set of simulation studies has been proposed to analyze various biological pathways related to severe diseases such as Alzheimer’s disease, cardiovascular disease, cancer, and others. This chapter presents an overview of modeling methods and parameter estimation for different metabolic pathways of biological systems using simulation software. The chapter also reviews the routine work employed in mathematical biology and bioinformatics to describe genetic regulatory systems.

Keywords: biological systems, metabolic pathways, bioinformatics, mathematical modeling, computer simulation

Introduction

Advances in molecular biosciences have brought many improvements. These include combined molecular and system-level approaches developed through biological research in systems biology. This scientific branch incorporates numerous types of molecular knowledge, which can be obtained through the combined use of models and experimental data (Bruggeman et al., 2007; Lee et al., 2011). The systems biology approach encompasses the investigation of the elements involved in cellular networks and their correlations during the progression of a cellular pathway. These approaches implement high-throughput techniques such as whole-genome analysis and different computational techniques that can be combined with advanced experimental methods (Carlo, 2008). The combination of such techniques and the knowledge obtained aids in the understanding of biological processes at a more systematic level. With the help of biological studies, we are learning more about cellular processes at an ever-increasing rate. Understanding and predicting the behaviour of a cell is one of the most significant subjects in systems biology. A software environment that allows biological and medical scientists (users) to model and simulate biological
processes in the cell is particularly essential. Software based modeling and simulation are often performed for different gene regulated metabolic pathways and signal-transduction pathways involved in biological processes. First and foremost, appropriate architecture is required to represent and mimic biological pathways during the performance of this type of software based modelling and simulation. The goals of expanding systems biology are to reveal mechanisms that cause the modification of phenotypes during a disease development and invent novel therapeutics against the disease by economically design cells with desired and reliable properties using modeling.(Carlo, 2008). To achieve these goals, the systems biology field has to be elevated to a new era where the study and analysis of biological systems have to be done in holism rather than reductionism. In other words, it means that the focus ought to be concentrated on the function of organism along with the cellular structure and dynamics instead of the features of isolated cell parts of the organism (Kitano, 2002). According to Wiener (1948) and Bertalanffy (1969), the idea of understanding biological systems at the system level is not new in biology; the breakthroughs in molecular biology have been made possible in previous decades with continuous updates and improved findings. This means that the knowledge of biology is rapidly increasing and the data from previously developed analysis methods are barely able to compatible with it. This is because to keep up with the huge and rapid emergence of data, more sophisticated tools must be developed in the bioinformatics area. Most of the tools mentioned are developed based on Artificial Intelligence (AI) which aims to maximize the success possibility by using intelligent agents (Karaboga et al., 2012). The main purpose of these tools is to group and compare data and then retrieve the related information. 
In short, systems biology discusses the interaction between atoms within a molecule, which is illustrated through modelling and supported by computer hardware to make the explanation more comprehensive. As mentioned, the metabolic pathway needs to be modelled to simulate the process within the cell because of its importance in genetic engineering manipulations (Ceric and Kurtanjek, 2006). In metabolic engineering, three main aspects need to be considered to improve the production of a desired metabolite, namely kinetic modelling, model selection, and parameter estimation. In this paper, we limit the focus to kinetic modelling and parameter estimation. This paper describes kinetic modelling, which uses mathematical modelling methods, and existing parameter estimation algorithms. A comprehensive mathematical model is required for

Manish Paul, Saikat Chakrabarti and Amrita Banerjee

obtaining the optimal solution of the optimization problem (Liu and Chen, 2011). The properties of the model rely on the various kinetic parameters that control the saturation constants, reaction rates, inhibition effects, and strong activation of the metabolites and cofactors distributed throughout the network (Ceric and Kurtanjek, 2006). Based on the predicted kinetic profile, the optimal kinetic parameter values can be obtained through the parameter estimation process. Lillacci and Khammash (2010) stated that the deterministic approach is more suitable if the model uses metabolite concentrations and reaction rates as variables instead of probabilities. Signal transduction is a non-linear protein signalling mechanism that involves a complex network of multifunctional connections.

There are several definitions of systems biology, but we use the term “biology for the future” to describe the integration of biological data using computational biology methods at the system level to design the new biology (Aderem, 2005; Westerhoff, 2005). In all these domains, systems biology is becoming a global language. Studies need system-level thinking and a thorough grasp of fundamental biology and biochemistry in order to analyze massive amounts of biological data and comprehend the entire biological system. When creating a quantitative model that can describe the entire system and anticipate its responses, systems biologists treat the experimental data statistically.

Computer simulation is an effective approach for complex system design and analysis. Its general goal is to capture the dynamic properties of a real-world system in a computer model. Experiments are conducted on the model in order to obtain information that can be used to make informed decisions about the features of the real system. Simulations are appropriate for problems that do not have closed-form analytical solutions.
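The deterministic kinetic modelling and parameter estimation described above can be illustrated with a minimal sketch. It assumes a single irreversible reaction that follows Michaelis-Menten kinetics, integrates it with a forward Euler scheme, and recovers the kinetic parameters by a brute-force least-squares search. The rate law, the integrator, the grid search, and every name and numerical value below are illustrative assumptions, not the methods of the works cited.

```python
def simulate_substrate(vmax, km, s0, dt, n_steps):
    """Integrate dS/dt = -vmax * S / (km + S) with forward Euler."""
    s = s0
    trajectory = [s]
    for _ in range(n_steps):
        rate = vmax * s / (km + s)
        s = max(s - rate * dt, 0.0)  # a concentration cannot become negative
        trajectory.append(s)
    return trajectory


def sum_squared_error(predicted, observed):
    """Least-squares mismatch between a simulated and an observed profile."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def estimate_parameters(observed, s0, dt, vmax_grid, km_grid):
    """Return the (vmax, km) pair from the grids that best fits the data."""
    n_steps = len(observed) - 1
    best = None
    for vmax in vmax_grid:
        for km in km_grid:
            simulated = simulate_substrate(vmax, km, s0, dt, n_steps)
            error = sum_squared_error(simulated, observed)
            if best is None or error < best[0]:
                best = (error, vmax, km)
    return best[1], best[2]


# Synthetic "measurements" generated from known, hypothetical parameters.
observed = simulate_substrate(vmax=1.0, km=0.5, s0=2.0, dt=0.1, n_steps=50)
vmax_hat, km_hat = estimate_parameters(
    observed, s0=2.0, dt=0.1,
    vmax_grid=[0.5, 1.0, 1.5], km_grid=[0.25, 0.5, 1.0],
)
```

Because the data are noise-free and the true values lie on the grid, the search returns them exactly. With real measurements, a gradient-based or evolutionary optimiser would normally replace the grid search.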
Computer simulation is a strong and versatile tool in the analysis of complex systems, since most dynamic problems in practice cannot be adequately described and solved analytically using mathematical equations. Most biological data are substantial in both size and complexity. Systems bioinformatics, a part of systems biology, can help to arrange such vast and complex data in an organized manner. Understanding the inherent mechanisms and principles of biochemical systems is one of the main tasks when modelling such systems. To investigate a biochemical system of interest effectively, in silico analysis can be performed to reveal and formalise the underlying cellular functions and biochemical processes.

Two different but complementary methods, quantitative and qualitative model learning approaches (Balden et al., 2010), can be applied to model biochemical systems. A given cellular system can be described and analyzed mathematically in a quantitative manner until the desired biochemical properties are replicated in a virtual cellular environment, for instance WebCell, a web-based environment for kinetic modelling and dynamic simulation of cellular networks (Lee et al., 2006). Alternatively, a biochemical system can be modelled and identified qualitatively through qualitative model learning (QML) (Pang and Coghill, 2010; 2011; 2014) when only incomplete knowledge and imperfect data are available. The above facts motivate us to develop an integrated qualitative and quantitative model learning framework. This integrated learning framework can improve the learning performance of a model, which will facilitate wet-lab research.

In quantitative modelling approaches, a dynamic biochemical system is represented mathematically to model molecular mechanisms at a quantitative level, and interactions between molecules may be discovered through the modelling process. Further biochemical analysis from wet-lab experiments can be verified with the help of such precise quantitative modelling. In addition, cell-cell interactions can be studied through quantitative simulation or system identification (Ljung, 1998).

Qualitative modelling approaches can be used to extract qualitative information from imprecise and incomplete data in order to model real-world problems, which is also known as qualitative reasoning (QR) (Forbus, 1996; Kuipers, 1994). Continuous aspects of a given dynamic system, for instance space, time, and quantity, can be represented or inferred automatically in QR. In QR-based research, qualitative values such as high, medium, low, zero, positive, and negative can be used to describe complicated dynamic systems instead of precise numerical values.
Therefore, the behaviours of target biochemical systems can be predicted and reasoned about qualitatively in silico with the support of QR (King et al., 2005). Systems biology can be used in multi-omics integration, which has recently gained significant attention in the metabolomics research community. This study area also assists with data management strategies for large-scale multinational projects in biology.
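As a minimal sketch of the qualitative abstraction used in QR, the example below maps a numerical time series onto the qualitative magnitudes (zero, low, medium, high) and qualitative directions of change (positive, zero, negative) mentioned above. The thresholds, the function names, and the sample data are hypothetical, and the values are assumed to be non-negative concentrations; this is not the QML algorithm of the cited works.

```python
def qualitative_magnitude(value, low_threshold, high_threshold):
    """Map a non-negative number onto the scale zero / low / medium / high."""
    if value == 0:
        return "zero"
    if value < low_threshold:
        return "low"
    if value < high_threshold:
        return "medium"
    return "high"


def qualitative_trend(series, tolerance=1e-9):
    """Describe the change between consecutive samples by its sign only."""
    trends = []
    for previous, current in zip(series, series[1:]):
        delta = current - previous
        if delta > tolerance:
            trends.append("positive")
        elif delta < -tolerance:
            trends.append("negative")
        else:
            trends.append("zero")
    return trends


# Hypothetical concentration measurements of one metabolite over time.
concentrations = [0.0, 0.2, 0.9, 2.5, 2.5, 1.1]
magnitudes = [qualitative_magnitude(c, 0.5, 2.0) for c in concentrations]
trends = qualitative_trend(concentrations)
```

Here `magnitudes` becomes zero, low, medium, high, high, medium and `trends` becomes positive, positive, positive, zero, negative, so the behaviour of the series is captured without reference to its precise numerical values.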

Molecular Modeling in System Biology

Studies on the structure of molecular networks and their models have contributed to the discovery of principles of cell functioning that extend beyond a single species. The amount and quality of the molecular data, as well as the aim of modelling, determine which modelling approach is most appropriate. Systems biology can be divided into two main approaches: top-down and bottom-up. Top-down systems biology characterizes cells using system-wide data generated by omics technologies, combined with modelling. These models may be largely phenomenological, but they help to uncover new insights into molecular networks. The bottom-up approach, on the other hand, does not start from data but from a detailed model of a molecular network built on its fundamental molecular properties. With this approach, molecular networks can be studied quantitatively to formulate predictive models that can be applied in drug design and in enhancing the formation of desired products in bioengineering (Bruggeman et al., 2007).

Biological processes are difficult to visualize at the molecular level in both in vivo and in vitro set-ups. In silico molecular modelling and simulation of the biological pathways that maintain proper metabolic reactions in a living organism are therefore indispensable. Molecular modelling is gaining popularity as a result of its numerous applications in various fields of study. It is now frequently used in physics, chemistry, and biology to explore the molecular structure of large systems. Molecular modelling simulates the molecular behaviour of chemical or biological systems. With such a formal model, it is feasible to study the system’s behaviour methodically and, in certain cases, to generate predictions. This type of research is a subset of the larger concept of simulated experiments (also called in silico experiments by biologists and numerical experiments by physicists). These experiments are necessary when in vivo or in vitro investigations are not feasible due to cost, practicality, or ethical concerns.
However, because a computer model is formal, it can be reasoned about, and properties can be inferred from it (such as the presence of a steady state, stability, phase transitions, and so on) that can be tested against natural facts. In general, formal models can have a pedagogical, normative, constructive, or ideological purpose:

• Pedagogical and heuristic: The model is used to produce information about a system or to depict a collection of complicated interactions that occur throughout a biological process.
• Normative: The model is used to compare different systems or as a reference among scientists.
• Constructive: The model is used as a blueprint to create a new biological entity. Biology has progressed to the point where new techniques have been developed to study existing natural phenomena (drug design, metabolic pathways, and genetically modified organisms).
• Ideological: A model displays the biological paradigms and constraints that were researched, as well as the schemes that were investigated.

A variety of computer science concepts, such as programs, memory, information, and control, have been adopted by biology to aid in the development of biological hypotheses. The transfer of concepts and techniques between biology and computer science is not one-way: a computing model inspired by a biological phenomenon frequently leads to a formalism that is then used to describe additional biological processes. An excellent example is the history of cellular automata (CA), which were first established by J. von Neumann and abstracted the idea of a tissue of cells in order to examine the notion of self-replicating programs. Since then, the CA formalism has been widely used in biological simulation, for example in modelling tumour growth (Eden’s models) or ecology (it has also been successful in numerous other application domains, such as physics).

Computational biology’s contributions to molecular dynamics and ecological modelling are now well established. They are mostly concerned with the concept of dynamical systems. This type of computer model now appears capable of linking molecular pathways to a cell’s physiological features. The growing paradigm of gene expression, system dynamics, and cell physiology is becoming increasingly essential as researchers attempt to combine exponentially growing knowledge of all the cell’s components into a meaningful understanding of the cell. However, in the broader area of development, this formalisation from biology to a dynamical system and back to biology has long been recommended (Figure 1).

Biological pathway analysis is a collection of frequently used techniques for life science research that aims to make sense of high-throughput biological data. Pathway analysis, also known as functional enrichment analysis, is quickly becoming one of the most important methods in omics research. The primary goal of pathway analysis tools is to examine data from high-throughput technologies in order to identify meaningful groupings of related genes that are altered in case samples compared with controls. In this way, pathway analysis approaches address the challenge of interpreting the large lists of genes and their differential expression in biological systems, which are the major output of most basic high-throughput data analyses (García-Campos et al., 2015).
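One common form of functional enrichment analysis is over-representation analysis, which asks whether a pathway contains more altered genes than chance alone would explain. The sketch below computes the corresponding one-sided hypergeometric p-value. The choice of the hypergeometric test, the function name, and the gene counts are assumptions made for illustration; the text above does not prescribe a particular statistic.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) for a hypergeometric random variable X.

    n_background: genes measured in the experiment
    n_pathway:    background genes annotated to the pathway
    n_selected:   genes called differentially expressed
    n_overlap:    selected genes that belong to the pathway
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 of 500 altered genes fall in a 100-gene pathway.
p = enrichment_p_value(
    n_background=20000, n_pathway=100, n_selected=500, n_overlap=10
)
```

In practice the test is repeated for every pathway in a knowledge base, so the resulting p-values should be corrected for multiple testing before any pathway is reported as enriched.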

Figure 1. In silico computer modeling in the representation of various biological processes and their different applications.

Several effective attempts to simulate complicated biological processes such as metabolic pathways, gene regulatory networks, and cell signalling pathways have recently been made. The pathway models have produced not only empirically proven hypotheses but also useful insights into the behaviour of complex biological systems. Many recent studies have linked phenotypic variability in organisms to inherent stochasticity that operates at the gene expression level. As a result, the creation of innovative mathematical representations and simulation techniques is important for effective biological modelling efforts. In each case, the aim is to find a physiologically appropriate representation (Tan et al., 2004).

Concerns about Modeling and Simulation

The limited ability of animal cell cultures and models to mimic the complexity of the human body under laboratory conditions has prompted researchers to seek bioelectronic alternatives with improved competence. In this regard, tissue engineering has emerged as one of the most precise biomaterial technologies for creating new tissues to model vital organs (Seidi et al., 2022). Modeling gives a graphical representation of a system to be built. To better understand the basis of the activity of any biologically active molecule, it is important to know how the molecule interacts with its site of action, and more specifically its conformational properties in solution and its orientation for the interaction. Molecular recognition in biological systems relies on specific attractive and/or repulsive interactions between two partner molecules. Modelling assists in identifying such interactions between ligands and their host molecules, typically proteins, given their three-dimensional (3D) structures (Du et al., 2016).

When applying computational methods in drug design, it is necessary to remember that, to be effective, a designed drug must discriminate successfully between the macromolecular target and alternative structures present in the organism. The last few years have witnessed the emergence of different computational tools aimed at understanding and modelling this process at the molecular level. Although still rudimentary, these methods are shaping a coherent approach to designing molecules with high affinity and specificity, both in lead discovery and in lead optimisation. Moreover, current information on the 3D structure of proteins and their functions makes it possible to understand the relevant molecular interactions between a ligand and a target macromolecule (Breda et al., 2007). The basic concepts of modelling and simulation are as follows:

• Object is an entity that exists in the real world to study the behaviour of a model.
• The base model is a hypothetical explanation of object properties and behaviour that remains valid throughout the model.
• System is an articulate object under definite conditions that exists in the real world.
• An experimental frame is used to study a system in the real world, and covers aspects such as experimental conditions and objectives. A basic experimental frame consists of two sets of variables: the frame input variables and the frame output variables, which match the system or model terminals. The frame input variable is responsible for matching the inputs applied to the system or model. The frame output variable is responsible for matching the output values to the system or model.
• Lumped model is an exact explanation of a system that follows the specified conditions of a given experimental frame.
• Verification is the process of comparing two or more items to ensure their accuracy. In modelling and simulation, verification can be done by checking the consistency between a simulation program and the lumped model to ensure their performance.
• Validation is the process of comparing two results. In modelling and simulation, validation is performed by comparing experimental measurements with the simulation results within the context of an experimental frame. The model is considered invalid if the results do not match (Meng et al., 2004).
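The validation step in the list above can be sketched as a comparison between simulated and measured frame output variables. The use of the root mean square error and of an explicit tolerance are assumptions introduced here for illustration, since measurements carry noise and an exact match is rarely attainable; the function names are hypothetical.

```python
from math import sqrt


def root_mean_square_error(simulated, measured):
    """Average mismatch between simulated and measured output values."""
    if len(simulated) != len(measured):
        raise ValueError("both series must have the same length")
    squared = sum((s - m) ** 2 for s, m in zip(simulated, measured))
    return sqrt(squared / len(simulated))


def is_valid(simulated, measured, tolerance):
    """Accept the model if the mismatch stays within the chosen tolerance."""
    return root_mean_square_error(simulated, measured) <= tolerance


# Hypothetical output values observed within one experimental frame.
accepted = is_valid([1.0, 2.0, 3.0], [1.1, 1.9, 3.0], tolerance=0.2)
rejected = is_valid([1.0, 2.0, 3.0], [1.0, 4.0, 3.0], tolerance=0.2)
```

Setting the tolerance to zero reproduces the strict rule that any mismatch invalidates the model.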

Classification of Models

A system can be classified into the following categories:

• Discrete-Event Simulation Model: In this model, the state variable values change only at the discrete points in time at which events occur.
• Stochastic vs. Deterministic Systems: Deterministic systems are not affected by randomness and their output is not a random variable, whereas stochastic systems are affected by randomness and their output is a random variable.
• Static vs. Dynamic Simulation: Static simulation includes models that are not affected by time, e.g., the Monte Carlo model. Dynamic simulation includes models that are affected by time.
• Discrete vs. Continuous Systems: A discrete system is affected by state variable changes at discrete points in time. Its behaviour is depicted in the following graphical representation.
• Dynamical systems: Many natural phenomena can be modelled as dynamical systems. At any point in time, a dynamical system is characterized by its state. A state is represented by a set of state variables. For example, in the description of planetary motions around the sun, the set of state variables may represent the positions and velocities of the planets. Changes in the state over time are described by a transition function, which determines the next state of the system (over some time increment) as a function of its previous state and, possibly, the values of external variables (inputs to the system). This progression of states forms a trajectory of the system in its phase space (the set of all possible states of the system).

In general, metabolism is defined as the set of chemical reactions that occur inside the cell of a living organism. This engine controls and converts many organic compounds into different types of biomolecules to ensure the survival of the organism. A metabolism can be illustrated by an extensive reaction map that shows the metabolites involved and their interactions with each other during the biological process. Using this reaction map, the uncertain growth of microorganisms can be addressed more thoroughly in order to control bacterial infection (Ullah et al., 2006). To accomplish this, the computational approach has emerged as one of the most efficient methods for comprehending this complex network and for dealing with the large amounts of metabolic pathway data obtained from high-throughput experiments. The functional capabilities of pathways, in terms of metabolic fluxes through a network that includes the biomass production rate, can be studied and understood by carefully developing the modelling technique (Liberti and Kucherenko, 2005). This can be further enhanced by including the complete set of metabolic activities in the model to represent the global biochemical capabilities of the organism (Dunlop et al., 2007). Reactions are thus inferred from the genome annotation through pathway modelling.
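The notions of state, transition function, and trajectory introduced in the classification above can be made concrete with a small sketch. It assumes a toy system whose state is the pair (biomass, substrate), in which biomass grows by consuming substrate and an external input feeds substrate into the system. The update rule, the rate constant, and all names are hypothetical and serve only to illustrate the general scheme.

```python
def transition(state, external_input, dt=0.1):
    """Next state as a function of the previous state and an external input."""
    biomass, substrate = state
    # Uptake is proportional to both variables; the 0.5 rate constant is a
    # hypothetical value, and uptake can never exceed the available substrate.
    uptake = min(0.5 * biomass * substrate * dt, substrate)
    return (biomass + uptake, substrate - uptake + external_input * dt)


def trajectory(initial_state, inputs, dt=0.1):
    """Iterate the transition function to trace a path through phase space."""
    states = [initial_state]
    for external_input in inputs:
        states.append(transition(states[-1], external_input, dt))
    return states


# Closed system: no substrate is fed in during the 100 time increments.
path = trajectory((0.1, 1.0), inputs=[0.0] * 100)
```

Each element of `path` is one point in the two-dimensional phase space. With zero input the sum of biomass and substrate is conserved, which gives a simple consistency check on the transition function.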
There are a few prominent databases commonly used to model the network, including the Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg/), the Gene Database (GeneDB, http://www.genedb.org/Homepage), the BioModels Database (http://www.ebi.ac.uk/biomodels-main/), the Biochemical Reaction Kinetics Database (SABIO-RK, http://sabio.villa-bosch.de/), and REACTOME (http://www.reactome.org/ReactomeGWT/entrypoint.html). These databases store information on reactions, enzymes, and genes, which is vital for modelling the network. Analyzing the developed pathway model also improves the visualization and inference of related information from proteomics and genomics data within the relevant biological functions (Chao and Lin, 2007).

Structured Dynamical Systems

Many biological systems are structured, which means that they can be decomposed into individual components. The advancement of the state of the whole system is then seen as the result of the advancement of the states of its components. The operation of a gene regulatory network, for example, may be defined in terms of the activity of individual genes. A dynamical system organized into component subsystems is formally referred to as a structured dynamical system. The Cartesian product of the sets of state variables of the component subsystems yields the set of state variables for the entire system. As a result, the state transition function of the entire system may be characterized as the product of the state transition functions of these subsystems (Figure 2).

Figure 2. Flowchart of modelling and simulation in biological pathway.
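A structured dynamical system of the kind described in this section can be sketched with a small Boolean gene regulatory network. Each gene is a subsystem with its own state set and its own update rule; the global state space is the Cartesian product of the component state sets, and the global transition applies every component rule to the current global state. The three genes and their regulatory rules are hypothetical, and synchronous updating is an assumption of this sketch.

```python
from itertools import product

GENES = ("a", "b", "c")

# One transition rule per subsystem; each rule reads the global state.
UPDATE_RULES = {
    "a": lambda s: not s["c"],         # c represses a
    "b": lambda s: s["a"],             # a activates b
    "c": lambda s: s["a"] and s["b"],  # a and b jointly activate c
}


def all_states():
    """Cartesian product of the component state sets {False, True}."""
    return [
        dict(zip(GENES, values))
        for values in product((False, True), repeat=len(GENES))
    ]


def global_transition(state):
    """Apply every component rule at once to build the next global state."""
    return {gene: UPDATE_RULES[gene](state) for gene in GENES}


state_space = all_states()
next_state = global_transition({"a": True, "b": False, "c": False})
```

With three binary genes the state space holds eight global states, and starting from a = on, b = off, c = off the network moves to a = on, b = on, c = off.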

Pathway Analysis in Biological Systems

Phenomena that take place under in vivo and in vitro conditions are almost impossible to visualize at the molecular level. Innovative in silico techniques are therefore a promising approach for understanding complex biological events and open a new direction towards disease diagnosis (Chetta et al., 2021). In this regard, pathway analysis is a frequently used technology in the biological sciences for interpreting high-throughput biological data. These tools rely on the collection and use of biomolecular information, together with statistical testing and other methods. KEGG PATHWAY is a set of manually drawn pathway maps that represent the genetic information of specific enzymes, correlation networks, and molecular-level interactions between the enzymes involved in a signal transduction pathway. These maps support an in-depth understanding of a metabolic process and of the biological factors that may modulate it in an organism. This kind of detailed understanding can help in the diagnosis of human diseases and in subsequent knowledge-based drug development.

High-throughput sequencing and gene/protein profiling methods have revolutionized biological research by allowing complete monitoring of a biological system. Analysis of high-throughput data collected from a biological system provides a list of differentially expressed genes or proteins. This list is extremely useful for discovering genes that may play a role in a particular trait or phenomenon. For many researchers, however, the list typically falls short of providing mechanistic insights into the underlying biology of the condition under investigation. In this way, the introduction of high-throughput profiling technology has created a new challenge: interpreting vast lists of differentially expressed genes and proteins. One strategy for addressing the issue is to break down the large set of individual genes or proteins into smaller groups of similar genes or proteins, which simplifies the analysis. Individual genes and proteins are known to be involved in biological processes, components, or structures, and the knowledge bases detail how and where gene products interact with one another. Identifying groups of genes that participate in the same pathways is one illustration of this concept.
Analyzing high-throughput molecular observations at the functional level is particularly attractive for two reasons. First, grouping thousands of genes, proteins, and/or other biological molecules by the metabolic pathways in which they are involved reduces the complexity of the task. Second, identifying active pathways that differ between two conditions has greater explanatory value than a simple list of genes or proteins (Glazko and Emmert-Streib, 2009; Khatri et al., 2012). Singh et al. (2021) performed a network analysis of the proteins involved in host-pathogen interactions (HPI) in cardiovascular diseases (CVDs) caused by microbes. In this study, KOBAS 3.0 was used to perform ontology and pathway analysis for the whole (wHPI) and CVD-specific (cHPI) HPI networks of Papillomavirus, Herpes, and Influenza virus, as well as of bacteria such as Yersinia pestis and Bacillus anthracis. Topological properties of HPI networks in CVDs and all pathogens were studied using Cytoscape 3.5.1. This research provides a system-level understanding of cardiac damage induced by microbes. Apart from CVDs, models have also been built for several other diseases and for the understanding of molecular kinetics. Hendrata and Sudiono (2021) built a multiscale model of tumor response to a vascular endothelial growth factor (VEGF) inhibitor. Olsson and Noe (2019) built dynamic graphical models of molecular kinetics during enzyme-substrate reactions (Figure 3).

Figure 3. The integration of extracellular, cellular, molecular and tissue scales in the extended model (a), Network analysis by KEGG pathway (b), Schematic diagram of endothelial cell intracellular signaling (c).

Interactions Flow through Biological Pathway

Pathways are made up of a small number of basic network motifs (Milo et al., 2002). In a Petri net representation, in which entities are places that hold tokens and processes are transitions, transitions that receive numerous inputs function as rule-based regulators of token flow. A basic rule is that when the number of tokens on the input places is not equal, the number of tokens taken forward is determined by the input place with the fewest tokens. Where a transition has multiple outputs, the number of tokens on the upstream place is reflected on each of the downstream places, i.e., flow is preserved. Where outputs come directly from a place, the tokens on the upstream place are divided randomly amongst the downstream places.
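The three token-flow rules stated above can be sketched as small functions. The function names and the uniform random choice used to split tokens leaving a place are assumptions of this sketch, not details taken from a specific simulation tool.

```python
import random


def fire_transition(input_tokens):
    """Tokens taken forward: limited by the input place with the fewest."""
    return min(input_tokens) if input_tokens else 0


def transition_outputs(tokens_forward, n_outputs):
    """Every output place of a transition receives the full forward count."""
    return [tokens_forward] * n_outputs


def split_from_place(tokens, n_outputs, rng=random):
    """Tokens leaving a place directly are divided randomly among outputs."""
    shares = [0] * n_outputs
    for _ in range(tokens):
        shares[rng.randrange(n_outputs)] += 1
    return shares


forward = fire_transition([3, 5, 2])          # 2, set by the scarcest input
copies = transition_outputs(forward, 3)       # [2, 2, 2]
shares = split_from_place(10, 3, random.Random(0))
```

The random split never creates or destroys tokens, so the entries of `shares` always add up to the ten tokens that left the place, whichever seed is used.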

Feedback Control

Feedback loops are a fundamental feature of biological systems, and the ability to model them is an essential property of any modelling system. Inhibitor edges have a unique function. They originate from an entity (the inhibitor) and connect with a transition node, i.e., the process that is to be inhibited. This system includes two types of inhibitory edges, representing non-competitive and competitive inhibition. The first operates on the basis that if any tokens are present on the inhibitor, flow through the transition is completely blocked. In the second case, the number of tokens residing on the inhibitor is subtracted from the number of tokens flowing through the transition. As no tokens are lost through inhibitor edges, it has become standard practice to draw inhibitor places with an associated output transition (sink). Without this, flow through a negative feedback loop is completely and irrevocably stopped as tokens accumulated on the inhibitor remain there, blocking further flow through the target transition. An inhibitor with a sink transition node attached effectively has a half-life (as tokens are lost from the inhibitor via the sink), and depending on the configuration of the feedback loop, e.g., path length between input and inhibitor and type of inhibition (amongst other factors), the system may exhibit a range of oscillatory activities.
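The behaviour of the two inhibitor edge types, and the effect of attaching a sink to the inhibitor, can be explored with a minimal sketch of a negative feedback loop in which the output of a transition accumulates on its own inhibitor. The loop layout, the constant token supply, the rounding rule of the sink, and all names are assumptions made for illustration.

```python
from math import ceil


def inhibited_flow(incoming, inhibitor_tokens, mode):
    """Tokens that pass a transition targeted by an inhibitor edge."""
    if mode == "non-competitive":
        # Any token on the inhibitor blocks the transition completely.
        return 0 if inhibitor_tokens > 0 else incoming
    if mode == "competitive":
        # Inhibitor tokens are subtracted from the flow (never below zero).
        return max(incoming - inhibitor_tokens, 0)
    raise ValueError(f"unknown inhibition mode: {mode}")


def simulate_feedback(steps, source_tokens, mode, sink_fraction=0.5):
    """Return the flow per step when the output feeds the inhibitor place."""
    inhibitor = 0
    flows = []
    for _ in range(steps):
        flow = inhibited_flow(source_tokens, inhibitor, mode)
        flows.append(flow)
        inhibitor -= ceil(inhibitor * sink_fraction)  # tokens lost via sink
        inhibitor += flow                             # product accumulates
    return flows


with_sink = simulate_feedback(8, 4, "non-competitive")
without_sink = simulate_feedback(5, 4, "non-competitive", sink_fraction=0.0)
competitive = simulate_feedback(6, 4, "competitive")
```

With the sink attached, the non-competitive loop yields the periodic flow 4, 0, 0, 0, 4, 0, 0, 0. Without the sink it yields 4, 0, 0, 0, 0 and remains blocked. The competitive loop yields 4, 0, 2, 1, 2, 1, a damped alternation, which illustrates how the type of inhibition changes the oscillatory behaviour.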

Pathway Activation Measurement

Signaling pathway activation analysis is a powerful approach for extracting biologically relevant features from large-scale transcriptomic and proteomic data. However, modern pathway-based methods often fail to provide stable pathway signatures of a specific phenotype or reliable disease biomarkers. In silico Pathway Activation Network Decomposition Analysis (iPANDA) is a scalable and robust method for biomarker identification using gene expression data. The iPANDA method combines precalculated gene coexpression data with gene importance factors based on the degree of differential gene expression and pathway topology decomposition to obtain pathway activation scores. iPANDA was reported to identify highly robust sets of biologically relevant pathways using Microarray Analysis Quality Control (MAQC) data sets and pretreatment data sets on Taxol-based neoadjuvant breast cancer therapy. Application of the pathway activation measurement implemented in iPANDA leads to significant noise reduction in the input data and hence enhances the ability to produce highly consistent sets of biologically relevant biomarkers acquired on multiple transcriptomic data sets.

Another advantage of the approach is the high speed of computation. The gene grouping and topological weights are the most demanding parts of the algorithm in terms of computational resources. Fortunately, these steps can be precalculated once, before the actual calculations using transcriptomic data. The calculation time for processing a single sample is ∼1.4 s on an Intel (R) Core i3-3217U 1.8 GHz CPU (compared with 10 min for SPIA, 4 min for DART, and about 10 s for ssGSEA, GSEA and PLAGE). Thus, iPANDA can be an efficient tool for high-throughput biomarker screening of large transcriptomic data sets (Ozerov et al., 2016).

The use of microarray data for pathway activation analysis has well-known limitations, as it cannot address individual variations in the gene sequence and consequently in the activity of its product. For example, a gene can have a mutation that reduces the activity of its product but elevates its expression level through a negative feedback loop. Thus, the elevated expression of the gene does not necessarily correspond with an increase in the activity of its product.
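To make the idea of a pathway activation score concrete, the sketch below computes a weighted mean of signed log2 fold changes over the genes of one pathway. This is a deliberately simplified stand-in and not the published iPANDA formula: the gene weights stand for precalculated importance and topology factors, the roles (+1 for activators, -1 for repressors) and the normalisation are assumptions, and all names and expression values are hypothetical.

```python
from math import log2


def pathway_activation_score(case_expr, control_expr, gene_weights, gene_roles):
    """Weighted mean of signed log2 fold changes for one pathway.

    gene_weights: precalculated non-negative weight per gene (assumed given)
    gene_roles:   +1 if the gene activates the pathway, -1 if it represses it
    """
    numerator = 0.0
    total_weight = 0.0
    for gene, weight in gene_weights.items():
        fold_change = log2(case_expr[gene] / control_expr[gene])
        numerator += weight * gene_roles[gene] * fold_change
        total_weight += weight
    return numerator / total_weight if total_weight else 0.0


# Hypothetical two-gene pathway: g1 is an activator, g2 is a repressor.
score = pathway_activation_score(
    case_expr={"g1": 4.0, "g2": 1.0},
    control_expr={"g1": 1.0, "g2": 2.0},
    gene_weights={"g1": 1.0, "g2": 1.0},
    gene_roles={"g1": 1, "g2": -1},
)
```

The up-regulated activator and the down-regulated repressor both push the score upwards, giving 1.5 in this example. Because the weights do not depend on the sample, they can be computed once and reused, which mirrors the precalculation step described above.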
Nevertheless, comprehensive analysis of the tumour pathway activation profile may be a more clinically relevant strategy than other outcome prediction methods based on the gene expression profile for stratifying the subset of patients whose tumours could probably respond and who would clinically benefit from anti-cancer therapeutic regimens. While gene expression levels can be used effectively for phenotype prediction, it is quite possible that the most differentially expressed genes in a given signature will not be part of the pathways that actually drive tumour behaviour. Conversely, the expression of some genes within cancer-driving pathways is not always predictive of the overall pathway activation. Therefore, while there is no single preferential approach for interpreting gene expression results, the proposed method of transcriptomic data analysis at the signaling pathway level may not only be useful for discriminating between various biological or clinical conditions but may also aid in identifying functional categories or pathways that may be relevant as possible therapeutic targets.

Although the iPANDA algorithm was initially designed for microarray data analysis, it can also be easily applied to data derived from genome-wide association studies (GWAS). To do so, GWAS data can be converted into a form amenable to the iPANDA algorithm. Single-point mutations are assigned to genes based on their proximity to the reading frames. Each single-point mutation is then given a weight derived from a statistical analysis of the GWAS data (Torkamani et al., 2008). Simultaneous use of GWAS data along with microarray data may improve the predictions made by the iPANDA method.

One of the rapidly emerging areas in biomedical data analysis is deep learning (Mamoshina et al., 2016). Recently, several successful studies on microarray data analysis using various deep learning approaches on gene-level data have been published (Hira and Gillies, 2015). Using pathway activation scores may be an efficient way to reduce the dimensionality of transcriptomic data for drug discovery applications while maintaining biologically relevant features (Aliper et al., 2016). From an experimental point of view, gene regulatory networks are controlled via the activation or inhibition of a specific set of signaling pathways. Thus, using the iPANDA signaling pathway activation scores as input for deep learning methods could bring results closer to experimental settings and make them more interpretable to bench biologists. One of the most difficult steps of multilayer perceptron training is the dimension reduction and feature selection procedure, which aims to generate the appropriate input for further learning (Ibrahim et al., 2014). Signaling pathway activation scoring using iPANDA will likely help reduce the dimensionality of expression data without losing biological relevance and may be used as an input to deep learning methods, especially for drug discovery applications. Using iPANDA values as input data seems to be a particularly promising approach to obtaining reproducible results when analysing transcriptomic data from multiple sources.
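The conversion of GWAS data into gene-level input mentioned above, in which single-point mutations are assigned to genes by proximity and carry a statistically derived weight, can be sketched as follows. The sketch assumes a single chromosome, genes given as (start, end) coordinates, a maximum assignment distance, and summation of weights per gene; these choices, the names, and the coordinates are hypothetical and are not taken from the cited procedure.

```python
def assign_snps_to_genes(snps, genes, max_distance):
    """Add each SNP weight to the nearest gene within max_distance.

    snps:  list of (position, weight) pairs on one chromosome
    genes: mapping of gene name to (start, end) coordinates
    """
    gene_weights = {name: 0.0 for name in genes}
    for position, weight in snps:
        best_gene, best_distance = None, None
        for name, (start, end) in genes.items():
            if start <= position <= end:
                distance = 0  # the SNP lies inside the gene
            else:
                distance = min(abs(position - start), abs(position - end))
            if best_distance is None or distance < best_distance:
                best_gene, best_distance = name, distance
        if best_gene is not None and best_distance <= max_distance:
            gene_weights[best_gene] += weight
    return gene_weights


# Hypothetical coordinates and GWAS-derived weights.
genes = {"geneA": (100, 200), "geneB": (500, 600)}
snps = [(150, 2.0), (220, 1.0), (480, 0.5), (1000, 3.0)]
weights = assign_snps_to_genes(snps, genes, max_distance=50)
```

The first two mutations are attributed to geneA and the third to geneB, while the fourth lies too far from any gene and is discarded. The resulting gene-level weights could then play the role that differential expression plays for microarray data.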
Metabolic plasticity allows cancer cells to adjust their metabolic phenotypes to adapt to hostile environments. There is an urgent need to understand the crosstalk between gene regulation and metabolic pathways underlying cancer metabolic plasticity.

Future Aspects and Challenges

In silico analyses and trial simulations have now become crucial tools for decision-making and communication with regulatory bodies, complementing experimental techniques in pharmaceutical R&D. Even though biology is multiscale by nature, most project work and software tools focus on isolated elements of drug action, such as pharmacokinetics at the organism scale or


Manish Paul, Saikat Chakrabarti and Amrita Banerjee

pharmacodynamic interaction at the molecular level. PK-Sim® and MoBi® are modelling and simulation software platforms capable of designing and simulating models that interact across biological scales. Abnormal signal transmission leads to cell cycle transition and proliferation, which drive tumour development. The active metabolite inhibits Raf kinase in a signalling cascade that is reported to promote the cell cycle. The resulting patient-specific chemotherapeutic intervention is simulated for a large population with varied genetic profiles in virtual clinical research. As a result, the platform facilitates the development of models and the integration of biological knowledge and previous data at all biological scales. Observations in animal research and clinical trials can be related to experimental in vitro model systems. This mechanistic, insight-driven multiscale modelling technique may be used to study the interactions between patients, diseases, and drugs, as well as subjects of high clinical importance such as pharmacogenomics and drug-drug and drug-metabolite interactions.

Probably the most important challenge today is that the knowledge embedded in these pathways about how various genes interact with each other is not currently exploited. The very purpose of pathway diagrams is to capture some of our knowledge about how genes interact and regulate each other. However, the existing analysis approaches consider only the sets of genes involved in these pathways, without taking their topology into consideration. Our understanding of various pathways is expected to improve as more data are gathered, and pathways will be modified by adding, removing, or redirecting links on the pathway diagrams. Most existing techniques are completely unable even to sense such changes. Thus, these techniques will provide identical results as long as the pathway diagram involves the same genes, even if the interactions between them are completely redefined over time (Draghici et al., 2007).
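The limitation described above can be made concrete with a small Python sketch of an over-representation test, used here as a generic example of a gene-set method rather than as any specific published tool. The pathway, its two alternative wirings, the list of differentially expressed genes, and the universe size are all hypothetical. Because the test looks only at gene membership, the two versions of the pathway receive exactly the same p-value.

```python
from math import comb


def hypergeom_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn from n_universe,
    of which n_pathway belong to the pathway (over-representation test)."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def ora_pvalue(pathway, de_genes, n_universe):
    """Over-representation p-value; only the pathway's gene set is used,
    the 'edges' entry is never read."""
    genes = pathway["genes"]
    overlap = len(genes & de_genes)
    return hypergeom_pvalue(n_universe, len(genes), len(de_genes), overlap)


# Two hypothetical versions of one pathway: same genes, rewired interactions.
version_1 = {"genes": {"A", "B", "C", "D"},
             "edges": {("A", "B"), ("B", "C"), ("C", "D")}}
version_2 = {"genes": {"A", "B", "C", "D"},
             "edges": {("D", "A"), ("A", "C"), ("C", "B")}}

de_genes = {"A", "B", "X"}  # hypothetical differentially expressed genes
n_universe = 20             # hypothetical number of genes measured

p1 = ora_pvalue(version_1, de_genes, n_universe)
p2 = ora_pvalue(version_2, de_genes, n_universe)
print(p1, p2, p1 == p2)
```

A topology-aware method would need to read the `edges` entry as well, which is the information the sketch deliberately ignores.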

Conclusion

A fundamental understanding of how biological signalling pathways are modelled gives newcomers from many disciplines (students, researchers, and others) a general orientation as they enter this pioneering and demanding field, where they may reveal new aspects of biology. This new branch of biology, which offers a powerful means of comprehending the complexity of signal transduction systems, is currently the


subject of many studies. Signal transduction research is being accelerated by the use of mathematical and kinetic models to interpret cell biology data. Experimental biologists have recently started to employ computer-based modelling to improve their grasp of governing principles, theories, and novel discoveries in biological systems. Adopting a computational systems biology approach will help elucidate the hierarchy of the signalling network in light of new experimental results and the ever-increasing volume of data. This quantitative approach to organizing biological knowledge has been essential for finding novel signalling pathways, and the strategy is now being extended to more complicated networks. Systems biology is concerned with the principles by which cellular networks operate in a coordinated way, as well as with their design. Using the computational techniques and experimental framework described above, one can anticipate how a model behaves dynamically at different levels of signal abstraction. These predictions, together with their validation, will be useful in the development of new biological theories. Systems biology is a key area of biology that offers a quantitative way to analyze massive amounts of data. Recent advances in computational biology technologies have opened new possibilities for combining interdisciplinary knowledge to solve critical biological challenges.

References

Aderem, A., (2005). Systems biology: its practice and challenges. Cell. 121, 511–513.
Aliper, A., Plis, S., Artemov, A., Ulloa, A., Mamoshina, P. and Zhavoronkov, A., (2016). Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data. Mol. Pharm. 13, 2524–2530.
Baldan, P., Cocco, N., Marin, A., Simeoni, M., (2010). Petri nets for modelling metabolic pathways: a survey. Nat Comput. 9, 955–989.
Bertalanffy, L.V., (1969). General System Theory, Foundations, Development, Applications. 1st ed. George Braziller: New York.
Breda, A., Valadares, N.F., de Souza, O.N., Garratt, R.C., (2007). Protein structure, modelling and applications. In Bioinformatics in tropical disease research: a practical and case-study approach [Internet]. National Center for Biotechnology Information (US).
Bruggeman, F.J., Hornberg, J.J., Boogerd, F.C. and Westerhoff, H.V., (2007). Introduction to systems biology. In Plant systems biology (pp. 1-19). Birkhäuser Basel.
Carlo, C., (2008). Introduction to System Biology [online]. Università degli Studi Magna Graecia, Catanzaro, Italy. Available from: http://users.ece.cmu.edu/~brunos/Lecture1.pdf.


Ceric, S., Kurtanjek, Z., (2006). Model Identification, Parameter Estimation, and Dynamic Flux Analysis. Chem Biochem Eng Q. 3, 243-253.
Chao, O., Lin, W. Comparison Between PSO and GA for Parameters Optimization of PID Controller. Proceedings of IEEE Int Conf on Mechatronics and Automation; 2007 June 25-28; Luoyang, Henan. IEEE 2006.
Chetta, M., Tarsitano, M., Vicari, L., Saracino, A. and Bukvic, N., (2021). In Silico Analysis of Possible Interaction between Host Genomic Transcription Factors (TFs) and Zika Virus (ZikaSPH2015) Strain with Combinatorial Gene Regulation; Virus Versus Host—The Game Reloaded. Pathogens. 10, p.69.
Draghici, S., Khatri, P., Tarca, A.L., Amin, K., Done, A., Voichita, C., Georgescu, C. and Romero, R., (2007). A systems biology approach for pathway level analysis. Genome Res. 17, 1537-1545.
Du, X., Li, Y., Xia, Y.L., Ai, S.M., Liang, J., Sang, P., Ji, X.L. and Liu, S.Q., (2016). Insights into protein–ligand interactions: mechanisms, models, and methods. Int. J. Mol. Sci. 17, p.144.
Dunlop, M.J., Franco, E., Murray, R.M., (2007). A Multi-Model Approach to Identification of Biosynthetic Pathways. Proceedings of American Control Conference (ACC); 2007 July 11-13; New York, USA. IEEE.
Forbus, K.D., (1996). Qualitative reasoning. CRC Handbook of Computer Science and Engineering. p. 715–733.
García-Campos, M.A., Espinal-Enríquez, J. and Hernández-Lemus, E., (2015). Pathway analysis: state of the art. Front. Physiol. 6, p.383.
Glazko, G., Emmert-Streib, F., (2009). Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics. 25, 2348–2354.
Hira, Z.M., Gillies, D.F., (2015). A review of feature selection and feature extraction methods applied on microarray data. Adv. Bioinformatics. 2015, 198363.
Ibrahim, R., Yousri, N.A., Ismail, M.A., El-Makky, N.M., (2014). Multi-level gene/MiRNA feature selection using deep belief nets and active learning. Conf. Proc. IEEE Eng. Med. Biol. Soc. 2014, 3957–3960.
Karaboga, D., Gorkemli, B., Ozturk, C., Karaboga, N., (2012). A comprehensive survey: artificial bee colony (ABC) algorithm and applications. Artif. Intell. Rev. 42, 21-57.
Khatri, P., Sirota, M. and Butte, A.J., (2012). Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput. Biol. 8, p.e1002375.
King, R.D., Garrett, S.M., Coghill, G.M., (2005). On the use of qualitative reasoning to simulate and identify metabolic pathways. Bioinformatics. 21, 2017–2026.
Kirschner, M.W., (2005). The meaning of systems biology. Cell. 121, 503–504.
Kitano, H., (2002). Computational systems biology. Nature. 420, 206–210.
Kitano, H., (2002). Systems biology: a brief overview. Science. 295, 1662–1664.
Kuipers, B., (1994). Qualitative reasoning: modeling and simulation with incomplete knowledge. Cambridge: The MIT Press.
Lee, D., Yun, C., Cho, A., Hou, B.K., Park, S., Lee, S.Y., (2006). Webcell: a web-based environment for kinetic modeling and dynamic simulation of cellular networks. Bioinformatics. 22, 1150–1151.


Lee, K.W., Choo, H.S., (2011). A critical review of selective attention: An interdisciplinary perspective. Artif Intell Rev. 40, 27-50.
Liberti, L., Kucherenko, S., (2005). Comparison of deterministic and stochastic approaches to global optimization. International Transactions in Operational Research. 12, 263–285.
Lillacci, G., Khammash, M., (2010). Parameter Estimation and Model Selection in Computational Biology. PLoS, 6, 1-17.
Liu, J.M., Chen, Y.W., (2011). Toward understanding the optimization of complex systems. Artif Intell Rev.
Ljung, L., (1998). System identification: theory for the user. New York: Pearson Education.
Mamoshina, P., Vieira, A., Putin, E., Zhavoronkov, A., (2016). Applications of deep learning in biomedicine. Mol. Pharm. 13, 1445–1454.
Meng, T.C., Somani, S., Dhar, P., (2004). Modeling and simulation of biological systems with stochasticity. In Silico Biol. 4, 293-309.
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U., (2002). Network motifs: simple building blocks of complex networks. Science. 298, 824–827.
Olsson, S. and Noé, F., (2019). Dynamic graphical models of molecular kinetics. Proceedings of the National Academy of Sciences, 116(30), pp.15001-15006.
Ozerov, I.V., Lezhnina, K.V., Izumchenko, E., Artemov, A.V., Medintsev, S., Vanhaelen, Q., Aliper, A., Vijg, J., Osipov, A.N., Labat, I., West, M.D., (2016). In silico Pathway Activation Network Decomposition Analysis (iPANDA) as a method for biomarker development. Nat. Commun. 7, pp.1-11.
Pang, W., Coghill, G.M., (2011). An immune-inspired approach to qualitative system identification of biological pathways. Nat Comput. 10, 189–207.
Pang, W., Coghill, G.M. An immune network approach to learning qualitative models of biological pathways. In: IEEE congress on evolutionary computation (CEC), 2014. July 2014. p. 1030–1037.
Seidi, S., Eftekhari, A., Khusro, A., Heris, R.S., Sahibzada, M.U.K. and Gajdács, M., (2022). Simulation and modeling of physiological processes of vital organs in organ-on-a-chip biosystem. Journal of King Saud University-Science, 34, p.101710.
Singh, N., Rai, S., Bhatnagar, R. and Bhatnagar, S., (2020). Network analysis of host-pathogen protein interactions in microbe induced cardiovascular diseases. In Silico Biol. (Preprint), pp.1-19.
Torkamani, A., Topol, E.J., Schork, N.J., (2008). Pathway analysis of seven common diseases assessed by genome-wide association. Genomics. 92, 265–272.
Ullah, M., Schmidt, H., Cho, K.H., Wolkenhauer, O., (2006). Deterministic modelling and stochastic simulation of biochemical pathways using MATLAB. IEE Proc. Syst. Biol. 153, 53-59.
Westerhoff, H.V., (2005). Systems biology ... in action. Curr. Opin. Biotechnol. 16, 326–328.
Wiener, N., (1948). Cybernetics or Control and Communication in the Animal and the Machine. 2nd ed. MIT Press: Cambridge, MA.
Zewde, N.T., (2019). Multiscale Solutions to Quantitative Systems Biology Models. Front. Mol. Bio., p.119.

Chapter 13

Bioinformatics Tools and Databases for Genomics Research

Charles Oluwaseun Adetunji1,*, Frank Abimbola Ogundolie2, Olugbemi Tope Olaniyan3, Sujata Dash4, Omosigho Omoruyi Pius1, Kehinde Kazeem Kanmodi5 and Lawrence Achilles Nnyanzi5

1Applied Microbiology, Biotechnology and Nanotechnology Laboratory, Department of Microbiology, Edo University Iyamho, Auchi, Edo State, Nigeria
2Department of Biotechnology, Baze University Abuja, Nigeria
3Laboratory for Reproductive Biology and Developmental Programming, Department of Physiology, Edo University Iyamho, Nigeria
4Department of Information Technology, Nagaland University, Dimapur, Nagaland, India
5School of Health and Life Sciences, Teesside University, Middlesbrough, United Kingdom

Abstract

Genomic research has contributed substantially to the improvements seen today across sectors ranging from biomedical, forensic, industrial, pharmaceutical, and drug discovery to agricultural and medical research. These improvements were made possible by the emergence of bioinformatics, chemoinformatics, and computational biology. The use of bioinformatics tools to analyze and store biological data on the structural and functional processes or properties of proteins depends on databases and repositories of the DNA, RNA, or amino acid sequences that code for such proteins. This chapter provides a brief overview of the use of bioinformatics tools and databases for genomic research in various sectors.

* Corresponding Author’s Email: [email protected]; Tel: +2348039120079. ORCID iD is 0000-0003-3524-6441.

In: Advances in Bioinformatics and Big Data Analytics Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9 © 2023 Nova Science Publishers, Inc.

Keywords: DNA/RNA, bioinformatics tools, database, modelling, biomedical, visualization, drug discovery

Introduction

Genomics research has transformed many areas of science, bringing improved medical diagnosis, treatments in the form of precision or personalized medicine, improved agricultural output, and a shortening of the traditionally time-consuming steps and stages required to produce vaccines. With the advent of genomic research, vaccines are now made in record time. The use of bioinformatics tools, such as databases and computer-based biological applications for analyzing, detecting, and predicting possible treatment plans from the genomic data at our disposal, has shifted the field away from the more tedious trial-and-error or conventional methods (Adetunji et al., 2022a-k; Olaniyan et al., 2022a, b; Oyedara et al., 2022). Over the years, the advancement of bioinformatics has improved genomic research through bio-computational analyses of genomic, proteomic, transcriptomic, and other omics data (Manzoni et al., 2018). Bioinformatics tools, databases, and genomic web servers and browsers have given the world of genomic research a massive boost. Examples include sequence alignment tools (Clustal W, ClustalO), gene prediction tools and databases, primer design tools such as Primer-BLAST, SwissDock, PRATT, the Protein Data Bank (PDB), NCBI, QMEAN, single nucleotide polymorphism (SNP) databases, ProtParam, UniProt and UniProtKB, the Cellosaurus knowledge resource on cell lines, SWISS-MODEL, SwissLipids, SwissDrugDesign, GlycoDigest, GlycoMod, the Eukaryotic Promoter Database (EPD), and expression databases, among others. These innovations have


affected all aspects of scientific life, with notable improvements in medical research. They have improved medical diagnosis, the prediction of better disease management options, drug design (Wishart, 2005; Li et al., 2020; Rastogi et al., 2018), drug discovery (Xia, 2017; Ramharack and Soliman, 2018), and vaccine research (Altindis et al., 2015; Soria-Guerra et al., 2015; María et al., 2017), and have led to a better understanding and treatment of genetic diseases (Xia, 2017), among others. Teufel et al. (2006) explored various applications of bioinformatics tools and web-based servers in biomedical genomic research. They reviewed the current status of the use of omics in diagnosing several genetic diseases and analyzing big data. Angarica and Sol (2017) reported the roles of several bioinformatics tools in protein dynamics, computational epigenetics, and genome-wide research on how epigenetic information helps maintain critical cellular processes in humans. According to their report, computational epigenetics is being used to better understand cellular interactions at the genetic and molecular levels, as well as how these interactions affect disease pathogenesis.

Relevance of Bioinformatics Tools and Databases for Genomics Research

Singh et al. (2016) reported that dedicated computers and algorithms in bioinformatics, chemoinformatics, computational biology, and drug discovery platforms are used to analyze and store biological data on structural and functional biological processes. The authors reported that these platforms are commonly used in the pharmaceutical, forensic, health, drug design, food, and agricultural sectors. They also noted that bioinformatics tools are used to organize data so that researchers can access it easily. These platforms are also designed to provide resources for a proper understanding of biological data and the interpretation of results. Some of the databases noted by the authors are GenBank (a genetic sequence data bank), PIR (Protein Information Resource), SWISS-PROT, SCOP (a classification of familial and structural protein relationships), PDB (Protein Data Bank), and CATH (a hierarchical classification of protein domain structures). Biomedical and genomic data are managed with bioinformatics tools for healthcare management and analysis. Biomedical informatics falls under bioinformatics, which is applied to solve biomedical challenges at the cellular


level, such as developing novel diagnostics, vaccines, and therapeutics and understanding infectious disease mechanisms, transmission cycles, and pathogen-host interactions. Bioinformatics tools are also currently used in developing novel treatments, realizing the Human Genome Project, predicting which individuals are at high risk, identifying disease susceptibility genes, classifying diseases, assessing adverse drug reactions, enhancing drug efficacy, and analyzing proteomic data. In drug discovery, the development of a candidate drug is expensive, complex, and time-consuming. Bioinformatics tools therefore offer a more objective, rational, structure-based approach to predicting agents or candidate molecules. They target biological pathways to give a proper understanding of the mode of action of drug molecules and the pathophysiology of the disease. Forensic data and bioinformatics tools such as CODIS (Combined DNA Index System), MDKAP (Mass Disaster Kinship Analysis Program), DNA-View, and MFISys (Mass Fatality Identification System) are used for forensic analysis. In the agricultural sector, plant species and varieties are selected, and many traits involved in ensuring quality and disease resistance are analyzed, through systematic functional analysis with bioinformatics tools. These platforms also give researchers access to plant bioinformatics information on mutations, maps, markers, and applicable discoveries. Through proper sequencing and data analysis, annotations of genes, phenotypes, and proteins are generated in crop bioinformatics for simulation, in silico modeling, and integration into next-generation plant breeding. In the food industry, varieties of essential crops and fruits have been engineered through biotechnology, and safety assessments, for example for food allergens, have been carried out using bioinformatics resources such as FASTA and the Structural Database of Allergenic Proteins to determine the immunoglobulin E (IgE) binding potential of food proteins. 
Proteins used in a broad spectrum of industrial processes can be genetically engineered to enhance their properties for practical use in various biotechnology processes (Ogundolie, 2015; Ayodeji et al., 2017; Ogundolie, 2021; Ogundolie et al., 2022) and in drug delivery (Adetunji et al., 2023a, b). Personalized food functions and the prediction of different food disorders may be achieved in the future with bioinformatics tools. Furthermore, bioinformatics, proteomics, and genomics databases allow the analysis of essential loci and genes regulating traits targeted by transgenic modification, such as improved carcass composition, increased growth rate, changed milk composition, enhanced reproductive performance, enhanced


feed utilization, enhanced disease resistance, and improved mohair production. Mwololo et al. (2010) noted that bioinformatics draws on extensive biological data to manage information efficiently and to support the interpretation of results. Many genes of unknown function are analyzed through a combination of bioinformatic function prediction, transcriptomic profiling, and experimental approaches. In functional genomics and bioinformatics, microarray techniques are currently used for comparative genomic hybridization, messenger RNA (mRNA) analysis, assessment of genomic rearrangements, and gene expression profiling. Lei et al. (2021) reported the use of bioinformatics to analyze the rice genome. The authors summarized the progress made in rice genome research and breeding through machine learning and single-cell sequencing. Seung et al. (2006) applied bioinformatics to plant biology through data management, integration, visualization, analysis, prediction, and modeling. The authors highlighted important fundamental concepts such as biological sequences, transcriptome analyses, computational metabolomics, computational proteomics, biological databases, and bio-ontologies. Orton et al. (2016) noted that the use of bioinformatics in viral genomics is receiving unprecedented attention owing to technological advances such as high-throughput sequencing, which can be carried out rapidly and at low cost to produce large data sets for viral genome analysis, intra-host viral diversity analysis, and species dynamics. These methods help answer questions about viral transmission, discovery, host jumping, and vaccine resistance. Complex data sets require new bioinformatics software and algorithms for diverse RNA sequencing and metagenomics activities. Dubravko et al. (2003) provided background information on the role of macromolecular databases in solving complex biological challenges. Xiaokang et al. (2021) reported that gene expression datasets can be analyzed with machine learning to discover diverse biomarkers. 
The authors further noted that these biomarkers are of significant importance in several fields of science. The close link between biological samples and computational analysis facilitates a better understanding of complex domains of scientific investigation. Through algorithm development, bioinformatics research continues to gain relevance in the study of genes, biological pathways, proteins, oncology, signal transduction mechanisms, and pathophysiology. Sridhar et al. (2012) wrote on the analysis of nutritional physiology using bioinformatics tools. The authors demonstrated that omics technologies can generate large datasets that have a central role in dietary


investigations through genomics, transcriptomics, epigenomics, metabolomics, and proteomics. Complete and accurate genomic sequence datasets are required for proper analysis in genomics research (Adetunji et al., 2022l; Adetunji et al., 2022m; Behera et al., 2016; Dash and Abraham, 2018; Dash et al., 2019; Rahman et al., 2021). Bioinformatics entails the development of statistical tools and computational techniques for analyzing complex biological data (Behera et al., 2016; Dash and Abraham, 2018; Dash et al., 2019; Rahman et al., 2021). This involves designing new statistical algorithms, carrying out analyses with these algorithms and computational tools, and managing and storing the resulting datasets. Bioinformatics tools are high-throughput techniques that use computational programs to generate data rapidly. These programs can acquire, classify, analyze, and store enormous amounts of data for subsequent retrieval. Because the data sets are so large, extensive computer hardware, appropriate and well-maintained software, suitable data storage, and new statistical techniques are all required (Dash et al., 2017; Rahman et al., 2018; Sahu et al., 2018; Dash et al., 2020; Dash et al., 2021). Bioinformatics tools used for marker discovery, marker development, association analyses, gene prediction, data management, and data storage include AutoSNP, SNP2CAPS, TASSEL, STRUCTURE, microarray software, ACeDB (A C. elegans Database), MAPMAN, GenScan, and ClustalW. Kumlachew et al. (2015) noted that bioinformatics is currently used in plant disease management. The authors pointed out that large biological datasets generated through proteomics and genomics are analyzed and integrated using computational tools to help model, predict, visualize, and manage plant pathology. Mapping entire plant genomes and pathogenic traits can improve understanding of plant diseases, including their causes, prevention, progression, transmission, and interactions. 
Ibiam and Ekwe (2012) reported that bioinformatics helps to understand biological processes because it focuses on computational approaches such as data mining, visualization, and machine learning algorithms. Holton et al. (2013) described the application of bioinformatics in nutritional and food research through wiki-like food databases. The authors noted that food and nutrition play an important role in hormonal, genetic, physical, metabolic, disease, and mental regulation. They suggested that bioinformatics tools can be applied, using omics techniques, to study food bioactive peptides, food quality and safety, food composition databases, allergen detection, prebiotics and probiotics, and food molecular structure.


Daisuke et al. (2006) noted that cancer research focusing on gene structure and function can be studied and interpreted better with bioinformatics prediction tools. Large data repositories in cancer research facilitate microarray data analysis, interpretation, and pattern mining of DNA microarray and proteomics databases (Dash et al., 2017; Rahman et al., 2018; Sahu et al., 2018; Dash et al., 2020; Dash et al., 2021). Various platforms, resources, and algorithms are available to manage large data sets from cancer research, such as BLAST, FASTA, KEGG, UniProt, GoMiner, GenMAPP, GoSurfer, GO tree, MGI, GEO, ArrayExpress, Microarray Gene Expression Data, Significance Analysis of Microarrays, NUDGE, hidden Markov models (HMMs), SMART, PROSITE, ELM, Pfam, PSORT, Protein Function Prediction, SABLE, PORTER, COILS, PONDR, TMHMM, HMMTOP, MODELLER, FUGUE, and SPARKS. Some of these bioinformatics tools, such as MODELLER, molecular dynamics (MD) simulation, and the PDB database, have also been used for drug design and discovery (Fadare et al., 2021). New data generated from genomes, metabolites, metabolic pathways, and proteins can be analyzed and interpreted using next-generation sequencing, omics, and bioinformatics.

Databases for Genomics Research

A database allows a large amount of data on a specific topic to be stored electronically and subsequently updated, searched, and retrieved. Three main types of databases exist: flat-file, hierarchical, and relational. The flat file is the earliest and simplest type; it is easy to set up and suited to storing small amounts of data, but poorly suited to complex storage needs. Relational databases hold data in tabular form, indexed according to their distinguishing features. The SQL programming language can be used to construct such databases so as to reduce redundancy, facilitate rapid data searches, and answer complex questions accurately. In a hierarchical database, data are structured like an ordered tree, which allows simple organization and fast search and retrieval; however, hierarchical databases require more space and are time-consuming. Database management systems are required to search, organize, access, and retrieve data using operational instructions. Studies have shown that biological databases are built on storage, retrieval, and management systems for protein and DNA sequence data. Examples of nucleotide sequence databases are GenBank, DDBJ, EMBL, and GSDB, while Swiss-Prot, PIR, TrEMBL, and MIPS are primary databases mainly for protein sequences.
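The relational approach described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not the schema of any real sequence database: the table names, column names, accession numbers, and sequences are all invented. It shows how storing each organism once reduces redundancy, how an index supports fast lookups, and how a single SQL query can answer a question that spans two tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Each organism is stored once; sequences refer to it by key, which avoids
# repeating the organism name on every record as a flat file would.
cur.execute("CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
cur.execute(
    """CREATE TABLE sequence (
           accession   TEXT PRIMARY KEY,
           organism_id INTEGER NOT NULL REFERENCES organism(id),
           molecule    TEXT NOT NULL,   -- 'DNA' or 'protein'
           residues    TEXT NOT NULL
       )"""
)
cur.execute("CREATE INDEX idx_sequence_organism ON sequence(organism_id)")

# Hypothetical records; accessions and sequences are invented.
cur.executemany(
    "INSERT INTO organism (id, name) VALUES (?, ?)",
    [(1, "Oryza sativa"), (2, "Solanum lycopersicum")],
)
cur.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?, ?)",
    [
        ("EX0001", 1, "DNA", "ATGGCCATTG"),
        ("EX0002", 1, "protein", "MAIVK"),
        ("EX0003", 2, "DNA", "ATGAAATTTGGG"),
    ],
)
conn.commit()

# A question spanning both tables: number of DNA records per organism
# and their combined length.
rows = cur.execute(
    """SELECT o.name, COUNT(*), SUM(LENGTH(s.residues))
       FROM sequence AS s JOIN organism AS o ON o.id = s.organism_id
       WHERE s.molecule = 'DNA'
       GROUP BY o.name
       ORDER BY o.name"""
).fetchall()
print(rows)
conn.close()
```

Answering the same question from a flat file would mean scanning and parsing every record, which is manageable for small collections but scales poorly.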


Milanowska et al. (2011) used bioinformatics and database tools to analyze DNA repair. The authors noted that DNA is constantly exposed to different damaging agents that cause lesions, mutations, and disease. These agents include toxic chemicals, ionizing radiation, ultraviolet light, and reactive metabolites, and their effects at the cellular level include ageing, cell death, lesions, cancer, mutations, dysfunction, altered metabolism, and disease. To prevent these types of damage, cells have various internal mechanisms: repair systems (the biochemical and physiological pathways that eliminate single-strand lesions, such as base excision repair and nucleotide excision repair), translesion synthesis, in which specialized polymerases temporarily take over from lesion-arrested DNA polymerases during S phase, homologous recombination repair, non-homologous end-joining repair, and a DNA damage response system. Brian and Charlotte (2017) noted that advances in bioinformatics have been driven by the challenges of applying genomics in the biomedical field. Hong et al. (2011) showed that next-generation sequencing has revolutionized genome sequencing and produces large amounts of data that require bioinformatics for analysis and interpretation. In modern genome analysis, two primary sequence-based biomarkers are currently used: single nucleotide polymorphisms (SNPs) and simple sequence repeats (SSRs). These two marker types are used for association studies, biomarker selection, diversity analysis, and trait mapping. The Gene Ontology Consortium's GO website provides a wide range of literature and documentation on gene ontology projects covering several domains of cellular and molecular processes, including structure, classifications, and controlled vocabularies. Roumpeka et al. (2017) reported on the role of bioinformatics tools in the analysis of metagenomic sequence data and in bioprospecting.
In their study, the authors analyzed organisms found in a microbiome population using DNA sequencing, metagenomics, and bioinformatics tools for assembly, gene prediction, annotation, and data sharing, with the aim of novel gene discovery. Suresh et al. (2014) combined applied and basic research information in an integrated repository to analyze the tomato genome. In this study, the authors presented the twelve tomato chromosomes in a single window using microRNAs, quantitative trait loci, simple sequence repeats, and the Tomato-EXPEN 2000 genetic map. Lincoln et al. (2002) reported that reusable software for different model organism system databases could be developed through the Generic Model Organism System Database Project. The project's generic genome browser displays some


genomic features and annotations on a web page and offers simple installation, easy integration, and flexible configuration. Genomics databases can serve as repositories of DNA sequences from animal, plant, or microbial species for research. Many of these data are maintained electronically using software algorithms and user interfaces. One example is GenBank, which contains DNA nucleotide sequences from multiple genome sources. Human genome databases contain more than 3 billion nucleotides stored electronically. Several databases also exist for plants, animals, other mammalian species, invertebrates, and single-celled microorganisms. Jeyachandran et al. (2020) reported that proteomics and genomics databases, for instance the Protein Data Bank, are important electronic platforms for sharing, storing, retrieving, and analyzing data for research purposes. Currently, many researchers obtain the DNA sequences they need for different organisms by accessing other genomic databases electronically, including those holding transcriptomic and epigenetic data. Eric et al. (2021) noted that the National Center for Biotechnology Information provides a large amount of data through online resources such as the PubMed database, the GenBank nucleic acid sequence database, PMC, Genome Data Viewer, Bookshelf, SRA, dbSNP, ClinVar, dbVar, BLAST, Pathogen Detection, Primer-BLAST, iCn3D, PubChem, and IgBLAST. These platforms support research activities worldwide. Andreas et al. (2006) noted that the rich availability of genomic sequences is driving advances in human genomics and molecular medicine through next-generation high-throughput genomic analysis and bioinformatics.
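To give a sense of scale for the figure of more than 3 billion nucleotides mentioned above, the following Python sketch works through the storage arithmetic. It assumes a sequence made up only of A, C, G, and T, so that each base fits in two bits; real genome files also contain ambiguity codes such as N plus annotation, so the numbers are a lower bound rather than the size of any actual database. The function names and the example sequence are illustrative.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


seq = "GATTACAGATTACA"
packed = pack(seq)
restored = unpack(packed, len(seq))

genome_length = 3_000_000_000          # bases, order of magnitude only
bytes_as_text = genome_length          # one byte per base as plain text
bytes_packed = genome_length * 2 // 8  # two bits per base

print(len(seq), "bases ->", len(packed), "bytes; round trip ok:", restored == seq)
print(bytes_as_text / 1e9, "GB as text;", bytes_packed / 1e9, "GB packed")
```

Under these assumptions a single genome-length sequence occupies about 3 GB as plain text and about 0.75 GB when packed.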
Bioinformatics is rapidly broadening the scope of genomic research through bio-computational resources such as genomic databases, sequence alignment tools, genome browsers, single nucleotide polymorphism databases, prediction algorithms (including promoter prediction algorithms), and expression databases. Many institutions around the world, including universities and businesses, continue to deposit sequenced genetic information in these platforms, enriching the genomic repository that began with the Human Genome Project. Ashwini and Rajesh (2016) analyzed the progress made in bioinformatics and reported that the completion of the Human Genome Project established the applicability of computational approaches to biological molecules. Extensive private and public databases on genomics, RNA, proteins, ORFs, and intergenic regions have facilitated research on treating and diagnosing diseases using proteomics and DNA microarray data (Jimenez-Gutierrez et al., 2016). Latrice et al. (2018) identified the lack of adequately diverse genomic databases as a barrier to advancing precision medicine practice and research. The authors explained that molecular biomarkers are used in precision medicine to assess patients’ therapeutic responses, diagnoses, risk factors, prognosis, and health status. Analyzing two public repositories, they found that the Database of Genotypes and Phenotypes and the Genome-Wide Association Study Catalog contain less information on African, Asian, and Latin American populations than on Europeans across diseases and data types. In the past, drug discovery was a complicated, time-consuming procedure that required extensive knowledge. The development of computational approaches in bioinformatics has made drug design and discovery considerably more straightforward: various target molecules and sites can now be analyzed in a short period of time. Biological databases house proteomic, genomic, metabolomic, phylogenetic, and microarray gene expression data, which can be processed to study gene function, localization, structure, effects of mutations, and other physiological functions. Peter et al. (2017) gave an insight into human gene mutation databases for biomedical research on inherited mutations. The authors explained that these databases are comprehensive collections of germline mutations in genes closely linked to human inherited diseases. They provide information for the annotation of next-generation sequencing data to researchers, diagnostic laboratories, industry, clinicians, and genetic counsellors.
Suraj et al. (2003) reported that the Human Protein Reference Database was established to provide information about protein-protein interactions, enzyme-substrate relationships, post-translational modifications, disease associations, subcellular localization, and tissue expression, in support of research on the treatment, diagnosis, and prognosis of human diseases.

Conclusion

Recent advances in several sectors, including agriculture, health, biomedicine, and the industrial and pharmaceutical industries, have been further enhanced by the use of various bioinformatics tools and databases. Precision medicine, precision agriculture, and faster, more efficient drug discovery, design, development, and production processes have been attributed to improved databases and bioinformatics tools. The production of vaccines and enhanced crops is another notable improvement resulting from the application of these tools to research.

References Adetunji, C. O., Nwankwo, W., Olayinka, A. S., Olugbemi, O. T., Akram, M., Laila, U., Samuel, M. O., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E. and Esiobu, N. D., (2022a). Computational Intelligence Techniques for Combating COVID-19. doi: 10.1201/9781003178903-16. In book: In Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics (pp. 251-269). CRC Press. Adetunji, C. O., Nwankwo, W., Olayinka, A. S., Olugbemi, O. T., Akram, M., Laila, U., Olugbenga, M. S., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E. and Esiobu, N. D. (2022b). Machine Learning and Behaviour Modification for COVID-19. doi: 10.1201/9781003178903-17. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 17. eBook ISBN 9781003178903 Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Garipova, L. and Shariati, M. A. (2022c). eHealth, mHealth, and Telemedicine for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_10. Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Petukhova, E. and Shariati, M. A. (2022d). Machine Learning Approaches for COVID-19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_8. Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Isabekova, O. and Shariati, M. A., (2022e). Smart Sensing for COVID19 Pandemic. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. 
(eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_9. Adetunji, C. O., Inobeme, A., Tadso, J., Olaniyan, O. T., Abimbola, O. F., Shahnawaz, M., & Anani, O. (2022f). Potential of Plastic Waste in Enhancing the level of Pathogenicity of diverse Pathogens in the Marine Biota. In Impact of Plastic Waste on the Marine Biota (pp. 301-312). Springer, Singapore.

Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Petukhova, E. and Shariati, M. A. (2022g). Internet of Health Things (IoHT) for COVID-19. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S.A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_5. Adetunji, C. O., Olaniyan, O. T., Adeyomoye, O., Dare, A., Adeniyi, M. J., Alex, E., Rebezov, M., Koriagina, N. and Shariati, M. A., (2022h). Diverse Techniques Applied for Effective Diagnosis of COVID-19. In: Pani S. K., Dash S., dos Santos W. P., Chan Bukhari S. A., Flammini F. (eds) Assessing COVID-19 and Other Pandemics and Epidemics using Computational Modelling and Data Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-79753-9_3. Adetunji, C. O., Olugbemi, O. T., Akram, M., Laila, U., Samuel, M. O., Oshinjo, A. M., Adetunji, J. B., Okotie, G. E., Esiobu, N. D., Oyedara, O. O. and Adeyemi, F. M. (2022i). Application of Computational and Bioinformatics Techniques in Drug Repurposing for Effective Development of Potential Drug Candidate for the Management of COVID-19. doi: 10.1201/9781003178903-15. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition, First Published 2022, Imprint CRC Press. Pages 14. eBook ISBN 9781003178903. Adetunji, C.O., Abimbola, O.F., Singh, K.R., Olaniyan, O.T., Bodunrinde, R.E., Inobeme, A., Mathew, J.T., Singh, J. and Singh, R.P., (2022j). Microbe Performance and Dynamics in Activated Sludge Digestion. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 99-112). CRC Press. https://doi.org/10.1201/9781003354147. Adetunji, C. O., Ogundolie, F. A., Ajiboye, M. D., Mathew, J. T., Inobeme, A., Dauda, W. P., & Adetunji, J. B. (2022k).
Nano-engineered Sensors for Food Processing. In Bio- and Nano-sensing Technologies for Food Processing and Packaging (pp. 151-166). Royal Society of Chemistry. doi:10.1039/9781839167966-00151. Adetunji, C. O., Inobeme, A., Singh, K. R., Bodunrinde, R. E., Mathew, J. T., Olaniyan, O. T., Abimbola, O. F., Singh, J., Nayak, V. and Singh, R. P., (2022l) Genomic Analysis of Heavy Metal-Resistant Genes in Wastewater Treatment Plants. In Microbial Community Studies in Industrial Wastewater Treatment (pp. 113-126). CRC Press. Adetunji, Oluwaseun C., Mathew, J. T., Inobeme, A., Olaniyan, O. T., RB Singh, K., Abimbola, O. F., Nayak, V., Singh, J. & Singh, R. P. (2022m). Microbial and Plant Cell Biosensors for Environmental Monitoring. In Nanobiosensors for Environmental Monitoring (pp. 175-190). Springer, Cham. https://doi.org/10.1007/978-3-031-16106-3_9. Adetunji, C.O., Ogundolie, F.A., Mathew, J.T., Inobeme, A., Titilayo, O., Olaniyan, O.T., Ijabadeniyi, O.A., Ajiboye, M.D., Ajayi, O.O., Dauda, W. and Ghazanfar, S., (2023a). Graphene-based nanomaterials for targeted drug delivery and tissue engineering. Novel Platforms for Drug Delivery Applications, pp.277-288. https://doi.org/10.1016/B978-0-323-91376-8.00014-8 Adetunji, C.O., Ogundolie, F.A., Mathew, J.T., Inobeme, A., Titilayo, O., Olaniyan, O.T., Ghazanfar, S., Ijabadeniyi, O.A., Ajiboye, M.D., Ajayi, O.O. and Dauda, W., (2023b).


Nanotube platforms for effective drug delivery applications. Novel Platforms for Drug Delivery Applications, pp.317-332. https://doi.org/10.1016/B978-0-323-91376-8.000 05-7 Altindis, E., Cozzi, R., Di Palo, B., Necchi, F., Mishra, R. P., Fontana, M. R., Soriani, M., Bagnoli, F., Maione, D., Grandi, G. and Liberatori, S., 2015. Protectome Analysis: A New Selective Bioinformatics Tool for Bacterial Vaccine Candidate Discovery [S]. Molecular & Cellular Proteomics, 14(2), pp.418-429. Angarica, V. E., and Sol, A. D. (2017). Bioinformatics tools for genome-wide epigenetic research. Neuroepigenomics in Aging and Disease, 489-512. Ashwini Kamble and Rajesh Khairkar (2016) Basics of Bioinformatics in Biological Research. Int J. Appl Sci Biotechnol, Vol 4(4): 425-429. Ayodeji, A. O., Ogundolie, F. A., Bamidele, O. S., Kolawole, A. O., & Ajele, J. O. (2017). Raw starch degrading, acidic-thermostable glucoamylase from Aspergillus fumigatus CFU-01: purification and characterization for biotechnological application. J. Microbiol. Biotechnol., 6, 90-100. Behera, R. N., Roy, M., & Dash, S. (2016). Ensemble-based hybrid machine learning approach for sentiment classification-a review. International Journal of Computer Applications, 146(6), 31-36. Boddy, C. N. (2014). Bioinformatics tools for genome mining of polyketide and nonribosomal peptides. Journal of Industrial Microbiology and Biotechnology, 41(2), 443-450. Brian Salter and Charlotte Salter (2017) Controlling new knowledge: Genomic science, governance and the politics of bioinformatics. Social Studies of Science 2017, Vol. 47(2) 263–287. doi: 10.1177/0306312716681210. Daisuke Kihara, Yifeng David Yang and Troy Hawkins (2006) Bioinformatics resources for cancer research with an emphasis on gene function and structure prediction tools. Cancer Informatics: 2. 25– 35. doi: 10.3126/ijasbt.v4i4.16252. Dash, S., & Abraham, A. (2018). Kernel-based chaotic firefly algorithm for diagnosing Parkinson’s disease. 
In International Conference on Hybrid Intelligent Systems (pp. 176-188). Springer, Cham. Dash, S., Abraham, A., Luhach, A. K., Mizera-Pietraszko, J., & Rodrigues, J. J. (2020). Hybrid chaotic firefly decision-making model for Parkinson’s disease diagnosis. International Journal of Distributed Sensor Networks, 16(1), 1550147719895210. Dash, S., Ahmad, M., & Iqbal, T. (2021). Mobile cloud computing: a green perspective. In Intelligent Systems. vol.185, pp:523-533, Springer, Singapore. http://doi.org/10.1007/ 978-981-33-6081-5-46. Dash, S., Thulasiram, R., & Thulasiraman, P. (2017, December). An enhanced chaos-based firefly model for Parkinson’s disease diagnosis and classification. In 2017 International Conference on Information Technology (ICIT) (pp. 159-164). IEEE. Dash, S., Thulasiram, R., & Thulasiraman, P. (2019). Modified firefly algorithm with chaos theory for feature selection: A predictive model for medical data. International Journal of Swarm Intelligence Research (IJSIR), 10(2), 1-20. Dubravko Jeli, Tibor Toth and Donatella Verbanac (2003) Macromolecular Databases – A Background of Bioinformatics. Food Technol. Biotechnol. 41 (3) 269–286.

Fadare, O. A., Omisore, N. O., Adegbite, O. B., Awofisayo, O. A., Ogundolie, F. A., Adesanwo, J. K., & Obafemi, C. A. (2021). Structure based design, stability study and synthesis of the dinitrophenylhydrazone derivative of the oxidation product of lanosterol as a potential P. falciparum transketolase inhibitor and in-vivo antimalarial study. In Silico Pharmacology, 9(1), 1-16. Hong C. L., Kaitao L., Michael T. L., Michael I., Chris D. and David E. (2011) Bioinformatics tools and databases for analysis of next-generation sequence data. Briefings in Functional Genomics Advance Access published December 19, 1-13. Jeyachandran S., Kiyun P., Ihn-Sil K. (2020) Genome Databases, Types and Applications: An overview. Recent Trends in Biochemistry. 1-5. Jimenez-Gutierrez L. R., C. J. Barrios-Hernández, G. R. Pedraza-Ferreira, L. Vera-Cala and F. Martinez-Perez (2016) Importance of databases of nucleic acids for bioinformatics analysis focused to genomics. Journal of Physics: Conference Series. Workshop on Processing Physic-Chemistry Advanced – WPPCA. IOP Publishing. 743. 012009. 1-4. doi:10.1088/1742-6596/743/1/012009. Kaja M., Kristian R., and Janusz M. B. (2011) Databases and Bioinformatics Tools for the Study of DNA Repair. SAGE-Hindawi Access to Research. Molecular Biology International Volume 2011, Article ID 475718, 9 pages. doi:10.4061/2011/475718. Kumlachew Alemu (2015) The Role and Application of Bioinformatics in Plant Disease. Advances in Life Science and Technology. Vol.28, 28-33. Latrice G. Landry, Nadya Ali, David R. Williams, Heidi L. Rehm, and Vence L. Bonham (2018) Lack of Diversity in Genomic Databases Is A Barrier To Translating Precision Medicine Research Into Practice. Health Affairs 37, NO. 5 (2018): 780–785. doi: 10.1377/hlthaff.2017.1595. Lei J., Lingjuan X., Sangting L., Qian-Hao Z., Longjiang F. (2021) Rice bioinformatics in the genomic era: Status and perspectives.
The Crop Journal. (2021) 609–621. https://doi.org/10.1016/j.cj.2021.03.003. Li, K., Du, Y., Li, L., & Wei, D. Q. (2020). Bioinformatics approaches for anti-cancer drug discovery. Current Drug Targets, 21(1), 3-17. Lincoln D. Stein, Christopher Mungall, ShengQiang Shu, Michael Caudy, Marco Mangone, Allen Day, Elizabeth Nickerson, Jason E. Stajich, Todd W. Harris, Adrian Arva, and Suzanna Lewis (2002). The Generic Genome Browser: A Building Block for a Model Organism System Database. Cold Spring Harbor Laboratory. 12:1599–1610. Manzoni, C., Kia, D. A., Vandrovcova, J., Hardy, J., Wood, N. W., Lewis, P. A., & Ferrari, R. (2018). Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences. Briefings in bioinformatics, 19(2), 286-302. María, R. R., Arturo, C. J., Alicia, J. A., Paulina, M. G., & Gerardo, A. O. (2017). The impact of bioinformatics on vaccine design and development. Vaccines, 2, 3-6. Mwololo, J. K., Munyua, J. K., Muturi, P. W.and Munyiri, S. W. (2010) An overview of advances in bioinformatics and its application in functional genomics. Journal of Animal & Plant Sciences. Vol. 6, Issue 3: 645- 652. Ogundolie, F. A. (2015). Characterization of a Purified β–Amylase from Black Marble Vine (Dioclea reflexa) Seeds (Masters dissertation, Federal University of Technology, Akure).


Ogundolie, F. A. (2021). Cloning of α-AMYLASE and Pullulanase Genes of Bacillus licheniformis-FAO. CP7 from Cocoa (Theobroma cacao L.) Pods and Biochemical Characterization of the Expressed Enzymes (Doctoral dissertation, Federal University of Technology, Akure). Ogundolie, F. A., Ayodeji, A. O., Olajuyigbe, F. M., Kolawole, A. O., & Ajele, J. O. (2022). Biochemical Insights into the functionality of a novel thermostable β-amylase from Dioclea reflexa. Biocatalysis and Agricultural Biotechnology, 42, 102361. Olaniyan Olugbemi T., Adetunji Charles O., Adeniyi Mayowa J., Hefft Daniel Ingo. 2022a. Machine Learning Techniques for High-Performance Computing for IoT Applications in Healthcare. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics doi: 10.1201/9780367548445-20 Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445. Olaniyan Olugbemi T., Adetunji Charles O., Adeniyi Mayowa J., Hefft Daniel Ingo. In. Computational Intelligence in IoT Healthcare. 2022b. In book: Deep Learning, Machine Learning and IoT in Biomedical and Health Informatics doi: 10.1201/ 9780367548445-19. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 13. eBook ISBN 9780367548445. Omotayo Opemipo Oyedara, Folasade Muibat Adeyemi, Charles Oluwaseun Adetunji, Temidayo Oluyomi Elufisan. 2022. Repositioning Antiviral Drugs as a Rapid and Cost-Effective Approach to Discover Treatment against SARS-CoV-2 Infection. doi: 10.1201/9781003178903-10. In book: Medical Biotechnology, Biopharmaceutics, Forensic Science and Bioinformatics. Edition 1st Edition. First Published 2022. Imprint CRC Press. Pages 12. eBook ISBN 9781003178903. Orton R. J., Q. Gu, J. Hughes, M. Maabar, S. Modha, S.B. Vattipally, G.S. Wilkie & A.J. Davison (2016) Bioinformatics tools for analysing viral genomic data. Rev. Sci. Tech. Off. Int. Epiz., 2016, 35 (1), 271-285. Rahman, A. U., Dash, S., & Luhach, A. K. (2021). 
Dynamic MODCOD and power allocation in DVB-S2: a hybrid intelligent approach. Telecommunication Systems, 76(1), 49-61. Rahman, A., Sultan, K., Dash, S., & Khan, M. A. (2018). Management of resource usage in mobile cloud computing. Int. J. Pure Appl. Math., 119(16), 255-261. Ramharack, P., & Soliman, M. (2018). Bioinformatics-based tools in drug discovery: the cartography from single gene to integrative biological networks. Drug discovery today, 23(9), 1658–1665. https://doi.org/10.1016/j.drudis.2018.05.041. Rastogi, S. C., Rastogi, P., & Mendiratta, N. (2022). Bioinformatics: Methods and Applications-Genomics, Proteomics and Drug Discovery. PHI Learning Pvt. Ltd. Roumpeka D. D., Wallace R. J., Escalettes F., Fotheringham I. and Watson M. (2017) A Review of Bioinformatics Tools for Bio-Prospecting from Metagenomic Sequence Data. Front. Genet. 8:23. doi: 10.3389/fgene.2017.00023. Sahu, B., Dash, S., Mohanty, S. N., & Rout, S. K. (2018). Ensemble comparative study for the diagnosis of breast cancer datasets. International Journal of Engineering & Technology, 7(4.15), 281-285. Sayers, Eric W., Jeffrey Beck, Evan E. Bolton, Devon Bourexis, James R. Brister, Kathi Canese, Donald C. Comeau, Kathryn Funk, Sunghwan Kim, William Klimke, Aron Marchler-Bauer, Melissa Landrum, Stacy Lathrop, Zhiyong Lu, Thomas L. Madden,

Nuala O’Leary, Lon Phan, Sanjida H. Rangwala, Valerie A. Schneider, Yuri Skripchenko, Jiyao Wang, Jian Ye, Barton W. Trawick, Kim D. Pruitt and Stephen T. Sherry (2021) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 2021, Vol. 49, Database issue.10-17. doi: 10.1093/nar/gkaa892. Seung Yon Rhee, Julie Dickerson, and Dong Xu (2006) Bioinformatics and its Applications in Plant Biology. Annu. Rev. Plant Biol. 57:335–59. doi: 10.1146/annurev.arplant.56.032604.144103. Singh Himanshu (2016) Bioinformatics: Benefits to Mankind. International Journal of Pharm Tech Research. Vol.9, No.4, Pp. 242-248. Soria-Guerra, R. E., Nieto-Gomez, R., Govea-Alonso, D. O., & Rosales-Mendoza, S. (2015). An overview of bioinformatics tools for epitope prediction: implications on vaccine development. Journal of biomedical informatics, 53, 405-414. Sridhar A. Malkaram, Yousef I. Hassan, and Janos Zempleni (2012) Online Tools for Bioinformatics Analyses in Nutrition Sciences. American Society for Nutrition. Adv. Nutr. 3: 654–665, 2012; doi:10.3945/an.112.002477. Stenson, Peter D., Matthew Mort, Edward V. Ball, Katy Evans, Matthew Hayden, Sally Heywood, Michelle Hussain, Andrew D. Phillips, David N. Cooper (2017) The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next‑generation sequencing studies. Hum. Genet. (2017) 136:665–677. doi 10.1007/s00439-017-1779-6. Suraj Peri, J. Daniel Navarro, Ramars Amanchy, Troels Z. Kristiansen, Chandra Kiran Jonnalagadda, Vineeth Surendranath, Vidya Niranjan, Babylakshmi Muthusamy, T. K. B. Gandhi, Mads Gronborg, Nieves Ibarrola, Nandan Deshpande, K. Shanker, H. N. Shivashankar, B. P. Rashmi, M. A. Ramya, Zhixing Zhao, K. N. Chandrika, N. Padma, H. C. Harsha, A. J. Yatish, M. P.
Kavitha, Minal Menezes, Dipanwita Roy Choudhury, Shubha Suresh, Neelanjana Ghosh, R. Saravana, Sreenath Chandran, Subhalakshmi Krishna, Mary Joy, Sanjeev K. Anand, V. Madavan, Ansamma Joseph, Guang W. Wong, William P. Schiemann, Stefan N. Constantinescu, Lily Huang, Roya Khosravi-Far, Hanno Steen, Muneesh Tewari, Saghi Ghaffari, Gerard C. Blobe, Chi V. Dang, Joe G. N. Garcia, Jonathan Pevsner, Ole N. Jensen, Peter Roepstorff, Krishna S. Deshpande, Arul M. Chinnaiyan, Ada Hamosh, Aravinda Chakravarti, and Akhilesh Pandey (2003) Development of Human Protein Reference Database as an Initial Platform for Approaching Systems Biology in Humans. Cold Spring Harbor Laboratory. 13: 2363–2371. http://www.genome.org/cgi/doi/10.1101/gr.1680803. Suresh B. V., Roy R., Sahu K., Misra G., Chattopadhyay D. (2014) Tomato Genomic Resources Database: An Integrated Repository of Useful Tomato Genomic Information for Basic and Applied Research. PLoS ONE 9(1): e86387. doi:10.1371/ journal.pone.0086387. Teufel, A., Krupp, M., Weinmann, A., and Galle, P. R. (2006). Current bioinformatics tools in genomic biomedical research. International journal of molecular medicine, 17(6), 967-973. Holton, T. A., Vijayakumar, V., & Khaldi, N. (2013). Bioinformatics: Current perspectives and future directions for food and nutritional research facilitated by a Food-Wiki database. Trends in food science & technology, 34(1), 5-17.


Wishart, D. S. (2005). Bioinformatics in drug development and assessment. Drug metabolism reviews, 37(2), 279-310. Xia X. (2017). Bioinformatics and Drug Discovery. Current topics in medicinal chemistry, 17(15), 1709–1726. https://doi.org/10.2174/1568026617666161116143440. Xiaokang Zhang, Inge Jonassen, Anders Goksøyr (2021) Machine Learning Approaches for Biomarker Discovery Using Gene Expression Data. In: Bioinformatics. Nakaya HI (Editor). Exon Publications, Brisbane, Australia. 53-64. doi: https://doi.org/10.36255/exonpublications.bioinformatics.2021.

Chapter 14

Computer Viruses and Their Defences in Computer Networks: An e-Epidemiological Model

Yerra Shankar Rao1, Binayak Dihudi2, Subash Chandra Mishra3, Ranjita Rath4 and Tarini Charan Panda5

1Department of Mathematics, NIST Institute of Science and Technology (Autonomous), Institute Park, Pallur Hills, Berhampur, Odisha, India
2Department of Mathematics, Konark Institute of Science and Technology, Jatni, Bhubaneswar, Odisha, India
3Department of Electrical Engineering, Einstein Academy of Technology and Management, Khorda, Odisha, India
4Department of Mathematics, Pendrani Mahavidyalaya College, Umerkote, Odisha, India
5Department of Mathematics, Ravenshaw University, Cuttack, Odisha, India

Abstract

With the rapid advancement of computer, communication and network technologies, network information systems have facilitated the industrial development of the country. In an era of information sharing, information security has emerged as one of the most significant and difficult topics to address. Virus attacks are among the typical information security threats. As a security measure, we investigate an e-SVIRS fuzzy model of virus attack in this chapter. We construct fuzzy membership functions for two parameters, the transmission rate and the recovery rate following infection. The computer virus load is treated as a function of infection in the network system. We discuss the stability of the model at the virus-free equilibrium point under malware loads. Both the classical and the fuzzy basic reproduction numbers are calculated. The fuzzy basic reproduction number is helpful for controlling the virus in the network. These phenomena can be studied by comparing the basic reproduction numbers of the two systems. Comparing the crisp and fuzzy models shows that the fuzzy model is reasonably good and flexible. According to the simulation results, vaccination has a considerable impact on slowing or stopping the propagation of viruses in a computer network. The simulated results explain the behaviour of the nodes in reducing infection of the computer network, and reflect the positive impact of antivirus software on the transmission of malicious objects (viruses) on the network.

Corresponding Author’s Email: [email protected].

In: Advances in Bioinformatics and Big Data Analytics Editors: Sujata Dash, Hrudayanath Thatoi, Subhendu Kumar Pani et al. ISBN: 979-8-88697-693-9 © 2023 Nova Science Publishers, Inc.

Keywords: vaccination, fuzzy basic reproduction number, stability, computer virus

Introduction

Computer virus detection is the result of a variety of processes aimed at distinguishing computer viruses from normal files. Machine learning techniques are commonly deployed for this purpose. According to statistics, the number of malicious code attacks is increasing every day, necessitating the development of robust detection tools. Designers of computer viruses employ a variety of strategies that are difficult to understand and detect. Static approaches do not work well in situations where attackers constantly change their tactics, so the current focus is on dynamic methods that can detect zero-day computer viruses. The rise in malicious threats such as computer virus activity must be dealt with and closely monitored to ensure a robust defence of the security domain.

The transmission of malicious code through a computer network is similar to the spread of biological diseases and epidemics in nature (Kermack, W. O. et al., 1927, 1932, 1933). Epidemic systems, particularly those dealing with infectious diseases, are highly nonlinear and require various approaches. In an epidemic, the susceptible and infectious nodes in the system vary proportionally, following the nonlinear law of mass action (Alexander M. E. et al., 2004; Barros L. C. et al., 1997; Farahi M. H. et al., 2011; Klir G. L. et al., 1995; Mishra Bimal Kumar et al., 2010; Picqueria, J. R. C., 2009). Fuzzy analysis of susceptibility and infectiousness is a better option than classical (crisp) analysis. Networks are always vulnerable to the transmission of malicious code. This can be overcome through mathematical modelling based on the state variables and parameters involved in the model. In this chapter, the theory of fuzzy sets, which is a development of crisp sets, is discussed, along with the methods of fuzzy calculation. We employ fuzzy logic in this model to describe the propagation of harmful code more precisely. Recently, many authors have focused on controlling malicious code and on antivirus techniques in different mathematical models of network systems in the cyber world (Aswin Kumar Rauta et al., 2015; Chen, T. et al., 2006; Hemraj Saini et al., 2012; Mondal P. K. et al., 2015; Prasant Ku Nayak et al., 2017; Picqueria, J. R. C., 2009; Yerra Shankar Rao et al., 2016, 2017, 2019; Zou, C. C. et al., 2003).

The rest of this chapter is organized as follows: Sections 2 and 3 present a simple SIVR model, Section 4 presents a fuzzy SIVR model, Section 5 presents the computation of the basic reproduction number, and Section 6 compares the fuzzy and classical basic reproduction numbers. Sections 7 and 8 present the control strategy for malicious code, summarize the work, and discuss the simulated results.

Simple SIVR Model

The general SIVRS model describes how malware propagates among interacting nodes in the network. The total node population is divided into four classes: susceptible, infected, vaccinated and recovered. The mortality rate of nodes due to other attacks is not considered in this model. The model is represented diagrammatically in Figure 1.

Figure 1. Block diagram of the model in computer network.

Virus transmission in the computer network is governed by the following system of nonlinear ordinary differential equations (ODEs):

\[
\begin{aligned}
\frac{dS(t)}{dt} &= \gamma R(t) - \beta S(t) I(t) \\
\frac{dI(t)}{dt} &= -\delta I(t) + \beta S(t) I(t) \\
\frac{dV(t)}{dt} &= -\eta V(t) + \delta I(t) \\
\frac{dR(t)}{dt} &= -\gamma R(t) + \eta V(t)
\end{aligned}
\tag{1}
\]

where

\[
S(t) + I(t) + V(t) + R(t) = 1.
\]

S(t), I(t), V(t) and R(t) denote the proportions of susceptible, infected, vaccinated and recovered nodes, respectively. Similarly, β is the contact rate, γ is the rate at which recovered nodes become susceptible again, δ is the rate at which infected nodes move to the vaccinated class, and η is the rate at which vaccinated nodes recover after infection. We extend the SIR model to cope with heterogeneity by assuming that nodes carry different amounts of virus, which contribute to the propagation of the virus code.
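To make the dynamics of system (1) concrete, the following is a minimal sketch that integrates the four equations with a classical fourth-order Runge-Kutta step, using only the Python standard library. The parameter values, initial condition, step size and function names are illustrative assumptions and are not taken from the chapter.

```python
def sivrs_rhs(state, beta, gamma, delta, eta):
    """Right-hand side of system (1) for state = (S, I, V, R)."""
    s, i, v, r = state
    return (
        gamma * r - beta * s * i,    # dS/dt
        -delta * i + beta * s * i,   # dI/dt
        -eta * v + delta * i,        # dV/dt
        -gamma * r + eta * v,        # dR/dt
    )


def rk4_step(state, dt, **params):
    """Advance the state by one classical Runge-Kutta (RK4) step."""
    def shifted(base, slope, h):
        return tuple(b + h * k for b, k in zip(base, slope))

    k1 = sivrs_rhs(state, **params)
    k2 = sivrs_rhs(shifted(state, k1, dt / 2), **params)
    k3 = sivrs_rhs(shifted(state, k2, dt / 2), **params)
    k4 = sivrs_rhs(shifted(state, k3, dt), **params)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(state, steps, dt, **params):
    """Return the trajectory as a list of (S, I, V, R) tuples."""
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(state, dt, **params)
        trajectory.append(state)
    return trajectory


# Illustrative values only (assumptions, not values from the chapter).
PARAMS = dict(beta=0.6, gamma=0.05, delta=0.2, eta=0.1)
trajectory = simulate((0.95, 0.05, 0.0, 0.0), steps=2000, dt=0.05, **PARAMS)
final_s, final_i, final_v, final_r = trajectory[-1]
print(f"S={final_s:.3f} I={final_i:.3f} V={final_v:.3f} R={final_r:.3f}")
```

Because the four right-hand sides of system (1) sum to zero, the total S + I + V + R should stay at 1 throughout the run, which gives a quick sanity check on any implementation.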

SIVR Fuzzy Model

We postulate that the malware load of infected nodes determines the heterogeneity of the population: the larger the load, the greater the likelihood that malicious code propagates. We therefore take the transmission rate to be a function β = β(θ) of the amount of virus θ, so that the chance of transmission when a susceptible node meets an infectious node depends on θ. Values of β defined in this way are more realistic than those of other epidemic models. In fuzzy logic, this dependence is expressed through a membership function.

To construct the membership function β(θ), we assume that when the number of malicious codes in a node is small, the chance of spreading them is negligible, and that a minimum amount θmin is required for transmission to occur. We also assume that at a certain amount θM the chance of transmission is maximal and equal to one. Finally, the amount of virus in the computer network is bounded above by θmax. The resulting membership function is given in equation (2) and shown in Figure 2.

The rate δ associated with the infectious class is likewise treated as a function δ(θ) of the malware load: the larger the malware load θ of the infected nodes, the larger this rate. Hence δ(θ) is defined as

\[
\delta(\theta) = \frac{1 - \delta_0}{\theta_{\max}}\,\theta + \delta_0,
\]

where δ0 is the lowest value of this rate.

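The transmission coefficient itself is given by

$$\beta(\theta) = \begin{cases} 0, & \text{if } \theta < \theta_{\min} \\ \dfrac{\theta - \theta_{\min}}{\theta_M - \theta_{\min}}, & \text{if } \theta_{\min} \le \theta \le \theta_M \\ 1, & \text{if } \theta_M < \theta < \theta_{\max} \end{cases} \tag{2}$$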

Figure 2. Fuzzy malicious code transmission coefficient β=β(ϴ).

δ(ϴ) is the rate at which nodes carrying malware load ϴ are vaccinated. Because the nodes that broadcast the most malware are the most likely to be vaccinated, δ is an increasing function of ϴ. Increasing δ = δ(ϴ) therefore reduces the transmission of the virus. The function is illustrated in Figure 3 and defined as

$$\delta(\theta) = \frac{1 - \delta_0}{\theta_{\max}}\,\theta + \delta_0,$$

where δ0 > 0 is the smallest vaccination rate. Similarly, η = η(ϴ) is the rate at which vaccinated nodes with malicious code load ϴ recover. The more vaccinated nodes there are in the network, the higher the recovery rate. Hence the recovery rate η is a function of ϴ, defined as

$$\eta(\theta) = \frac{1 - \eta_0}{\theta_{\max}}\,\theta + \eta_0,$$

where η0 > 0 is the minimum recovery rate of the vaccinated nodes.

Figure 3. Fuzzy rate in infectious class to vaccinated class δ=δ(ϴ).

Figure 4. Fuzzy rate in vaccinated class to recovered node η=η(ϴ).

Likewise, γ = γ(ϴ) is the rate at which recovered nodes lose their immunity, as a function of ϴ. The faster recovered nodes lose immunity, the higher the rate at which nodes become susceptible again. This membership function is defined as follows:

$$\gamma(\theta) = \frac{1 - \gamma_0}{\theta_{\max}}\,\theta + \gamma_0,$$

where γ0 > 0 is the lowest rate at which recovered nodes become susceptible.

Figure 5. Fuzzy rate of loss of immunity γ=γ(ϴ).
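The four fuzzy rates can be evaluated directly once numbers are chosen. The following Python sketch is only an illustration: the values of θ_min, θ_M, θ_max, δ0, η0 and γ0 are hypothetical (the chapter gives none), and each linear rate is assumed to rise from its minimum at ϴ = 0 to 1 at ϴ = θ_max.

```python
# Illustrative parameter values only; the chapter does not specify numbers.
THETA_MIN, THETA_M, THETA_MAX = 1.0, 5.0, 10.0
DELTA_0, ETA_0, GAMMA_0 = 0.2, 0.3, 0.1


def beta(theta):
    """Transmission coefficient of Eq. (2): 0 below THETA_MIN, 1 above THETA_M."""
    if theta < THETA_MIN:
        return 0.0
    if theta <= THETA_M:
        return (theta - THETA_MIN) / (THETA_M - THETA_MIN)
    return 1.0  # assumed valid for loads up to THETA_MAX


def linear_rate(theta, floor):
    """Linear rate equal to `floor` at theta = 0 and to 1 at theta = THETA_MAX."""
    return (1.0 - floor) / THETA_MAX * theta + floor


def delta(theta):
    """Rate from the infectious class to the vaccinated class."""
    return linear_rate(theta, DELTA_0)


def eta(theta):
    """Rate from the vaccinated class to the recovered class."""
    return linear_rate(theta, ETA_0)


def gamma(theta):
    """Rate of loss of immunity (recovered back to susceptible)."""
    return linear_rate(theta, GAMMA_0)
```

With these assumed thresholds, a load halfway between θ_min and θ_M gives `beta(3.0) == 0.5`, while all three linear rates reach 1 at the maximum load.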

Solution of Equilibrium Points

Equilibrium points are the states at which the variables do not change with time. Studying the stability of the equilibrium points tells us whether the virus persists among the infected nodes of the network. Since R(t) = 1 − S(t) − I(t) − V(t), system (1) can be reduced to

$$\begin{aligned} \frac{dS(t)}{dt} &= \gamma\,\bigl(1 - S(t) - I(t) - V(t)\bigr) - \beta S(t) I(t) \\ \frac{dI(t)}{dt} &= -\delta I(t) + \beta S(t) I(t) \\ \frac{dV(t)}{dt} &= -\eta V(t) + \delta I(t) \end{aligned} \tag{3}$$

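For a steady-state equilibrium point,

$$\frac{dS(t)}{dt} = 0, \quad \frac{dI(t)}{dt} = 0, \quad \frac{dV(t)}{dt} = 0.$$

Solving system (3) simultaneously gives the endemic equilibrium point

$$(S^*, I^*, V^*) = \left( \frac{\delta}{\beta},\ \frac{\beta\gamma\eta - \gamma\delta\eta}{\beta\gamma\eta + \beta\gamma\delta + \beta\delta\eta},\ \frac{\delta\,(\beta\gamma\eta - \gamma\delta\eta)}{(\beta\gamma\eta + \beta\gamma\delta + \beta\delta\eta)\,\eta} \right)$$

and the virus-free equilibrium $(S, I, V) = (1, 0, 0)$.

Under fuzziness we consider

$$\beta = \beta(\theta), \quad \gamma = \gamma(\theta), \quad \delta = \delta(\theta), \quad \eta = \eta(\theta),$$

for which the fuzzy virus-free equilibrium is again $(1, 0, 0)$.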

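The fuzzy endemic equilibrium point is

$$(S^*, I^*, V^*) = \left( \frac{\delta(\theta)}{\beta(\theta)},\ \frac{\beta(\theta)\gamma(\theta)\eta(\theta) - \gamma(\theta)\delta(\theta)\eta(\theta)}{\beta(\theta)\gamma(\theta)\eta(\theta) + \beta(\theta)\gamma(\theta)\delta(\theta) + \beta(\theta)\delta(\theta)\eta(\theta)},\ \frac{\delta(\theta)\bigl(\beta(\theta)\gamma(\theta)\eta(\theta) - \gamma(\theta)\delta(\theta)\eta(\theta)\bigr)}{\bigl(\beta(\theta)\gamma(\theta)\eta(\theta) + \beta(\theta)\gamma(\theta)\delta(\theta) + \beta(\theta)\delta(\theta)\eta(\theta)\bigr)\,\eta(\theta)} \right)$$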
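As a quick numerical check of the equilibrium points, the Python sketch below evaluates the right-hand side of the reduced system (3) at the endemic point and at the point (1, 0, 0) where no node is infected. It assumes that system (3) reads dS/dt = γ(1 − S − I − V) − βSI, dI/dt = −δI + βSI, dV/dt = −ηV + δI; the rate values are hypothetical and chosen only so that β > δ.

```python
def rhs(S, I, V, beta, gamma, delta, eta):
    """Right-hand side of the reduced system (3), as read from the text."""
    dS = gamma * (1.0 - S - I - V) - beta * S * I
    dI = -delta * I + beta * S * I
    dV = -eta * V + delta * I
    return dS, dI, dV


def endemic_equilibrium(beta, gamma, delta, eta):
    """Closed-form endemic point; all coordinates are positive only if beta > delta."""
    denom = beta * (gamma * eta + gamma * delta + delta * eta)
    S = delta / beta
    I = gamma * eta * (beta - delta) / denom
    V = gamma * delta * (beta - delta) / denom
    return S, I, V


# Hypothetical rates (not from the chapter), with beta > delta.
params = dict(beta=0.8, gamma=0.1, delta=0.4, eta=0.3)

point = endemic_equilibrium(**params)
residual = rhs(*point, **params)                 # should vanish at the endemic point
virus_free_residual = rhs(1.0, 0.0, 0.0, **params)  # should vanish at (1, 0, 0)
```

Both residuals vanish up to rounding error, which confirms that the two points are equilibria of system (3) under the stated reading of the equations.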





It is clear that the region

$$\Psi = \{(S, I, V, R) \in \mathbb{R}_+^4 : S + I + V + R = 1\}$$

is positively invariant for system (1), and that

$$K = \{(S, I, V) \in \mathbb{R}_+^3 : S + I + V \le 1\}$$

is a positively invariant set of system (3).

In the computer network, different nodes are assumed to carry different amounts of malicious code. These amounts are described by a triangular fuzzy number with membership function

$$\sigma(\theta) = \begin{cases} 1 - \dfrac{|\theta - \bar{\theta}|}{\alpha}, & \text{if } \bar{\theta} - \alpha \le \theta \le \bar{\theta} + \alpha \\ 0, & \text{otherwise.} \end{cases}$$

Here $\bar{\theta}$ is the central value of the fuzzy set and α is its dispersion. For a fixed value of $\bar{\theta}$, σ(ϴ) is treated as a linguistic variable with meanings such as low, medium and high.

Figure 6. Membership functions of the variable ϴ.
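The triangular membership function is simple to compute. The sketch below is a minimal Python version in which `center` stands for the central value and `spread` for the dispersion α; both argument names are introduced here for illustration only.

```python
def sigma(theta, center, spread):
    """Triangular membership grade of the malware load `theta`.

    The grade is 1 at `center`, falls linearly to 0 at `center - spread`
    and `center + spread`, and is 0 everywhere outside that interval.
    """
    if spread <= 0:
        raise ValueError("spread must be positive")
    return max(0.0, 1.0 - abs(theta - center) / spread)
```

For example, with a central value of 3 and a dispersion of 1, a load of 3.5 has membership grade 0.5 and any load of 4 or more has grade 0.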

Basic Reproduction Number R0

The basic reproduction number R0 is the expected number of secondary infections produced by a single infected node introduced into a wholly susceptible population of nodes. Calculating it for the present model gives

$$R_0 = \frac{\beta}{\delta}.$$

Under fuzziness, with

$$\beta = \beta(\theta), \quad \gamma = \gamma(\theta), \quad \delta = \delta(\theta), \quad \eta = \eta(\theta),$$

the basic reproduction number can be represented as R0(ϴ) = β(ϴ)/δ(ϴ). From the above, to minimize the virus in the network we should impose the condition max{R0(ϴ)} < 1. We define the fuzzy basic reproduction number as

$$R_0^f = \frac{1}{\delta_0}\,\mathrm{FEV}\left[\delta_0 R_0(\theta)\right],$$

where FEV and δ0 > 0 denote the fuzzy expected value and the minimum vaccination rate of infectious nodes, respectively. R0^f is well defined because, even when R0(ϴ) > 1, we have δ0 R0(ϴ) ≤ 1. Hence R0^f, the basic reproduction number of the fuzzy system, is the average number of secondary cases caused by an infected node introduced into a vulnerable population of nodes. To compute FEV[δ0 R0(ϴ)] we need a fuzzy measure, and we take the possibility measure

$$\mu(Z) = \sup_{\theta \in Z} \sigma(\theta), \quad Z \subset \mathbb{R}.$$

This indicates that the infectivity of the system is greatest when the most virus enters the network group. The linguistic meaning of R0^f, which is taken to reflect the amount of malicious code on the network, is low, medium, or high. Accordingly, the membership function σ(ϴ) of the fuzzy set falls into three cases:

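If $\bar{\theta} + \alpha < \theta_{\min}$, then σ(ϴ) is low.

If $\bar{\theta} - \alpha > \theta_{\min}$ and $\bar{\theta} + \alpha < \theta_M$, then σ(ϴ) is medium.

If $\bar{\theta} - \alpha > \theta_M$, then σ(ϴ) is high.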

Comparison of R0 Versus R0^f

In this section we interpret the three cases discussed above for the amount of virus load in the system. In every case we have

$$\frac{\beta(\bar{\theta})}{\delta(\bar{\theta})} \le \frac{1}{\delta_0}\,\mathrm{FEV}\left[\delta_0 R_0(\theta)\right] \le \frac{\beta(\bar{\theta} + \alpha)}{\delta(\bar{\theta} + \alpha)},$$

i.e.,

$$R_0(\bar{\theta}) \le R_0^f \le R_0(\bar{\theta} + \alpha).$$

Since R0(ϴ) = β(ϴ)/δ(ϴ) is continuous and increasing, the intermediate value theorem guarantees that there exists only one ϴ′ with

$$\bar{\theta} < \theta' < \bar{\theta} + \alpha : \quad R_0^f = R_0(\theta') > R_0(\bar{\theta}).$$

This implies that R0 (classical) and R0^f (fuzzy) coincide when an amount ϴ′ of malware code is loaded in the network. Also, the number of secondary cases caused by the medium amount of malicious code, $R_0(\bar{\theta})$, is less than the average number of secondary cases R0^f.
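The comparison can be illustrated numerically. The Python sketch below assumes that the fuzzy expected value is the Sugeno integral, FEV[f] = sup over a in [0, 1] of min(a, μ{ϴ : f(ϴ) ≥ a}), with the sup-based measure μ built from the triangular membership function. All numbers are hypothetical, and the strict inequalities in `linguistic_label` are an assumption about the three cases.

```python
THETA_MIN, THETA_M, THETA_MAX = 1.0, 5.0, 10.0  # hypothetical thresholds
DELTA_0 = 0.2                                    # hypothetical minimum rate
CENTER, SPREAD = 3.0, 1.0                        # hypothetical "medium" load


def beta(theta):
    """Piecewise-linear transmission coefficient of Eq. (2)."""
    if theta < THETA_MIN:
        return 0.0
    if theta <= THETA_M:
        return (theta - THETA_MIN) / (THETA_M - THETA_MIN)
    return 1.0


def delta(theta):
    """Linear rate rising from DELTA_0 at theta = 0 to 1 at THETA_MAX."""
    return (1.0 - DELTA_0) / THETA_MAX * theta + DELTA_0


def r0(theta):
    """Load-dependent reproduction number beta(theta) / delta(theta)."""
    return beta(theta) / delta(theta)


def sigma(theta):
    """Triangular membership centred at CENTER with dispersion SPREAD."""
    return max(0.0, 1.0 - abs(theta - CENTER) / SPREAD)


def linguistic_label(center, spread):
    """Classify a triangular load as low, medium or high (None if it straddles)."""
    if center + spread < THETA_MIN:
        return "low"
    if center - spread > THETA_MIN and center + spread < THETA_M:
        return "medium"
    if center - spread > THETA_M:
        return "high"
    return None


def fuzzy_expected_value(f, membership, lo, hi, n=10001):
    """Grid approximation of sup_a min(a, mu{f >= a}), mu(Z) = sup of membership on Z."""
    step = (hi - lo) / (n - 1)
    samples = [(f(lo + i * step), membership(lo + i * step)) for i in range(n)]
    samples.sort(key=lambda pair: pair[0], reverse=True)  # highest level first
    best = running_sup = 0.0
    for level, grade in samples:
        running_sup = max(running_sup, grade)  # mu{f >= level} on the grid
        best = max(best, min(level, running_sup))
    return best


def fuzzy_r0():
    """R0^f = FEV[DELTA_0 * R0(theta)] / DELTA_0 under the assumptions above."""
    fev = fuzzy_expected_value(lambda t: DELTA_0 * r0(t), sigma, 0.0, THETA_MAX)
    return fev / DELTA_0
```

With these assumed values the load is classified as medium, and the computed R0^f lies between R0 at the central value and R0 at the central value plus the dispersion, as the inequality above states.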

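Local Stability for Virus Free Equilibrium

Theorem 1: If R0 < 1 [...] > 1, the variation matrix becomes

$$V^{**} = \begin{pmatrix} -\beta S^* I^* & -\beta S^* & 0 & \gamma \\ \beta S^* I^* & \beta S^* - \delta & 0 & 0 \\ 0 & \delta & -\eta & 0 \\ 0 & 0 & \eta & -\gamma \end{pmatrix}$$

The characteristic equation becomes

$$\begin{vmatrix} -\beta S^* I^* - \lambda & -\beta S^* & 0 & \gamma \\ \beta S^* I^* & (\beta S^* - \delta) - \lambda & 0 & 0 \\ 0 & \delta & -\eta - \lambda & 0 \\ 0 & 0 & \eta & -\gamma - \lambda \end{vmatrix} = 0$$

All the eigenvalues can be calculated by solving the fourth-degree polynomial equation

$$\lambda^4 + A\lambda^3 + B\lambda^2 + C\lambda + D = 0,$$

where

$$A = \beta S^* I^* + \eta + \gamma - (\beta S^* - \delta)$$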

$$B = (-\beta S^* + \delta)\,\beta S^* I^* + \eta\,\beta S^* I^* + \gamma\,\beta S^* I^* - \eta\,(\beta S^* - \delta) - \gamma\,(\beta S^* - \delta) + \beta S^* \cdot \beta S^* I^*$$

$$C = (-\beta S^* + \delta)\,\eta\,\beta S^* I^* + (-\beta S^* + \delta)\,\gamma\,\beta S^* I^* + \eta\gamma\,\beta S^* I^* + \eta\gamma\,(-\beta S^* + \delta) - \eta\,\beta S^* \cdot \beta S^* I^* - \gamma\,\beta S^* \cdot \beta S^* I^*$$

$$D = (-\beta S^* + \delta)\,\eta\gamma\,\beta S^* I^* + \eta\gamma\,\beta S^* \cdot \beta S^* I^* + \delta\eta\gamma\,\beta S^* I^*$$

By the Routh-Hurwitz criterion, all the eigenvalues have negative real parts, and the system is stable, for $ABC > C^2 + A^2 D$. Hence system (1) is locally asymptotically stable if R0 > 1.
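The Routh-Hurwitz test for a fourth-degree polynomial is easy to automate. The sketch below checks the condition ABC > C² + A²D quoted above together with the sign conditions A > 0, C > 0 and D > 0, which are the remaining standard Routh-Hurwitz requirements for a quartic and are added here as an assumption, since the text states only the product condition. The function name is hypothetical.

```python
def routh_hurwitz_quartic(A, B, C, D):
    """Return True when every root of x**4 + A*x**3 + B*x**2 + C*x + D
    has a negative real part, according to the Routh-Hurwitz criterion."""
    signs_ok = A > 0 and C > 0 and D > 0
    product_ok = A * B * C > C ** 2 + A ** 2 * D
    return signs_ok and product_ok


# (x + 1)**4 = x**4 + 4x**3 + 6x**2 + 4x + 1 has all roots at -1.
stable_example = routh_hurwitz_quartic(4, 6, 4, 1)

# x**4 + x**3 + x**2 + x + 1 has roots on the unit circle, two of them
# with positive real part, so the test must fail.
unstable_example = routh_hurwitz_quartic(1, 1, 1, 1)
```

In practice the coefficients A, B, C and D would be evaluated at the equilibrium of interest before calling the function.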

Global Stability Analysis

Theorem 2: If R0 < 1 [...] > 1, then the endemic equilibrium is globally asymptotically stable in the region.

Proof: For deriving the global stability analysis for system (1) at virus free equilibrium when R0