GeNeDis 2020: Computational Biology and Bioinformatics (Advances in Experimental Medicine and Biology, 1338) 3030787745, 9783030787745



English Pages 303 [283] Year 2021


Table of contents :
Preface
Acknowledgment
Contents
Editor Biography
Digging for Significant Genes in Microarray Expression Data Based on Systematic Sampling and Hierarchal Clustering Algorithm
1 Introduction
2 Methodology
2.1 Hierarchical Clustering Algorithm
2.2 The Systematic Sampling
3 Data Sets
3.1 Adjusted Rand Index (ARI)
4 The Proposed Algorithm (SSCH)
5 Results and Discussion
6 Conclusion
References
A DSS for Predicting Lymphoma in Primary Sjogren’s Syndrome Patients
1 Introduction
2 Materials and Methods
2.1 Data
2.2 Feature Selection
2.3 Models
3 Results
4 Web Application Implementation
5 Discussion
References
Decision Support System for Breast Cancer Detection Using Biomarker Indicators
1 Introduction
2 Materials and Methods
2.1 Breast Cancer Coimbra Dataset
2.2 Model Extraction Approach
3 Performance Assessment
3.1 Experimental Phases
3.2 Results
4 Application Design
4.1 Client Side: iOS Application
4.2 Server Side
5 Conclusions
References
Hepatocellular Carcinoma Detection Using Machine Learning Techniques
1 Introduction
2 Related Work
3 Data Preprocessing
4 Experiments and Results
5 Web Platform
6 Conclusions
References
Web-Based Decision Support System for Coronary Heart Disease Diagnosis
1 Introduction
2 Materials and Methods
2.1 Dataset Description and Data Preprocessing
2.2 Machine Learning Algorithms
2.3 Web-Based Decision Support System Development
3 Results
3.1 Models Evaluation
3.2 Web-Based Decision Support System
4 Discussion
References
A Decision Support System for the Prediction of Drug Predisposition Through Personality Traits
1 Introduction
2 Materials and Methods
2.1 Dataset
2.2 Feature Selection and Preprocessing
2.3 Classification Algorithms
3 Results
3.1 Classification
3.2 Decision Support System
4 Discussion
References
Development of a Diagnostic Tool for Balance Disorders Based on Machine Learning Techniques
1 Introduction
2 Related Work
3 Data
4 Methods
4.1 Feature Selection
4.2 Machine Learning Experiments
5 Results
5.1 Determinant Features
5.2 Evaluation Metrics
6 Conclusions
References
Systems Approaches in the Common Metabolomics in Acute Lymphoblastic Leukemia and Rhabdomyosarcoma Cells: A Computational Approach
1 Introduction
1.1 Bioinformatics and the New Advent of Systems Biology
1.2 Systems Theory and Modeling
1.3 Systems Theory and Systems Biology
2 Materials and Methods
2.1 Experimental Setup
2.2 Microarray Experimentation
2.3 Bioinformatics and Systems Analyses
2.4 Simulation
3 Results
3.1 The Ontology of the Transcriptional Profile
3.2 The Time-Dependent Evolution of Molecules
3.3 Polar Transformation of Time-Dependent Evolution
3.4 Species-Dependent Evolution of the Total System for All Time Points
4 Discussion
5 Conclusions
References
Bioinformatics Analyses of Spatial Peripheral Circadian Clock-Mediated Gene Expression of Glucocorticoid Receptor-Related Genes
1 Introduction
2 Materials and Methods
2.1 Subjects Enrolled and Study Design
2.2 Total RNA Isolation and SYBR Green Real-Time PCR
2.3 Data Analysis
2.3.1 Real-Time Data Analysis
2.3.2 Unsupervised Classifying Methodologies
2.3.3 Polynomial and Exponential Regression Analysis
2.3.4 Correlation Analysis
2.3.5 Statistical Analysis
3 Results
3.1 Statistical and Correlation Analysis
3.2 HCL Analysis
3.3 K-Means Clustering Analysis
3.4 ROC Classification
3.5 Naïve Bayes Classification
3.6 Regressions of Estimated Variables
4 Discussion
5 Conclusions
References
Machine Learning for Autistic Spectrum Disorder Risk Screening
1 Introduction
2 Past Related Work
3 Dataset
3.1 Features
3.2 Processing
3.3 Statistics
4 Algorithms
5 Results
6 Application Stack
7 Autism Spectrum Disorder Screening Application
7.1 Technical Description
7.2 Layout
8 Conclusions
References
Mobile Application for Monitoring and Preventing Cognitive Decline Through Lifestyle Intervention
1 Introduction
2 Related Work
3 Models
4 Application
4.1 Architecture
4.2 Implementation
5 Discussion
References
Virtual Reality Zoo Therapy for Alzheimer’s Disease Using Real-Time Gesture Recognition
1 Introduction
2 Related Works
2.1 Animal Therapy for Alzheimer’s Disease
2.2 Virtual Reality and Alzheimer’s Disease
2.3 Gesture Detection
3 Our Approach: Zoo Therapy System
3.1 Zoo VR
3.2 EEG Measures
3.3 Gesture Recognition System
3.4 Intelligent Agent
4 Experiments
5 Results
6 Conclusion
References
Validation of the Greek Version of Social Media Disorder Scale
1 Introduction
2 Methods
2.1 Translation Procedure
2.2 Participants and Procedures
2.3 Ethical Considerations
2.4 Measures
2.5 Statistical Analysis
3 Results
4 Discussion
References
Adiponectin and Its Effects on Acute Leukemia Cells: An Experimental and Bioinformatics Approach
1 Introduction
1.1 Adiponectin
1.2 Childhood Acute Lymphoblastic Leukemia
1.3 Adiponectin and Childhood Acute Lymphoblastic Leukemia
1.4 Scope
2 Materials and Methods
2.1 Cell Culture and Adiponectin Treatment
2.2 Microarray Experimentation
2.2.1 RNA Extraction
2.2.2 cDNA Microarrays
2.3 Data Analysis
2.3.1 Flow Cytometry Data Analysis
2.3.2 Microarray Data Analysis
3 Results
3.1 Proliferation Induced by Adiponectin
3.2 Cytotoxicity and Cell Cycle Distribution Induced by Adiponectin
3.3 Gene Expression Profiling
4 Discussion
5 Conclusions
References
Nature and Quantum-Inspired Procedures – A Short Literature Review
1 Introduction
2 Literature Review
2.1 Quantum-Inspired Procedures in the Literature
3 Conclusion and Further Work
References
Handling the Cellular Complex Systems in Alzheimer’s Disease Through a Graph Mining Approach
1 Introduction
1.1 Alzheimer’s Disease and Systems Biology
1.2 Basics to Molecular Biology
1.3 Pathway Analysis
2 Methods
3 Results and Discussion
3.1 Ensemble DEG Evaluation to Simulated Data
3.2 Application to Real Data
References
Debunking the Neuromyth of Learning Style
1 Introduction
2 Literature Review
2.1 Neuromyths
2.2 Multiple Representations
3 Methods
3.1 Participants
3.2 Procedure
4 Analysis and Findings
5 Limitations
6 Conclusions
Annex I: Visual Version
Annex II: Auditory Version
Annex III: Kinesthetic Version
Annex IV: Structured Interview
References
Expert Characteristics: Implications for Expert Systems
1 Introduction
2 Inside the Expert
2.1 Specificity and Creativity in Experts
2.2 Abilities
2.2.1 G-Factor
2.2.2 Theory of Multiple Intelligences
2.2.3 Triarchic Theory of Successful Intelligence
2.2.4 Other Theories
3 A Brief Study of Expert Memory
3.1 Expert Processing of Stimuli
4 Expert(ise) Characteristics
4.1 A Compilation of Expert Characteristics
5 Expert Characteristics and Expert Systems
6 The Contributory Expert Generalist
7 Conclusion
References
Improving the Run-Time of Space-Efficient n-Gram Data Structures Using Apache Spark
1 Introduction
2 Related Work
3 Proposed Method
3.1 Two-Level Inverted Index Example
4 Implementation
4.1 Dataset
4.2 Apache Spark Framework
5 Results
5.1 Dataset of 5 MB
5.2 Dataset of 20 MB
6 Conclusions and Future Work
References
Development of a Protein Biochip Platform for Parkinson’s Disease
1 Introduction
2 Biosensor Research and Characterization of Detection System Properties
3 Conclusions
References
The Use of Data Collection and Big Data Analysis in Neurodegenerative Disease Prevention
Fractal Dynamics in the RR Interval of Craniopharyngioma and Adrenal Tumor in Adolescence
1 Introduction
1.1 Craniopharyngiomas and Adrenal Tumors
1.2 Fractals
1.3 Scope
2 Materials and Methods
2.1 Subjects Enrolled and Study Design
2.2 Measurements and Data Collection
2.3 Data Analysis
3 Results
3.1 The Time Series
3.2 The 2D Phase Space of the Time Series
3.3 The 3D Phase Space of the Time Series
4 Discussion
4.1 Conclusions
References
Bioinformatics Approaches for Parkinson’s Disease in Clinical Practice: Data-Driven Biomarkers and Pharmacological Treatment
1 Introduction
2 Parkinson’s Disease and Related Biomarkers
3 Bioinformatics Analysis and Dynamic Network Models
4 Therapeutic Strategies in PD
References
Emerging Machine Learning Techniques for Modelling Cellular Complex Systems in Alzheimer’s Disease
1 Introduction
2 Basics of Machine Learning
3 Basics of Molecular Biology Networks
3.1 Gene Regulatory Network
3.2 Protein Interaction Networks
3.3 Metabolic Networks
3.4 Signaling Pathways
3.5 Molecular Biology Databases
4 Methods for Gene Regulatory Network Reconstruction: Related Works
4.1 Methods Based on Correlation and Mutual Information
4.2 Methods Based on Machine Learning and Regression
5 GRN Reconstruction Through Recent Machine Learning Tools
6 Discussion: Challenges and Future Prospects
References
Cognitive Enhancement Through Mathematical Problem-Solving
1 Introduction
2 Research Methodology
3 Findings
4 Conclusions
References
Cognitive Tasks of an Information System for Memory Training and Cognitive Enhancement Using Mobile Devices
1 Introduction
2 Utilization of Mobile Devices
3 Cognitive Training Information System
4 Conclusion
References
An Application for Exploring Visual Perception: A Pilot Neuroeducational Study
1 Introduction
2 Related Work
3 Explain the Application
3.1 Basic Functionality
3.2 Graphical User Interface
4 Explain the Algorithm
4.1 Scoring Method
4.2 Image Comparison
5 Obstacles to Create a Digital Version of the ROCF Test
6 Future Work
7 Conclusion
References
Gene Classification Based on Multi-Class SVMs with Systematic Sampling and Hierarchical Clustering (SSHC) Algorithm
1 Introduction
2 Methodology
2.1 The Hierarchical Clustering (SSHC) with Systematic Sampling Algorithm
2.2 Multi-Class Support Vector Machines (MCSVMs)
2.2.1 One-Against-One (OAO) Support Vector Machines
2.2.2 Directed Acyclic Graph (DAG) Support Vector Machines
2.2.3 Binary Tree (BT) Support Vector Machines
2.2.4 One-Against-All (OAA) Support Vector Machines
2.3 Evaluation Metrics
3 Microarray Expression Datasets
3.1 Adipose Tissue Macrophages (ATMs)
3.2 Adipose Tissue Macrophages (EATMs) with Metabolic Signatures
4 The Proposed SSHC-MCSVM Algorithm
5 Result and Discussion
6 Conclusion
References
Multinetwork Motor Learning as a Model for Dance in Neurorehabilitation
1 Introduction
2 Dance and Neuroplasticity: Review of Dance Elements
3 Motor Learning in the Real World
4 Motor Control Training
5 Spatial Orienting
6 Timing and Entrainment
7 Action Observation
8 Future Topics: More Than Movement Training
References
Controlling the Chimera Form in the Leaky Integrate-and-Fire Model
1 Introduction
2 The Model
2.1 Coupling Scheme of LIF Network
2.2 Modification in the Threshold Potentials
2.3 The Kuramoto Synchronization Index
3 Chimera States under Broken Connectivity
3.1 Single Chimera
3.2 Multichimera States
4 The Effects of Frequency Disorder
5 Conclusions and Open Problems
References
Qualitative Differences Between the Semi-separable and the “Almansi-Type” Stokes Stream Function Expansions in the Study of Biological Fluids
1 Introduction
2 Theoretical Background
2.1 Stokes Operators E2 and E4 in Prolate Spheroidal Coordinates: Eigenfunctions in Separable and Semiseparable Forms
2.2 The “Almansi-Type” Approach
3 Results
3.1 Connection Formula for the Eigenfunction Expansions: Reduction of the Almansi-Type Solutions to the Semiseparable Ones
3.2 The Oblate Spheroidal Case
3.3 Sample Calculations: Eigenflow Patterns
4 Discussion
References
A Multiscale Mathematical Model for Tumor Growth, Incorporating the GLUT1 Expression
1 Introduction
2 Formulation of the Problem
3 Materials and Method
4 Results and Discussion
5 Conclusions
References
Correction to: Development of a Protein Biochip Platform for Parkinson’s Disease
Index

Advances in Experimental Medicine and Biology 1338

Panayiotis Vlamos   Editor

GeNeDis 2020

Computational Biology and Bioinformatics

Advances in Experimental Medicine and Biology
Volume 1338

Series Editors
Wim E. Crusio, CNRS and University of Bordeaux UMR 5287, Institut de Neurosciences Cognitives et Intégratives d’Aquitaine, Pessac Cedex, France
John D. Lambris, University of Pennsylvania, Philadelphia, PA, USA
Heinfried H. Radeke, Clinic of the Goethe University Frankfurt Main, Institute of Pharmacology & Toxicology, Frankfurt am Main, Germany
Nima Rezaei, Research Center for Immunodeficiencies, Children’s Medical Center, Tehran University of Medical Sciences, Tehran, Iran

More information about this series at http://www.springer.com/series/5584


Editor Panayiotis Vlamos Department of Informatics Ionian University Corfu, Greece

ISSN 0065-2598
ISSN 2214-8019 (electronic)
Advances in Experimental Medicine and Biology
ISBN 978-3-030-78774-5
ISBN 978-3-030-78775-2 (eBook)
https://doi.org/10.1007/978-3-030-78775-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021, Corrected Publication 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To my son, Michail, who is always motivating me.

Preface

The 4th World Congress on Genetics, Geriatrics and Neurodegenerative Diseases Research (GeNeDis 2020) focuses on the latest major challenges in scientific research, new drug targets, the development of novel biomarkers, new imaging techniques, novel protocols for early diagnosis of neurodegenerative diseases, and several other scientific advances, with the aim of better, safer, and healthier aging. Computational methodologies for the discovery of biomarkers for neurodegenerative diseases are discussed extensively. This volume focuses on the conference sessions on computational biology and bioinformatics.

Corfu, Greece

Panayiotis Vlamos


Acknowledgment

I would like to thank Konstantina Skolariki for the help she provided during the editorial process.


Contents

Digging for Significant Genes in Microarray Expression Data Based on Systematic Sampling and Hierarchal Clustering Algorithm
  Nwayyin N. Mohammed

A DSS for Predicting Lymphoma in Primary Sjogren’s Syndrome Patients
  Nikos Avgoustis and Themis Exarchos

Decision Support System for Breast Cancer Detection Using Biomarker Indicators
  Spiridon Vergis, Konstantinos Bezas, and Themis P. Exarchos

Hepatocellular Carcinoma Detection Using Machine Learning Techniques
  Ioannis Angelis and Themis Exarchos

Web-Based Decision Support System for Coronary Heart Disease Diagnosis
  Aikaterini Georgia Alvanou, Andreana Stylidou, and Themis P. Exarchos

A Decision Support System for the Prediction of Drug Predisposition Through Personality Traits
  Alexandros Zervopoulos, Asterios Papamichail, and Themis P. Exarchos

Development of a Diagnostic Tool for Balance Disorders Based on Machine Learning Techniques
  Maria Nefeli Nikiforos, Maria Malakopoulou, and Themis Exarchos

Systems Approaches in the Common Metabolomics in Acute Lymphoblastic Leukemia and Rhabdomyosarcoma Cells: A Computational Approach
  Tselios C, Apostolos Zaravinos, Athanasios N. Tsartsalis, Anna Tagka, Athanasios Kotoulas, Styliani A. Geronikolou, Maria Braoudaki, and George I. Lambrou

Bioinformatics Analyses of Spatial Peripheral Circadian Clock-Mediated Gene Expression of Glucocorticoid Receptor-Related Genes
  George I. Lambrou, Tomoshige Kino, Hishashi Koide, Sinnie Sin Man Ng, Styliani A. Geronikolou, Flora Bacopoulou, Evangelia Charmandari, and Chrousos G

Machine Learning for Autistic Spectrum Disorder Risk Screening
  Constantine Xipolitopoulos, Maria Nefeli Nikiforos, and Themis Exarchos

Mobile Application for Monitoring and Preventing Cognitive Decline Through Lifestyle Intervention
  Ioannis Angelis, Aikaterini Georgia Alvanou, Nikolaos Avgoustis, Spiridon Vergis, Alexandros Zervopoulos, Maria Malakopoulou, Konstantinos Bezas, Maria Nefeli Nikiforos, Asterios Papamichail, Andreana Stylidou, Themis P. Exarchos, and Panayiotis Vlamos

Virtual Reality Zoo Therapy for Alzheimer’s Disease Using Real-Time Gesture Recognition
  Hamdi Ben Abdessalem, Yan Ai, K. S. Marulasidda Swamy, and Claude Frasson

Validation of the Greek Version of Social Media Disorder Scale
  Ioulia Kokka, Iraklis Mourikis, Maria Michou, Dimitrios Vlachakis, Christina Darviri, Ioannis Zervas, Christina Kanaka-Gantenbein, and Flora Bacopoulou

Adiponectin and Its Effects on Acute Leukemia Cells: An Experimental and Bioinformatics Approach
  Athanasios N. Tsartsalis, Anna Tagka, Athanasios Kotoulas, Daphne Mirkopoulou, Styliani A. Geronikolou, and Lambrou G

Nature and Quantum-Inspired Procedures – A Short Literature Review
  Christos Papalitsas, Kalliopi Kastampolidou, and Theodore Andronikos

Handling the Cellular Complex Systems in Alzheimer’s Disease Through a Graph Mining Approach
  Aristidis G. Vrahatis, Panagiotis Vlamos, Maria Gonidi, and Antigoni Avramouli

Debunking the Neuromyth of Learning Style
  Alexandra Yfanti and Spyridon Doukakis

Expert Characteristics: Implications for Expert Systems
  Konstantinos G. Papageorgiou

Improving the Run-Time of Space-Efficient n-Gram Data Structures Using Apache Spark
  Fotios Kounelis, Andreas Kanavos, and Phivos Mylonas

Development of a Protein Biochip Platform for Parkinson’s Disease
  Christos Tzouvelekis, Marios Krokidis, Themis Exarchos, and Panayiotis Vlamos

The Use of Data Collection and Big Data Analysis in Neurodegenerative Disease Prevention
  Georgios Nikiforakis

Fractal Dynamics in the RR Interval of Craniopharyngioma and Adrenal Tumor in Adolescence
  Geronikolou S, Flora Bacopoulou, George I. Lambrou, and Dennis Cokkinos

Bioinformatics Approaches for Parkinson’s Disease in Clinical Practice: Data-Driven Biomarkers and Pharmacological Treatment
  Marios G. Krokidis, Themis Exarchos, and Panayiotis Vlamos

Emerging Machine Learning Techniques for Modelling Cellular Complex Systems in Alzheimer’s Disease
  Aristidis G. Vrahatis, Panagiotis Vlamos, Antigoni Avramouli, Themis Exarchos, and Maria Gonidi

Cognitive Enhancement Through Mathematical Problem-Solving
  Ioannis Saridakis and Spyridon Doukakis

Cognitive Tasks of an Information System for Memory Training and Cognitive Enhancement Using Mobile Devices
  Panagiota Giannopoulou

An Application for Exploring Visual Perception: A Pilot Neuroeducational Study
  Stavros Panakakis, Sophia Tsivoula, and Spyridon Doukakis

Gene Classification Based on Multi-Class SVMs with Systematic Sampling and Hierarchical Clustering (SSHC) Algorithm
  Nwayyin Najat Mohammed

Multinetwork Motor Learning as a Model for Dance in Neurorehabilitation
  Rebecca Barnstaple

Controlling the Chimera Form in the Leaky Integrate-and-Fire Model
  A. Provata, Ch. G. Antonopoulos, and P. Vlamos

Qualitative Differences Between the Semi-separable and the “Almansi-Type” Stokes Stream Function Expansions in the Study of Biological Fluids
  Maria Hadjinicolaou and Eleftherios Protopapas

A Multiscale Mathematical Model for Tumor Growth, Incorporating the GLUT1 Expression
  Pantelis Ampatzoglou, Foteini Kariotou, and Maria Hadjinicolaou

Correction to: Development of a Protein Biochip Platform for Parkinson’s Disease

Index

Editor Biography

Panagiotis Vlamos is the Chairman of the University Research Center of the Ionian University and Director of the Laboratory of Bioinformatics and Human Electrophysiology (BiHELab) of the Ionian University. He completed his undergraduate studies in Mathematics at the University of Athens and obtained his Doctorate in Mathematics from the National Technical University of Athens. He has received many awards and has authored more than 250 papers in scientific journals and conference proceedings, as well as 16 educational books. BiHELab’s goal is to help bridge the translation gap from data to models and from models to drug discovery and personalized therapy, promoting collaborations and developing novel approaches to biological and clinical problems using high-performance computational methods. Professor Panagiotis Vlamos is a pioneer in the evaluation, improvement, and application of digital and computational biomarkers, and he was the first to introduce the notion of meta-biomarkers.


Digging for Significant Genes in Microarray Expression Data Based on Systematic Sampling and Hierarchal Clustering Algorithm

Nwayyin N. Mohammed

Abstract

Obesity is a worldwide health problem. Eating habits have changed during this decade, and an increase in the intake of high-fat foods and sugar has been observed, which is associated with obesity and weight gain. Therefore, in this chapter, we analyse microarray expression data for obese and lean individuals. Microarray technology simultaneously records the expression levels of thousands of genes across related samples and during biological processes. Microarray datasets are enriched with crucial information that has to be examined. In the study discussed in this chapter, the microarray datasets are pre-processed prior to analysis, and upregulated and downregulated gene groups are identified. Clustering is a learning technique that is applied in many fields of study. Clustering of microarray data can be performed on genes or on samples, depending on the type of dataset. A hierarchical clustering algorithm was used to detect gene patterns in our candidate datasets, since microarray data are considered big and complex. A systematic sampling technique was used to reduce the complexity of the microarray datasets and to enhance the clustering quality. This technique is a simple and convenient sampling technique. The proposed algorithm, Systematic Sampling with Hierarchical Clustering (SSHC), detected significant gene patterns in the datasets and showed better performance. The validity index used to evaluate the SSHC algorithm is the adjusted Rand index (ARI).

N. N. Mohammed () University of Sulaimani, Collage of Science, Computer Department, Sulaymaniyah, Iraq

Keywords

Systematic Sampling · Hierarchal Clustering · Microarray expression data · Pre-processing · Validity Index · Principal Component Analysis (PCA)

1 Introduction

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. P. Vlamos (ed.), GeNeDis 2020, Advances in Experimental Medicine and Biology 1338, https://doi.org/10.1007/978-3-030-78775-2_1

DNA microarray technology initiated the field of studying, analysing and identifying gene behaviours in groups. A single DNA microarray experiment generates expression profiles of thousands of genes, known as gene expression data. Gene expression profiles are a rich source of knowledge about genes. Thus, exploring gene expression data is crucial for gaining insight into gene regulation mechanisms and for detecting variations in gene expression patterns under different disease conditions [1]. Therefore, the development and adaptation of computational methodologies and tools are required to analyse these data types [2].

Pre-processing microarray data is an essential step in gene expression data analysis. Pre-processing addresses several issues, for example, transformation of the data into a scale suitable for analysis and removal of the effects of systematic sources of variation [3]. Normalization is the first transformation applied to gene expression data. It adjusts the individual intensities to balance them so that meaningful biological comparisons can be made [4]. In this analysis, both microarray datasets are normalized as the first pre-processing step.

The identification of differentially expressed (DE) genes is an important task in microarray data analysis, since it helps to study disease-related, cell-specific gene expression patterns. There are different statistical approaches for identifying DE genes; the one used here is the t-test, from which P-values were obtained. The fold change (FC) is also calculated for the gene expression data. In this analysis, differentially expressed genes are selected by combining both measures, P-values and fold changes. The threshold for this combined criterion depends on the microarray dataset [5].

Microarray datasets are considered big and complex, and the analysis of such big datasets is time-consuming. In addition, big data are new and key components of recent statistical methodologies. One of the challenges in working with microarray data is their high dimensionality combined with a small sample size. While an abundance of information can be obtained from such a dataset, the high dimensionality can obscure the information of interest. Principal component analysis (PCA) reduces the dimensions of big datasets.
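As a hedged illustration of how PCA can reduce dimensionality when genes far outnumber samples, one common formulation works on the singular value decomposition of the column-centred data matrix. This is an assumption made for the sketch, not necessarily the exact variant used in this chapter, and the symbols $X$, $X_c$, $n$, $p$, $q$, $U$, $\Sigma$ and $V$ are introduced only here.

```latex
% Sketch: PCA via SVD for an n x p expression matrix X (n samples, p genes, p >> n).
% X_c denotes X with each column (gene) mean subtracted.
\begin{align}
X_c &= U \Sigma V^{\top},
  \qquad \Sigma = \operatorname{diag}(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r),
  \quad r \le \min(n-1,\, p), \\
T_q &= X_c V_q = U_q \Sigma_q
  \quad \text{(scores on the first $q$ components)}, \\
\frac{\sigma_k^2}{\sum_{j=1}^{r} \sigma_j^2}
  &= \text{fraction of variance explained by component } k .
\end{align}
```

Because at most $n-1$ singular values are nonzero after centring, each $p$-dimensional profile is represented by at most $n-1$ coordinates, and the $p \times p$ covariance matrix never has to be formed.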
Classical principal component analysis is not useful when the number of variables is larger than the number of observations, a condition that holds for microarray datasets. Therefore, in this analysis, principal component analysis based on the decomposition of the real data matrix (singular value decomposition, SVD) is used [6].

In many studies, there is great interest in grouping or clustering objects using empirical similarity measures. Hierarchical clustering is one of the grouping approaches widely used with microarray data [7]; its result is represented as a tree, that is, a set of nested clusters derived from a proximity matrix [8]. The common way to visualize the hierarchical clustering of a dataset is the dendrogram, a branching diagram that represents the similarity relationships between clusters [9].

Sampling is an important approach for exploring large datasets. Sampling methods depend on the accessible population. Sampling decreases the time it takes to explore big datasets and gain insights, and it also improves the quality of those insights [10]. Systematic sampling is an unbiased probability sampling method. For example, if the whole population is part of a list, systematic sampling would take every third item until the desired sample size is reached. It is better to randomize the starting position of the sampling in the list.

There are three types of cluster validity: external, internal and relative. The Rand index and the adjusted Rand index are external validity indices [11]. The adjusted Rand index is a form of the Rand index that is corrected for agreement between groupings that arises by chance [12]. In this analysis, the adjusted Rand index is the cluster validity index used to estimate the quality of the k partitions produced by the Systematic Sampling with Hierarchical Clustering (SSHC) algorithm.
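To make the chance correction concrete, the following is a minimal sketch of the adjusted Rand index in its usual pair-counting (Hubert and Arabie) form. The function name `adjusted_rand_index` and the toy labelings are hypothetical and serve only as illustration; the chapter does not state which implementation was used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_true)
    # Contingency counts n_ij and their row/column marginals a_i, b_j.
    pair_counts = Counter(zip(labels_true, labels_pred))
    row_counts = Counter(labels_true)
    col_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on
    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        # Both partitions are trivial and identical (one cluster, or all singletons).
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical usage: a reference grouping versus a clustering result.
reference = [0, 0, 1, 1]
clustering = [1, 1, 0, 0]  # same grouping, different label names
score = adjusted_rand_index(reference, clustering)
```

The index equals 1 for identical groupings regardless of label names, is close to 0 for groupings that agree only as much as chance would predict, and can be negative when agreement is below the chance level.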

2 Methodology

2.1 Hierarchical Clustering Algorithm

Hierarchical clustering is a popular clustering algorithm. It depends on the similarity measure between samples and genes. Hierarchical clustering is either agglomerative or divisive. Agglomerative hierarchical clustering builds from the bottom up, with each data point initially considered an individual cluster. This type of clustering generates an interesting form of data visualization: the clusters do not overlap and have a tree-like structure. In divisive hierarchical clustering, however, all the data points start in one single cluster that is successively divided into multiple clusters [9, 13].

Agglomerative Hierarchical Clustering Algorithm (Fig. 1)

Input: X = {x1, …, xm}, a set of m data points.
Output: A single cluster {c}.
Steps:

1. Calculate the m × m proximity matrix D containing the distances d(xi, xj) ∀ i, j = 1, 2, ..., m.
2. Iterate:
   (i) Find the most similar pair of clusters ci and cj using Eq. (5):

       $$d_{\min}(c_i, c_j) = \min_{x_1 \in c_i,\; x_2 \in c_j} d(x_1, x_2) \tag{5}$$

   (ii) Combine clusters ci and cj into a single cluster.
   (iii) Revise the proximity matrix D by removing the rows and columns corresponding to ci and cj and appending a row and column corresponding to the newly formed cluster. The proximity between this new cluster c and an old cluster ck is defined as in Eq. (6):

       $$d_{\min}(c, c_k) = \min\{\, d_{\min}(c_i, c_k),\; d_{\min}(c_j, c_k) \,\} \tag{6}$$

3. Until only one cluster remains [14].

Fig. 1 The dendrogram of hierarchical agglomerative clustering (leaves A to G; vertical axis: Similarity)

2.2 The Systematic Sampling

The main goal of this sampling method is to form the best representation of the population of interest. Systematic sampling is a common sampling method, and it is used when the sampling frame is not obtainable. Systematic sampling is legitimate as a probability sampling method, since the selection procedure avoids bias provided that no specific pattern exists in the sequence. The main advantages of systematic sampling are its simplicity and operational convenience, and it is more efficient than simple random sampling [11].

To take a systematic sample:

1. Define the study population as a fixed set of units labeled u = 1, ..., N.
2. A systematic sample of size n is obtained by drawing a random integer r from 1, ..., N and sampling the set of units given by s = {r, r + k, r + 2k, ..., r + (n − 1)k}.
3. The term k is called the sampling interval.
4. For any j for which r + jk > N, the unit selected is r + jk − N.

Systematic sampling is used in many applications with any population that can be put in a list [15, 16].

3 Data Sets

The microarray datasets used in this study are gene expression profiles of mouse adipose tissue macrophages (ATMs) and of adipose tissue macrophages (ATMs) throughout obesity, which promotes inflammatory responses. This section describes each of these datasets.

A. Adipose tissue macrophages (ATMs). The first dataset comes from a study in which expansion of adipose tissue mass induces accumulation of adipose tissue macrophages (ATMs) in mammals; the ATMs were isolated from lean and obese mice. Transcripts that were differentially expressed between lean and obese mice were identified. The dataset is composed of 16 samples with expression values for 45,102 genes. This dataset is referred to as the ATMs dataset [17].

B. Adipose tissue macrophages (ATMs) with metabolic signatures. The study set out to characterize the metabolic signatures of adipose tissue macrophages (ATMs) in lean and obese conditions. Transcriptome analysis, real-time flux measurements, ELISA, and several other approaches were used to determine the metabolic signatures and inflammatory status of the macrophages. The dataset is composed of 8 samples with expression values for 35,557 genes. This dataset is referred to as the adipose tissue macrophages with metabolic signatures (EATMs) dataset [18].

3.1

Adjusted Rand Index (ARI)

A measure of agreement is needed to compare clustering results against external criteria. Since each gene is assigned to only one class in the external criterion and to only one cluster, measures of agreement between two partitions can be used [19]. The adjusted Rand index (ARI) measures the similarity of two partitions of the same data, ignoring permutations of the labels and adjusting the score to be close to 0 when the partitions are generated by chance. The adjusted Rand index is defined by Eq. (1). The ARI equals 1 for identical partitions, is close to 0 for partitions that agree only by chance, and can be slightly negative when agreement is worse than chance. Higher ARI values indicate better partitions than lower ones [20].

$$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]} \qquad (1)$$

4

The Proposed Algorithm (SSHC)

The proposed approach, Hierarchical Clustering with Systematic Sampling (SSHC), is illustrated in the flowchart below.

Fig. 2 The Hierarchical Clustering Algorithm with Systematic Sampling (SSHC)

Figure 2 shows the flowchart of the Hierarchical Clustering with Systematic Sampling algorithm. The algorithm takes the gene expression datasets (raw datasets) as input. The raw datasets are pre-processed, and the downregulated and upregulated genes are then selected using statistical criteria, that is, a combination of the P-values and fold changes obtained for the genes. In addition, the hierarchical clustering algorithm performs better when it is applied to the first few principal components (PCs) of the genes, obtained by SVD. However, a large number of genes still remains. Thus, the expression data of the selected genes are sampled using the systematic sampling (SS) method. The Hierarchical Clustering (HC) algorithm is thus enhanced by the systematic sampling method and by principal component analysis (PCA) based on singular value decomposition (SVD) of the real data matrix. Hierarchical agglomerative clustering is a strategy that treats each gene as a single cluster and iteratively merges (or agglomerates) disjoint groups of genes until the stop criterion is met. The bottom-up Hierarchical Agglomerative Clustering with Systematic Sampling (SSHAC) algorithm creates suboptimal clustering solutions, which are typically visualized as a dendrogram. The dendrogram represents the level of similarity between two adjacent gene clusters and allows the merging process to be retraced step by step. Any desired number of gene clusters can be obtained by cutting the dendrogram at the appropriate level. The partitions produced by the SSHAC algorithm are evaluated using the adjusted Rand index (ARI). This evaluation is the final step of the proposed algorithm: the k partitions with the higher ARI value indicate better gene cluster quality.
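The sampling, merging, and evaluation steps described above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the function names, the toy two-dimensional "profiles", and the choice of Euclidean distance are assumptions made for illustration, and the gene filtering and PCA/SVD steps are omitted. The merging rule is single linkage, matching the minimum-distance definition in Eq. (5), and the sampler assumes n·k ≤ N so that wrapped indices stay distinct.

```python
import random
from itertools import combinations
from math import comb, dist


def systematic_sample(n_units, n, k, rng=random):
    """0-based indices of a systematic sample: random start, step k, wrap past the end."""
    r = rng.randrange(n_units)
    return [(r + j * k) % n_units for j in range(n)]


def single_linkage(points, n_clusters):
    """Agglomerative clustering: merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster

    def d_min(a, b):
        # Single-linkage distance: closest pair of members across the two clusters.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: d_min(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] += clusters.pop(b)  # b > a, so index a is unaffected by the pop
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def adjusted_rand_index(labels_a, labels_b):
    """ARI computed from the contingency table of two labelings (pair-counting form)."""
    n = len(labels_a)
    cells, rows, cols = {}, {}, {}
    for a, b in zip(labels_a, labels_b):
        cells[(a, b)] = cells.get((a, b), 0) + 1
        rows[a] = rows.get(a, 0) + 1
        cols[b] = cols.get(b, 0) + 1
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings are one cluster
        return 1.0
    return (index - expected) / (max_index - expected)


# Toy data (made up): two well-separated groups of six 2-D points each.
points = [(0.0, 0.1 * i) for i in range(6)] + [(10.0, 0.1 * i) for i in range(6)]
truth = [0] * 6 + [1] * 6
idx = systematic_sample(len(points), n=6, k=2, rng=random.Random(0))
labels = single_linkage([points[i] for i in idx], n_clusters=2)
ari = adjusted_rand_index(labels, [truth[i] for i in idx])
print(ari)  # 1.0: the two groups are recovered on the sampled points
```

Stopping the merge loop at `n_clusters` plays the role of cutting the dendrogram at the level that yields k clusters. The quadratic search for the closest pair is kept for readability and would be too slow for tens of thousands of genes, which is part of the motivation for sampling first.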

5

Results and Discussion

Hierarchical clustering performs poorly on large datasets in terms of memory and computing time. One of the main goals of a microarray data clustering algorithm is to evaluate clustering performance in order to determine the quality of the partitions, in other words the number of natural patterns in the datasets. The tables below show the efficiency of the proposed Systematic Sampling with Hierarchical Clustering (SSHC) algorithm on the microarray expression datasets used in this analysis, ATMs and EATMs. The index used in this analysis is the adjusted Rand index; for different K partition values of the microarray datasets, the SSHC algorithm performed differently according to the ARI values. Table 1 shows the values obtained using the hierarchical clustering algorithm without systematic sampling: a lower ARI was obtained for both datasets (ATMs and EATMs) than with the developed SSHAC algorithm. Table 2 shows the performance of the SSHC algorithm. On both datasets, the proposed algorithm performed better and achieved higher adjusted Rand indices. The highest performance of the proposed SSHC algorithm was observed on the ATMs dataset, for which the adjusted Rand index is 0.8148. Computing time also improved: less computing time was recorded.

Table 1 Performance of hierarchical clustering for the candidate datasets

Dataset   K partitions   Adjusted Rand index
ATMs      10, 5          0.6569
EATMs     10, 5          0.5183

Table 2 Performance of systematic sampling with hierarchical clustering for the candidate datasets

Dataset   K partitions   Adjusted Rand index
ATMs      5, 4           0.8148
EATMs     5, 3           0.6493

6

Conclusion

The proposed Hierarchical Agglomerative Clustering algorithm based on Systematic Sampling (SSHAC) reduces the time and space complexity of the common agglomerative hierarchical clustering algorithm. The detection of significant gene patterns within the two microarray datasets was much better with the SSHAC algorithm than with the ordinary agglomerative hierarchical clustering algorithm: k partitions with higher adjusted Rand index (ARI) values were detected. The effect of the number of partitions K on the performance of the SSHAC algorithm depends on the candidate microarray dataset. Appropriate values of K are those for which the SSHAC algorithm records a higher validity index and groups of significant genes are identified.

References

1. Bhattacharyya R, Bhattacharyya B (2008) Gene expression mining for cohesive pattern discovery. In: International conference on bioinformatics research and development, pp 221–234
2. Dubitzky W, Granzow M, Berrar D (2002) Data mining and machine learning methods for microarray analysis. In: Methods of microarray data analysis. Springer, pp 5–22
3. Amaratunga D, Cabrera J, Shkedy Z (2014) Exploration and analysis of DNA microarray and other high-dimensional data. John Wiley & Sons
4. Quackenbush J (2002) Microarray data normalization and transformation. Nat Genet 32:496–501
5. Dalman MR, Deeter A, Nimishakavi G, Duan Z-H (2012) Fold change and p-value cutoffs significantly alter microarray interpretations. BMC Bioinformatics:S11
6. Praus P (2005) SVD-based principal component analysis of geochemical data. Cent Eur J Chem 3(4):731–741
7. Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32:241–254
8. Patel S, Sihmar S, Jatain A (2015) A study of hierarchical clustering algorithms. In: 2015 2nd international conference on computing for sustainable global development (INDIACom), pp 537–541
9. Embrechts MJ, Gatti CJ, Linton J, Roysam B (2013) Hierarchical clustering for large data sets. In: Advances in intelligent signal processing and data mining. Springer, pp 197–233
10. Rojas JAR, Kery MB, Rosenthal S, Dey A (2017) Sampling techniques to improve big data exploration. In: 2017 IEEE 7th symposium on large data analysis and visualization (LDAV), pp 26–35
11. Huang K-C (2004) Mixed random systematic sampling designs. Metrika 59:1–11
12. Qannari EM, Courcoux P, Faye P (2014) Significance test of the adjusted Rand index. Application to the free sorting task. Food Qual Prefer 32:93–97
13. Liu W, Wang T, Chen S, Tang A (2009) Hierarchical clustering of gene expression data with divergence measure. In: 2009 3rd international conference on bioinformatics and biomedical engineering, pp 1–3
14. Hassan SI, Samad A, Ahmad O, Alam A (2019) Partitioning and hierarchical based clustering: a comparative empirical assessment on internal and external indices, accuracy, and time. Int J Inf Technol:1–8
15. Bujang MA, Ab Ghani P, Zolkepali NA, Adnan TH, Ali MM, Selvarajah S et al (2012) A comparison between convenience sampling versus systematic sampling in getting the true parameter in a population: explore from a clinical database: the Audit Diabetes Control Management (ADCM) registry in 2009. In: 2012 international conference on statistics in science, business and engineering (ICSSBE), pp 1–5
16. Bellhouse D (2014) Systematic sampling methods. Wiley StatsRef: Statistics Reference Online
17. Xu X, Grijalva A, Skowronski A, van Eijk M, Serlie MJ, Ferrante AW Jr (2013) Obesity activates a program of lysosomal-dependent lipid metabolism in adipose tissue macrophages independently of classic activation. Cell Metab 18:816–830
18. Boutens L, Hooiveld GJ, Dhingra S, Cramer RA, Netea MG, Stienstra R (2018) Unique metabolic activation of adipose tissue macrophages in obesity promotes inflammatory responses. Diabetologia 61:942–953
19. Yeung KY, Ruzzo WL (2001) Details of the adjusted rand index and clustering algorithms, supplement to the paper an empirical study on principal component analysis for clustering gene expression data. Bioinformatics 17:763–774
20. Ortiz-Bejar J, Tellez ES, Graff M, Ortiz-Bejar J, Jacobo JC, Zamora-Mendez A (2019) Performance analysis of K-means seeding algorithms. In: 2019 IEEE international autumn meeting on power, electronics and computing (ROPEC), pp 1–6

A DSS for Predicting Lymphoma in Primary Sjogren’s Syndrome Patients

Nikos Avgoustis and Themis Exarchos

Abstract

Primary Sjogren’s syndrome (pSS) is a chronic autoimmune disease characterized by exocrine gland dysfunction. In this work, a web application was developed as a screening test based on a machine learning model that was trained on clinical data and is used to predict lymphoma outcomes in pSS patients. The results of the final model show a sensitivity of 100%, an accuracy of 82%, and an area under the curve of 98%, and confirm the importance of C4 value, lymphadenopathy, and rheumatoid factor as prominent lymphoma predictors.

Keywords

Primary Sjogren’s Syndrome · Screening test · Machine Learning

1

Introduction

Primary Sjogren’s syndrome (pSS) is a chronic systemic autoimmune disease causing exocrine gland dysfunction, with clinical manifestations varying from dry eye and mouth to the development of non-Hodgkin lymphoma (NHL), which occurs in about 5% of pSS patients [4, 5]. pSS primarily affects women near menopausal age, with a reported ratio of 9 affected females per 1 male, making pSS one of the chronic autoimmune diseases with the most imbalanced gender ratio [4, 5]. Moreover, the study of autoimmune lesions in minor salivary gland tissues has revealed cellular pathways that are highly associated with aggressive disease phenotypes which are prone to lymphoma development [2]. Previously suggested histopathological, clinical, and laboratory risk factors for lymphoma development, for prognostic and diagnostic purposes, include salivary gland enlargement (SGE), rheumatoid factor (RF), cryoglobulinemia, C4 hypocomplementemia, and purpura, among others [3, 7, 10]. In this work, a web application based on a supervised machine learning model was built to aid clinicians in the decision-making process.

N. Avgoustis · T. Exarchos, Ionian University, Department of Informatics, Corfu, Greece

2

Materials and Methods

2.1

Data

The data were acquired from the University of Athens and comprise 449 anonymized patients diagnosed with primary Sjogren’s syndrome (pSS). The dataset is the curated version of the dataset used in [9]. It consists of 449 rows and 65 columns: 64 features plus the diagnosis of the patient (lymphoma or not). The features are either discrete (0 or 1) or positive numeric values and consist of laboratory measurements (e.g., C3, C4, WBC baseline) and demographic data. The target class lymphoma (0–1) is imbalanced, with 76 instances in the positive class and 373 in the negative class. The features of the dataset are presented in Table 1.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 P. Vlamos (ed.), GeNeDis 2020, Advances in Experimental Medicine and Biology 1338, https://doi.org/10.1007/978-3-030-78775-2_2

First, we reserve 10% of the dataset for testing, which gives a test set of 45 instances, 38 in the negative class and 7 in the positive class. That leaves 404 instances for training, 335 in the negative class and 69 in the positive class. To deal with the class imbalance, we divide the negative class into five subsets and pair each of them with the positive class, which gives five training sets for our models, each with 69 lymphoma and 69 non-lymphoma instances.
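The balancing scheme described above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name, the shuffling seed, and the stand-in patient IDs are assumptions. It also assumes a plain disjoint split, under which the 335 training negatives fall into five chunks of 67 each; the text reports 69 negatives per training set, and how that figure was reached is not described, so the sketch does not try to reproduce it.

```python
import random


def balanced_training_sets(positives, negatives, n_subsets=5, seed=0):
    """Pair every positive with each of n_subsets disjoint chunks of the negatives."""
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    # Slice i takes elements i, i + n_subsets, i + 2 * n_subsets, ... so chunks are disjoint.
    chunks = [shuffled[i::n_subsets] for i in range(n_subsets)]
    return [list(positives) + chunk for chunk in chunks]


# Stand-in patient IDs with the class counts reported for the training split.
positives = [f"pos{i}" for i in range(69)]
negatives = [f"neg{i}" for i in range(335)]
training_sets = balanced_training_sets(positives, negatives)

# (positives, negatives) in each of the five training sets.
sizes = [(sum(x.startswith("pos") for x in s), sum(x.startswith("neg") for x in s))
         for s in training_sets]
print(sizes)
```

Every positive instance appears in all five training sets, while each negative instance appears in exactly one, so a model trained on any single set sees a nearly balanced class distribution.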

2.2

Feature Selection

To select the optimal feature set, feature ranking was applied, and we retained the 15 features with the highest scores. The ranking method was Recursive Feature Elimination with tenfold Cross-Validation (RFECV) [8], with an Extra Trees Classifier as the estimator, applied to each of the five datasets. RFECV ranks features by recursive feature elimination and selects the best number of features by cross-validation (Table 2); from the resulting ranking, the 15 highest-ranked features were chosen for each dataset.

2.3

Models

The machine learning models AdaBoost, Gradient Boost, Logistic Regression, and Gaussian Naive Bayes were evaluated with tenfold cross-validation on each of the five datasets, both with the features that RFECV suggested and with the 15 best features that we chose; the combination that produced the highest accuracy was then trained on the corresponding dataset (Table 3).

3

Results

In Tables 4 and 5, the results from testing the models on the reserved 10% of the dataset are presented. The selected features are as follows: “SGE,” “C4,” “FS 1st biopsy,” “chronic fatigue,” “low C4,” “lymphadenopathy,” “CRP,” “ESR,” “urine-specific gravity at diagnosis,” “C3,” “urine PH at first visit,” “Tarpley,” “RF (20 = 1),” “lymphocyte number,” and “RF+.”
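As a rough consistency check on the evaluation, the sensitivity and accuracy quoted in the abstract (100% and 82%) can be related to the reserved test set of 45 instances (7 positive, 38 negative). The confusion-matrix counts below are an assumption, not values read from the source tables: under this test-set composition they are the integer counts that round to the quoted percentages. The function name is illustrative.

```python
def screening_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Assumed counts (not taken from the source tables): all 7 positives detected,
# and 30 of the 38 negatives correctly classified.
sensitivity, specificity, accuracy = screening_metrics(tp=7, fn=0, tn=30, fp=8)
print(f"sensitivity={sensitivity:.0%} specificity={specificity:.0%} accuracy={accuracy:.0%}")
# sensitivity=100% specificity=79% accuracy=82%
```

With only 7 positive cases in the test set, a single missed lymphoma would lower sensitivity to about 86%, so the 100% figure should be read with that sample size in mind.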

4

Web Application Implementation

The application was built on Flask, which is a framework for developing web applications in Python, and is currently hosted on Heroku at https://medicaldss.herokuapp.com/. The application consists of a form in which the user enters the feature values. When the user submits the form, the data are passed to our model, and the prediction probability is displayed to the user in the form of a chart (Fig. 1).

5

Discussion

In this work, a Logistic Regression model based on clinical data was used to create a screening web application for predicting binary lymphoma outcomes in pSS patients. The model predicts lymphoma outcomes with an accuracy of 82%, a sensitivity of 100%, a specificity of 79%, and an AUC score of 0.98 using 15 features. It also confirms the importance of C4 value, lymphadenopathy, and rheumatoid factor as prominent lymphoma predictors, as previous works suggested [1, 6]. Prior to the evaluation of the model on a subset of clinical data that were reserved from the original dataset, great


Table 1  Data description Feature Sex (female = 1) Dry mouth, subjective Dry eyes, subjective Abnormal Schirmer’s Ana+ RF (< 20 = 0;> 20 = 1) Anti-Ro (0–1) Anti-La (0–1) Tarpley SGE Raynaud Ro/La RF+ Monoclonal gammopathy (blood) LOW C4 (