English Pages 303 [283] Year 2021
Advances in Experimental Medicine and Biology 1338
Panayiotis Vlamos Editor
GeNeDis 2020
Computational Biology and Bioinformatics
Advances in Experimental Medicine and Biology Volume 1338 Series Editors Wim E. Crusio, CNRS and University of Bordeaux UMR 5287, Institut de Neurosciences Cognitives et Intégratives d’Aquitaine, Pessac Cedex, France John D. Lambris, University of Pennsylvania, Philadelphia, PA, USA Heinfried H. Radeke, Clinic of the Goethe University Frankfurt Main, Institute of Pharmacology & Toxicology, Frankfurt am Main, Germany Nima Rezaei, Research Center for Immunodeficiencies, Children’s Medical Center, Tehran University of Medical Sciences, Tehran, Iran
More information about this series at http://www.springer.com/series/5584
Editor Panayiotis Vlamos Department of Informatics Ionian University Corfu, Greece
ISSN 0065-2598 ISSN 2214-8019 (electronic) Advances in Experimental Medicine and Biology ISBN 978-3-030-78774-5 ISBN 978-3-030-78775-2 (eBook) https://doi.org/10.1007/978-3-030-78775-2 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021, Corrected Publication 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my son, Michail, who is always motivating me.
Preface
The 4th World Congress on Genetics, Geriatrics and Neurodegenerative Diseases Research (GeNeDis 2020) focuses on the latest major challenges in scientific research, new drug targets, the development of novel biomarkers, new imaging techniques, novel protocols for early diagnosis of neurodegenerative diseases, and several other scientific advances, with the aim of better and safer healthy aging. Computational methodologies for the discovery of biomarkers for neurodegenerative diseases are extensively discussed. This volume focuses on the sessions from the conference regarding computational biology and bioinformatics.
Corfu, Greece
Panayiotis Vlamos
Acknowledgment
I would like to thank Konstantina Skolariki for the help she provided during the editorial process.
Contents
Digging for Significant Genes in Microarray Expression Data Based on Systematic Sampling and Hierarchal Clustering Algorithm
  Nwayyin N. Mohammed

A DSS for Predicting Lymphoma in Primary Sjogren’s Syndrome Patients
  Nikos Avgoustis and Themis Exarchos

Decision Support System for Breast Cancer Detection Using Biomarker Indicators
  Spiridon Vergis, Konstantinos Bezas, and Themis P. Exarchos

Hepatocellular Carcinoma Detection Using Machine Learning Techniques
  Ioannis Angelis and Themis Exarchos

Web-Based Decision Support System for Coronary Heart Disease Diagnosis
  Aikaterini Georgia Alvanou, Andreana Stylidou, and Themis P. Exarchos

A Decision Support System for the Prediction of Drug Predisposition Through Personality Traits
  Alexandros Zervopoulos, Asterios Papamichail, and Themis P. Exarchos

Development of a Diagnostic Tool for Balance Disorders Based on Machine Learning Techniques
  Maria Nefeli Nikiforos, Maria Malakopoulou, and Themis Exarchos

Systems Approaches in the Common Metabolomics in Acute Lymphoblastic Leukemia and Rhabdomyosarcoma Cells: A Computational Approach
  Tselios C, Apostolos Zaravinos, Athanasios N. Tsartsalis, Anna Tagka, Athanasios Kotoulas, Styliani A. Geronikolou, Maria Braoudaki, and George I. Lambrou

Bioinformatics Analyses of Spatial Peripheral Circadian Clock-Mediated Gene Expression of Glucocorticoid Receptor-Related Genes
  George I. Lambrou, Tomoshige Kino, Hishashi Koide, Sinnie Sin Man Ng, Styliani A. Geronikolou, Flora Bacopoulou, Evangelia Charmandari, and Chrousos G

Machine Learning for Autistic Spectrum Disorder Risk Screening
  Constantine Xipolitopoulos, Maria Nefeli Nikiforos, and Themis Exarchos

Mobile Application for Monitoring and Preventing Cognitive Decline Through Lifestyle Intervention
  Ioannis Angelis, Aikaterini Georgia Alvanou, Nikolaos Avgoustis, Spiridon Vergis, Alexandros Zervopoulos, Maria Malakopoulou, Konstantinos Bezas, Maria Nefeli Nikiforos, Asterios Papamichail, Andreana Stylidou, Themis P. Exarchos, and Panayiotis Vlamos

Virtual Reality Zoo Therapy for Alzheimer’s Disease Using Real-Time Gesture Recognition
  Hamdi Ben Abdessalem, Yan Ai, K. S. Marulasidda Swamy, and Claude Frasson

Validation of the Greek Version of Social Media Disorder Scale
  Ioulia Kokka, Iraklis Mourikis, Maria Michou, Dimitrios Vlachakis, Christina Darviri, Ioannis Zervas, Christina Kanaka-Gantenbein, and Flora Bacopoulou

Adiponectin and Its Effects on Acute Leukemia Cells: An Experimental and Bioinformatics Approach
  Athanasios N. Tsartsalis, Anna Tagka, Athanasios Kotoulas, Daphne Mirkopoulou, Styliani A. Geronikolou, and Lambrou G

Nature and Quantum-Inspired Procedures – A Short Literature Review
  Christos Papalitsas, Kalliopi Kastampolidou, and Theodore Andronikos

Handling the Cellular Complex Systems in Alzheimer’s Disease Through a Graph Mining Approach
  Aristidis G. Vrahatis, Panagiotis Vlamos, Maria Gonidi, and Antigoni Avramouli

Debunking the Neuromyth of Learning Style
  Alexandra Yfanti and Spyridon Doukakis

Expert Characteristics: Implications for Expert Systems
  Konstantinos G. Papageorgiou

Improving the Run-Time of Space-Efficient n-Gram Data Structures Using Apache Spark
  Fotios Kounelis, Andreas Kanavos, and Phivos Mylonas

Development of a Protein Biochip Platform for Parkinson’s Disease
  Christos Tzouvelekis, Marios Krokidis, Themis Exarchos, and Panayiotis Vlamos

The Use of Data Collection and Big Data Analysis in Neurodegenerative Disease Prevention
  Georgios Nikiforakis

Fractal Dynamics in the RR Interval of Craniopharyngioma and Adrenal Tumor in Adolescence
  Geronikolou S, Flora Bacopoulou, George I. Lambrou, and Dennis Cokkinos

Bioinformatics Approaches for Parkinson’s Disease in Clinical Practice: Data-Driven Biomarkers and Pharmacological Treatment
  Marios G. Krokidis, Themis Exarchos, and Panayiotis Vlamos

Emerging Machine Learning Techniques for Modelling Cellular Complex Systems in Alzheimer’s Disease
  Aristidis G. Vrahatis, Panagiotis Vlamos, Antigoni Avramouli, Themis Exarchos, and Maria Gonidi

Cognitive Enhancement Through Mathematical Problem-Solving
  Ioannis Saridakis and Spyridon Doukakis

Cognitive Tasks of an Information System for Memory Training and Cognitive Enhancement Using Mobile Devices
  Panagiota Giannopoulou

An Application for Exploring Visual Perception: A Pilot Neuroeducational Study
  Stavros Panakakis, Sophia Tsivoula, and Spyridon Doukakis

Gene Classification Based on Multi-Class SVMs with Systematic Sampling and Hierarchical Clustering (SSHC) Algorithm
  Nwayyin Najat Mohammed

Multinetwork Motor Learning as a Model for Dance in Neurorehabilitation
  Rebecca Barnstaple

Controlling the Chimera Form in the Leaky Integrate-and-Fire Model
  A. Provata, Ch. G. Antonopoulos, and P. Vlamos

Qualitative Differences Between the Semi-separable and the “Almansi-Type” Stokes Stream Function Expansions in the Study of Biological Fluids
  Maria Hadjinicolaou and Eleftherios Protopapas

A Multiscale Mathematical Model for Tumor Growth, Incorporating the GLUT1 Expression
  Pantelis Ampatzoglou, Foteini Kariotou, and Maria Hadjinicolaou

Correction to: Development of a Protein Biochip Platform for Parkinson’s Disease

Index
Editor Biography
Panagiotis Vlamos is the Chairman of the University Research Center of the Ionian University and Director of the Laboratory of Bioinformatics and Human Electrophysiology (BiHELab) of the Ionian University. He completed his undergraduate studies in Mathematics at the University of Athens and obtained his Doctorate in Mathematics from the National Technical University of Athens. He has received many awards and has authored more than 250 papers in scientific journals and conference proceedings, as well as 16 educational books. BiHELab’s goal is to help bridge the translation gap from data to models and from models to drug discovery and personalized therapy, promoting collaborations and developing novel approaches to biological and clinical problems, using high-performance computational methods. Professor Panagiotis Vlamos is a pioneer at the evaluation, improvement, and application of digital and computational biomarkers, while he first introduced the notion of meta-biomarkers.
Digging for Significant Genes in Microarray Expression Data Based on Systematic Sampling and Hierarchal Clustering Algorithm Nwayyin N. Mohammed
Abstract
Obesity is a worldwide health problem. Eating habits have changed during this decade, and an increase in the intake of high-fat foods and sugar has been observed, which is associated with obesity and weight gain. Therefore, in this chapter, we have analysed microarray expression data for obese and lean individuals. Microarray technology simultaneously records the expression levels of thousands of genes across related samples and during biological processes. The microarray data sets are enriched with crucial information that has to be examined. In the study discussed in this chapter, the microarray data sets are pre-processed prior to analysis, and upregulated and downregulated gene groups are identified. Clustering is a learning technique that is applied in many fields of study. Clustering of microarray data can be performed on genes or on samples, depending on the type of dataset. A hierarchical clustering algorithm was used to detect gene patterns in our candidate datasets. Because microarray data are considered big and complex, a systematic sampling technique was used to reduce the complexity of the microarray datasets and to enhance the clustering quality. This technique is a simple and convenient sampling technique. The proposed algorithm, Systematic Sampling with Hierarchical Clustering (SSHC), could detect significant gene patterns in the datasets and shows a better performance than hierarchical clustering alone. The validity index used to evaluate the SSHC algorithm is the adjusted Rand index (ARI).
N. N. Mohammed, University of Sulaimani, College of Science, Computer Department, Sulaymaniyah, Iraq
Keywords
Systematic Sampling · Hierarchal Clustering · Microarray expression data · Pre-processing · Validity Index · Principal Component Analysis (PCA)
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 P. Vlamos (ed.), GeNeDis 2020, Advances in Experimental Medicine and Biology 1338, https://doi.org/10.1007/978-3-030-78775-2_1
DNA microarray technology initiated the field of studying, analysing and identifying gene behaviours in groups. A single DNA microarray experiment generates expression profiles of thousands of genes, known as gene expression data. Gene expression profiles are a rich source of knowledge about genes. Thus, exploring gene expression data is crucial for gaining insights into gene regulation mechanisms and detecting variations in gene expression patterns under different disease conditions [1]. Therefore, the development and adaptation of computational methodologies and tools are required to analyse these data types [2].
Pre-processing microarray data is an essential step in gene expression data analysis. Pre-processing addresses several issues, for example, transforming the data into a scale suitable for analysis and removing the effects of systematic sources of variation [3]. Normalization is the first transformation applied to gene expression data. It adjusts the individual intensities to balance them so that meaningful biological comparisons can be made [4]. In this analysis, both microarray datasets are normalized as the first pre-processing step.
The identification of differentially expressed (DE) genes is an important task in microarray data analysis, and it helps to study disease-related, cell-specific gene expression patterns. There are different statistical approaches for identifying DE genes; the one used here is the t-test, from which P-values were obtained. The fold change (FC) is also calculated for the gene expression data. In this analysis, the DE genes are selected by combining the P-value and fold change measures. The threshold for the combined criterion (P-values and fold changes) depends on the microarray dataset [5].
Microarray datasets are considered big and complex, and the analysis of such big datasets is time-consuming. In addition, big data are a new and key component of recent statistical methodologies. One of the challenges in working with microarray data is their high dimensionality combined with a small sample size. While an abundance of information can be obtained from such a data set, the high dimensionality can obscure the information of interest. Principal component analysis (PCA) reduces the dimensions of big datasets. Classical PCA is not useful when the number of variables is larger than the number of observations, a condition that holds for microarray datasets. Therefore, in this analysis, PCA based on the decomposition of the real data matrix (singular value decomposition, SVD) is used [6].
In many studies, there is great interest in grouping or clustering objects using empirical similarity measures. Hierarchical clustering is one of the grouping approaches widely used with microarray data [7]. The clustering is represented as a tree: a set of nested clusters derived from a proximity matrix [8]. The common way to visualize the hierarchical clustering of a dataset is the dendrogram, a branching diagram that represents the similarity relationships between clusters [9].
Sampling is an important approach for exploring large datasets. Sampling methods depend on the accessible population. Sampling decreases the time it takes to explore big datasets and gain insights, and it also improves the quality of the insights [10]. Systematic sampling is an unbiased probability sampling method. For example, if the whole population is part of a list, systematic sampling would take every third item until the desired sample size is reached. It is better to randomize the starting position of the sampling in the list.
There are three types of cluster validity: external, internal and relative. The Rand index and the adjusted Rand index are external validity indices [11]. The adjusted Rand index is a form of the Rand index that is corrected for chance grouping [12]. In this analysis, the adjusted Rand index is the cluster validity index used to estimate the quality of the k partitions produced by the Systematic Sampling with Hierarchical Clustering (SSHC) algorithm.
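The systematic sampling idea described above (take every k-th item from a list, starting at a random position) can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: the helper name `systematic_sample`, the choice of interval k = N // n, and the wrap-around indexing past the end of the list are all assumptions made for the example.

```python
import random


def systematic_sample(items, n, rng=None):
    """Return n items taken at a fixed interval k from a random start.

    Hypothetical helper: k = len(items) // n is an assumed choice of
    sampling interval, and indices wrap around the end of the list.
    """
    rng = rng or random.Random()
    N = len(items)
    if not 0 < n <= N:
        raise ValueError("n must satisfy 0 < n <= len(items)")
    k = N // n            # sampling interval
    r = rng.randrange(N)  # random starting position (0-based)
    # Step through the list in jumps of k, wrapping past the end.
    return [items[(r + j * k) % N] for j in range(n)]


# Illustrative gene identifiers (synthetic, not from the chapter's data).
genes = [f"gene_{i}" for i in range(12)]
sample = systematic_sample(genes, 4, random.Random(0))
print(sample)
```

Because n * k never exceeds N, the n selected positions are always distinct, so no item is drawn twice.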
2 Methodology
2.1 Hierarchical Clustering Algorithm
Hierarchical clustering is a popular clustering algorithm. It depends on the similarity measure between samples and genes. Hierarchical clustering is either agglomerative or divisive. Agglomerative hierarchical clustering builds from the bottom up: each data point starts as an individual cluster. This type of clustering generates an interesting form of data visualization. The clusters do not overlap and have a tree-like structure. In divisive hierarchical clustering, by contrast, all the data points form one single cluster that is successively divided into multiple clusters [9, 13].
Agglomerative Hierarchical Clustering Algorithm (Fig. 1)
Input: X = {x1, …, xm}, a set of m data points.
Output: A single cluster {c}.
Steps:
1. Calculate the m × m proximity matrix D containing the distances d(xi, xj) ∀ i, j = 1, 2, …, m.
2. Iterate:
(i) Find the most similar pair of clusters ci and cj using Eq. (5):
$d_{\min}(c_i, c_j) = \min_{x_1 \in c_i,\ x_2 \in c_j} d(x_1, x_2)$ (5)
(ii) Combine clusters ci and cj into a single cluster.
(iii) Revise the proximity matrix D by removing the rows and columns corresponding to ci and cj and appending a row and column corresponding to the newly formed cluster. The proximity between this new cluster c and an old cluster ck is defined as in Eq. (6):
$d_{\min}(c, c_k) = \min\{d_{\min}(c_i, c_k),\ d_{\min}(c_j, c_k)\}$ (6)
3. Until only one cluster remains [14].
Fig. 1 The dendrogram: hierarchical agglomerative clustering (leaves A to G; axis: Similarity)
2.2 The Systematic Sampling
The main goal of this sampling method is to form the best representation of the population of interest. Systematic sampling is a common sampling method, and it is used when the sampling frame is not obtainable. Systematic sampling is legitimate as a probability sampling method, since the selection procedure avoids bias as long as no specific pattern exists in the sequence. The main advantage of systematic sampling is its simplicity and operational convenience, and it is more efficient than simple random sampling [11].
To take a systematic sample:
1. Define the study population as a fixed set of units labeled u = 1, …, N.
2. A systematic sample of size n is obtained by drawing a random integer r from 1, …, N and sampling the set of units given by s = {r, r + k, r + 2k, …, r + (n − 1)k}.
3. The term k is called the sampling interval.
4. For any j for which r + jk > N, the unit selected is r + jk − N.
Systematic sampling is used in many applications with any population that can be put in a list [15, 16].
3 Data Sets
The microarray datasets used in this study are gene expression profiles of mouse adipose tissue macrophages (ATMs) from two studies: one on ATMs isolated from lean and obese mice, and one on the metabolic activation of ATMs in obesity, which promotes inflammatory responses. This section describes each of these datasets.
A. Adipose tissue macrophages (ATMs). In mammals, an expansion of adipose tissue mass induces an accumulation of adipose tissue macrophages (ATMs). The first dataset consists of ATMs isolated from lean and obese mice, in which the transcripts that were differentially expressed between lean and obese mice are identified. The dataset is composed of 16 samples with expression values for 45,102 genes. This dataset is referred to as the ATMs dataset [17].
B. Adipose tissue macrophages (ATMs) with metabolic signatures. The study determined the metabolic signatures of adipose tissue macrophages (ATMs) in lean and obese conditions. Transcriptome analysis, real-time flux measurements, ELISA and several other approaches were used to determine the metabolic signatures and inflammatory status of the macrophages. The dataset is composed of 8 samples with expression values for 35,557 genes. This dataset is referred to as the adipose tissue macrophages with metabolic signatures (EATMs) dataset [18].
3.1 Adjusted Rand Index (ARI)
A measure of agreement is needed to compare clustering results against external criteria. Since each gene is assigned to only one class in the external criterion and to only one cluster, measures of agreement between two partitions can be used [19]. The adjusted Rand index (ARI) measures the similarity of two different partitions of the same data, ignoring permutations of the labels and adjusting the score to be close to 0 when the partitions are generated by chance. The adjusted Rand index is defined by Eq. 1, where RI is the Rand index and E[RI] is its expected value under chance agreement. The ARI is at most 1, and it can be negative when the agreement is lower than expected by chance. Higher ARI values indicate better partitions than lower ones [20].
$\mathrm{ARI} = \dfrac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]}$ (1)
4 The Proposed Algorithm (SSHC)
The proposed approach, that is, Hierarchical Clustering with Systematic Sampling (SSHC), is illustrated in the flowchart in Fig. 2.
Fig. 2 The Hierarchical Clustering Algorithm with Systematic Sampling (SSHC)
Figure 2 shows the flowchart of the Hierarchical Clustering with Systematic Sampling algorithm. The algorithm takes the gene expression data sets (raw data sets) as input. The raw data sets are pre-processed, and the downregulated and upregulated genes are selected based on statistical approaches, that is, on a combination of the P-values and fold changes obtained for the genes. In addition, the hierarchical clustering algorithm performs better when it is used with the first few principal components (PCs) of the genes, obtained by SVD. However, numerous genes still remain. Thus, the expression data of the selected genes are sampled using the systematic sampling (SS) method. The enhancement of the Hierarchical Clustering (HC) algorithm is achieved with the systematic sampling method and principal component analysis (PCA) based on decomposition of the real data matrix (SVD).
Hierarchical agglomerative clustering is a strategy that treats each gene as a single cluster and iteratively merges (or agglomerates) disjoint gene groups until the stop criterion is met. The bottom-up Hierarchical Agglomerative Clustering with Systematic Sampling (SSHAC) algorithm creates suboptimal clustering solutions, which are typically visualized as a dendrogram. The dendrogram represents the level of similarity between two adjacent gene clusters and allows the merging process to be rebuilt step by step. Any desired number of gene clusters can be obtained by cutting the dendrogram appropriately. The partitions produced by the SSHAC algorithm have been evaluated using the adjusted Rand index (ARI). The evaluation step is the final step of the proposed algorithm: the k partitions with a higher ARI value indicate better gene cluster quality.
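The agglomerative step described above (each gene starts as its own cluster, the closest groups are merged, and the process stops at a desired number of clusters) can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's code: the function names are hypothetical, the Euclidean distance is an assumed choice (the chapter does not state the metric), and the minimum-distance (single-linkage) rule is used for the distance between clusters. Stopping at k clusters corresponds to cutting the dendrogram at the level that leaves k clusters. For brevity the sketch recomputes pairwise distances on each pass instead of maintaining a proximity matrix.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_linkage(points, k):
    """Merge the closest pair of clusters until k clusters remain.

    Returns a list of clusters, each a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]  # one cluster per point
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Cluster distance = smallest distance between any two members.
            d = min(euclidean(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so a stays valid
    return clusters


# Synthetic 2-D points standing in for reduced gene expression profiles.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(single_linkage(points, 3))  # prints [[0, 1], [2, 3], [4]]
```

The cost of this naive version grows quickly with the number of points, which is one reason for reducing the input size before clustering.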
5 Results and Discussion
Hierarchical clustering scales poorly with data size in terms of memory and computing time. Evaluating the clustering performance to determine the quality of the partitions, in other words the number of natural patterns in the data sets, is one of the main goals of a microarray data clustering algorithm. The tables below show the efficiency of the proposed Systematic Sampling with Hierarchical Clustering (SSHC) algorithm when it is used on the microarray expression datasets in this analysis, the ATMs and EATMs data sets. The index used in this analysis is the adjusted Rand index, and for different K partition values of the microarray data sets, the SSHC algorithm performed differently according to the ARI values. Table 1 shows the values obtained using the hierarchical clustering algorithm without hybridizing it with systematic sampling: a lower ARI was obtained for both data sets (ATMs and EATMs) when the hierarchical algorithm alone was used to cluster the datasets, compared with the developed algorithm SSHAC. Table 2 shows the performance of the SSHC algorithm. On both data sets, ATMs and EATMs, the proposed algorithm showed better performance and higher adjusted Rand indices. In addition, the highest performance of the proposed SSHC algorithm was observed with the ATMs dataset, for which the adjusted Rand index is 0.8148. The computing time was also much better: less computing time was recorded.

Table 1 Performance of hierarchical clustering for the candidate datasets
Dataset   K partitions   Adjusted Rand index
ATMs      10, 5          0.6569
EATMs     10, 5          0.5183

Table 2 Performance of systematic sampling with hierarchical clustering for the candidate datasets
Dataset   K partitions   Adjusted Rand index
ATMs      5, 4           0.8148
EATMs     5, 3           0.6493
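An adjusted Rand index of the kind reported above can be computed from two label vectors (a reference partition and a clustering result) by counting pairs of items in their contingency table. The sketch below is a minimal illustration, not the chapter's code: the function name is hypothetical, and it uses the widely used pair-counting form of the index, in which the number of agreeing pairs plays the role of the index, its chance expectation is subtracted, and the result is divided by the maximum index minus the same expectation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label lists of equal length >= 2")
    # Pairs of items placed together in both partitions.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2        # largest attainable agreement
    if max_index == expected:              # degenerate case, e.g. one cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Same grouping with renamed labels: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Groupings that disagree more than chance would: negative score.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The first call shows that the score ignores how the clusters are named; the second shows that the score can fall below zero when two partitions agree less than random labelings would.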
6 Conclusion
The proposed Hierarchical Agglomerative Algorithm based on Systematic Sampling (SSHAC) reduces the time and space complexity of the common agglomerative hierarchical clustering algorithm. The detection of significant gene patterns within the two microarray datasets was much better with the SSHAC algorithm than with the ordinary agglomerative hierarchical clustering algorithm: k partitions with a higher adjusted Rand index (ARI) were detected. The effect of the number of partitions K on the performance of the SSHAC algorithm depends on the candidate microarray dataset. Appropriate values of K are those for which the SSHAC algorithm records a higher validity index and identifies groups of significant genes.
References
1. Bhattacharyya R, Bhattacharyya B (2008) Gene expression mining for cohesive pattern discovery. In: International conference on bioinformatics research and development, pp 221–234
2. Dubitzky W, Granzow M, Berrar D (2002) Data mining and machine learning methods for microarray analysis. In: Methods of microarray data analysis. Springer, pp 5–22
3. Amaratunga D, Cabrera J, Shkedy Z (2014) Exploration and analysis of DNA microarray and other high-dimensional data. John Wiley & Sons
4. Quackenbush J (2002) Microarray data normalization and transformation. Nat Genet 32:496–501
5. Dalman MR, Deeter A, Nimishakavi G, Duan Z-H (2012) Fold change and p-value cutoffs significantly alter microarray interpretations. BMC Bioinformatics:S11
6. Praus P (2005) SVD-based principal component analysis of geochemical data. Cent Eur J Chem 3(4):731–741
7. Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32:241–254
8. Patel S, Sihmar S, Jatain A (2015) A study of hierarchical clustering algorithms. In: 2015 2nd international conference on computing for sustainable global development (INDIACom), pp 537–541
9. Embrechts MJ, Gatti CJ, Linton J, Roysam B (2013) Hierarchical clustering for large data sets. In: Advances in intelligent signal processing and data mining. Springer, pp 197–233
10. Rojas JAR, Kery MB, Rosenthal S, Dey A (2017) Sampling techniques to improve big data exploration. In: 2017 IEEE 7th symposium on large data analysis and visualization (LDAV), pp 26–35
11. Huang K-C (2004) Mixed random systematic sampling designs. Metrika 59:1–11
12. Qannari EM, Courcoux P, Faye P (2014) Significance test of the adjusted Rand index. Application to the free sorting task. Food Qual Prefer 32:93–97
13. Liu W, Wang T, Chen S, Tang A (2009) Hierarchical clustering of gene expression data with divergence measure. In: 2009 3rd international conference on bioinformatics and biomedical engineering, pp 1–3
14. Hassan SI, Samad A, Ahmad O, Alam A (2019) Partitioning and hierarchical based clustering: a comparative empirical assessment on internal and external indices, accuracy, and time. Int J Inf Technol:1–8
15. Bujang MA, Ab Ghani P, Zolkepali NA, Adnan TH, Ali MM, Selvarajah S et al (2012) A comparison between convenience sampling versus systematic sampling in getting the true parameter in a population: explore from a clinical database: the Audit Diabetes Control Management (ADCM) registry in 2009. In: 2012 international conference on statistics in science, business and engineering (ICSSBE), pp 1–5
16. Bellhouse D (2014) Systematic sampling methods. Wiley StatsRef: Statistics Reference Online
17. Xu X, Grijalva A, Skowronski A, van Eijk M, Serlie MJ, Ferrante AW Jr (2013) Obesity activates a program of lysosomal-dependent lipid metabolism in adipose tissue macrophages independently of classic activation. Cell Metab 18:816–830
18. Boutens L, Hooiveld GJ, Dhingra S, Cramer RA, Netea MG, Stienstra R (2018) Unique metabolic activation of adipose tissue macrophages in obesity promotes inflammatory responses. Diabetologia 61:942–953
19. Yeung KY, Ruzzo WL (2001) Details of the adjusted rand index and clustering algorithms, supplement to the paper an empirical study on principal component analysis for clustering gene expression data. Bioinformatics 17:763–774
20. Ortiz-Bejar J, Tellez ES, Graff M, Ortiz-Bejar J, Jacobo JC, Zamora-Mendez A (2019) Performance analysis of K-means seeding algorithms. In: 2019 IEEE international autumn meeting on power, electronics and computing (ROPEC), pp 1–6
A DSS for Predicting Lymphoma in Primary Sjogren’s Syndrome Patients Nikos Avgoustis and Themis Exarchos
Abstract
Primary Sjogren’s syndrome (pSS) is a chronic autoimmune disease followed by exocrine gland dysfunction. In this work, a web application was developed as a screening test based on a machine learning model that was trained on clinical data and is used to predict lymphoma outcomes in pSS patients. The results of the final model reveal a sensitivity of 100%, an accuracy of 82%, and an area under the curve of 98%, and confirm the importance of the C4 value, lymphadenopathy, and rheumatoid factor as prominent lymphoma predictors.
Keywords
Primary Sjogren’s Syndrome · Screening test · Machine Learning
1 Introduction
Primary Sjogren’s syndrome (pSS) is a chronic systemic autoimmune disease causing exocrine gland dysfunction, with clinical manifestations varying from dry eye and mouth to the development of non-Hodgkin lymphoma (NHL), which occurs in about 5% of pSS patients [4, 5]. pSS primarily affects women near menopausal age, with a reported ratio of 9 females affected per male, making pSS one of the chronic autoimmune diseases with the most imbalanced gender ratio [4, 5]. Moreover, the study of autoimmune lesions in minor salivary gland tissues has revealed cellular pathways that are highly associated with aggressive disease phenotypes that are prone to lymphoma development [2]. Previously suggested histopathological, clinical and laboratory risk factors for lymphoma development, for prognostic and diagnostic purposes, include salivary gland enlargement (SGE), rheumatoid factor (RF), cryoglobulinemia, C4 hypocomplementemia, and purpura, among others [3, 7, 10]. In this work, a web application based on a supervised machine learning model was built to aid the decision-making process of clinicians.
N. Avgoustis · T. Exarchos, Ionian University, Department of Informatics, Corfu, Greece
2 Materials and Methods
2.1 Data
The data were acquired from the University of Athens and cover 449 anonymous patients who have been diagnosed with primary Sjogren’s syndrome (pSS). The dataset is the curated version of the dataset used in [9]. It consists of 65 columns and 449 rows: 64 features plus the diagnosis of the patient (lymphoma or not). The features are either discrete (0 or 1) or positive numeric values and consist of laboratory measurements (e.g., C3, C4, WBC baseline) and demographic data. The dataset is imbalanced in the target class lymphoma (0–1), with 76 instances in the positive class and 373 in the negative class. The features of the dataset are presented in Table 1. First, we reserve 10% of the dataset for testing, which gives a testing set of 45 instances, 38 of them in the negative class and 7 in the positive class. That leaves 404 instances for training, 335 of them in the negative class and 69 in the positive class. To deal with the class imbalance, we divide the negative class into five subsets and pair each of them with the positive class, which gives five training sets for our models, each with 69 lymphoma and 69 not lymphoma.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 P. Vlamos (ed.), GeNeDis 2020, Advances in Experimental Medicine and Biology 1338, https://doi.org/10.1007/978-3-030-78775-2_2
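The hold-out and balancing steps of Sect. 2.1 can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the random feature matrix, the seed, and all variable names are assumptions. It also assumes the five negative subsets are disjoint, in which case 335 negatives give 67 per subset rather than 69; how the sets described in the text reach 69 negatives is not stated, so the sketch does not try to reproduce that.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 449 rows, 64 features,
# 76 positive (lymphoma) and 373 negative labels, as in the text.
X = rng.random((449, 64))
y = np.array([1] * 76 + [0] * 373)

# Shuffle the row indices of each class separately.
pos = rng.permutation(np.flatnonzero(y == 1))
neg = rng.permutation(np.flatnonzero(y == 0))

# Reserve 7 positive and 38 negative rows for testing (counts from the text).
test_idx = np.concatenate([pos[:7], neg[:38]])
train_pos, train_neg = pos[7:], neg[38:]

# Assumption: split the 335 training negatives into five disjoint parts
# and pair each part with all 69 training positives.
balanced_sets = []
for neg_part in np.array_split(train_neg, 5):
    idx = np.concatenate([train_pos, neg_part])
    balanced_sets.append((X[idx], y[idx]))

for i, (X_i, y_i) in enumerate(balanced_sets, start=1):
    print(f"set {i}: {int(y_i.sum())} positive, {int((y_i == 0).sum())} negative")
```

Each of the five sets reuses the same positive instances, so models trained on different sets are not independent of one another.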
2.2 Feature Selection
To select the optimal feature set, feature ranking was applied, and we retained the 15 features with the highest scores. The ranking method was Recursive Feature Elimination with tenfold Cross-Validation (RFECV) [8], with an Extra Trees Classifier as the estimator, applied to each of the five datasets. RFECV ranks features by recursive feature elimination and selects the best number of features by cross-validation (Table 2); from the resulting ranking, the 15 highest-ranked features were chosen for each dataset.
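A minimal sketch of this selection step with scikit-learn is given below. It runs on synthetic data, and the elimination step size, the number of trees, and the tie-breaking rule for the 15 best features are all assumptions, since the text does not specify them. RFECV assigns rank 1 to every feature it keeps, so the sketch breaks ties with the Extra Trees feature importances.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for one balanced training set (138 rows, 64 features).
X, y = make_classification(
    n_samples=138, n_features=64, n_informative=10, random_state=0
)

# Recursive feature elimination with tenfold cross-validation.
# step=8 (features dropped per iteration) is an assumption chosen for speed.
estimator = ExtraTreesClassifier(n_estimators=25, random_state=0)
selector = RFECV(estimator, step=8, cv=10, scoring="accuracy")
selector.fit(X, y)
print("number of features RFECV keeps:", selector.n_features_)

# Fixed-size alternative: the 15 best-ranked features.
# Sort by RFECV rank first, then by descending importance to break ties.
importances = (
    ExtraTreesClassifier(n_estimators=25, random_state=0)
    .fit(X, y)
    .feature_importances_
)
top15 = np.lexsort((-importances, selector.ranking_))[:15]
print("indices of the 15 best-ranked features:", sorted(top15.tolist()))
```

In the setting described above, this would be repeated once per balanced dataset.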
2.3 Models
Four machine learning models were used: AdaBoost, Gradient Boost, Logistic Regression, and Gaussian Naive Bayes. Each was evaluated with tenfold cross-validation on each of the five datasets, both with the features that RFECV suggested and with the 15 best features that we chose. The model that produced the highest accuracy was then trained on the corresponding dataset (Table 3).
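The comparison can be sketched as below for a single dataset and a single feature set. The data are synthetic and all hyperparameters are scikit-learn defaults (apart from `max_iter`), which is an assumption; the text does not report the settings that were actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for one balanced training set with 15 selected features.
X, y = make_classification(
    n_samples=138, n_features=15, n_informative=6, random_state=0
)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient Boost": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Mean tenfold cross-validated accuracy for each candidate model.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, acc in cv_accuracy.items():
    print(f"{name}: {acc:.3f}")

# Refit the best-scoring model on the whole training set.
best_name = max(cv_accuracy, key=cv_accuracy.get)
best_model = models[best_name].fit(X, y)
print("selected:", best_name)
```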
3 Results
In Tables 4 and 5, the results from testing the models on the reserved 10% of the dataset are presented. The selected features are as follows: “SGE,” “C4,” “FS 1st biopsy,” “chronic fatigue,” “low C4,” “lymphadenopathy,” “CRP,” “ESR,” “urine-specific gravity at diagnosis,” “C3,” “urine PH at first visit,” “Tarpley,” “RF (20 = 1),” “lymphocyte number,” and “RF+.”
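The abstract and discussion report an accuracy of 82%, a sensitivity of 100%, and a specificity of 79% for the final model. As a consistency check, the short calculation below shows one confusion matrix on the 45-instance test set (7 positive, 38 negative) that yields these rounded figures. The individual cell counts are inferred from the reported percentages and are an assumption; they are not taken from Tables 4 and 5.

```python
# Test-set class counts from Sect. 2.1.
n_pos, n_neg = 7, 38

# Assumed confusion-matrix cells, inferred from the reported percentages.
tp, fn = 7, 0   # all lymphoma cases detected
tn, fp = 30, 8  # 30 of 38 non-lymphoma cases correctly rejected

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (n_pos + n_neg)

print(f"sensitivity={sensitivity:.2f}")
print(f"specificity={specificity:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Under these counts, every error is a false positive, which fits the role of a screening test: no lymphoma case in the test set is missed, at the cost of some false alarms.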
4 Web Application Implementation
The application was built on Flask, a framework for developing web applications in Python, and is currently hosted on Heroku at https://medicaldss.herokuapp.com/. The application consists of a form in which the user enters the feature values. When the user submits the form, the data are passed to our model, and the prediction probability is displayed to the user in the form of a chart (Fig. 1).
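The request flow described above might look roughly like the following Flask sketch. It is not the deployed application: the `/predict` route, the three-field `FEATURES` list, and the stand-in model fitted on random data are all hypothetical, and the response is returned as JSON rather than rendered as a chart.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression

# Hypothetical subset of the form fields; the real form has 15 features.
FEATURES = ["C4", "lymphadenopathy", "RF"]

# Stand-in model fitted on random data so the sketch is self-contained.
# A real deployment would load the trained model instead.
rng = np.random.default_rng(0)
model = LogisticRegression().fit(
    rng.random((40, len(FEATURES))), np.array([0, 1] * 20)
)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted form fields and convert them to numbers.
    try:
        values = [float(request.form[name]) for name in FEATURES]
    except (KeyError, ValueError):
        return jsonify(error="all features must be present and numeric"), 400
    # Probability of the positive (lymphoma) class.
    probability = model.predict_proba([values])[0, 1]
    return jsonify(lymphoma_probability=float(probability))
```

Saved as a module, the sketch can be served locally with the `flask run` command, and a front end could then draw the returned probability as a chart.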
5 Discussion
In this work, a Logistic Regression model based on clinical data was used to create a screening web application for predicting binary lymphoma outcomes in pSS patients. The model predicts lymphoma outcomes with an accuracy of 82%, a sensitivity of 100%, a specificity of 79%, and an AUC score of 0.98 using 15 features. It also confirms the importance of C4 value, lymphadenopathy, and rheumatoid factor as prominent lymphoma predictors, as previous works suggested [1, 6]. Prior to the evaluation of the model on a subset of clinical data that were reserved from the original dataset, great
Table 1 Data description
Feature
Sex (female = 1)
Dry mouth, subjective
Dry eyes, subjective
Abnormal Schirmer’s
Ana+
RF (< 20 = 0; > 20 = 1)
Anti-Ro (0–1)
Anti-La (0–1)
Tarpley
SGE
Raynaud
Ro/La
RF+
Monoclonal gammopathy (blood)
LOW C4 (