Studies in Computational Intelligence 981
Nguyen Hoang Phuong Vladik Kreinovich Editors
Soft Computing: Biomedical and Related Applications
Studies in Computational Intelligence Volume 981
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/7092
Editors

Nguyen Hoang Phuong
Informatics Division, Thang Long University, Hanoi, Vietnam

Vladik Kreinovich
Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-030-76619-1 ISBN 978-3-030-76620-7 (eBook) https://doi.org/10.1007/978-3-030-76620-7 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
In medical decision making, it is very important to take into account the experience of medical doctors and thus, to supplement traditional statistics-based data processing techniques with methods of computational intelligence, methods that allow us to take this experience into account. In some cases, the existing computational intelligence techniques—often, after creative modifications—can be efficiently used in biomedical applications. Examples of such applications are given in the first part of this book. The corresponding applications deal with diagnostics and treatment of different types of cancer, cardiac diseases, pneumonia, stroke, and many other diseases—including COVID-19.

Biomedical problems are difficult. As a result, in many situations, the existing computational intelligence techniques are not sufficient to solve the corresponding problems. In such situations, we need to develop new techniques and, ideally, first show their efficiency on other applications, to make sure that these techniques are indeed efficient. Such techniques and their applications are described in the second part of this book. The corresponding applications include optimization (i.e., single-criterion decision making), multi-criteria decision making, applications to agriculture, to computer networks, to economics and business, to pavement engineering, to politics, to quantum computing, to robotics, and to many other areas. The fact that these techniques are efficient in so many different areas makes us hope that they will be useful in biomedical applications as well.

We hope that this volume will help practitioners and researchers to learn more about computational intelligence techniques and their biomedical applications and to further develop this important research direction.

We want to thank all the authors for their contributions and all anonymous referees for their thorough analysis and helpful comments. The publication of this volume was partly supported by Thang Long University and by the Institute of Information Technology, Vietnam Academy of Science and Technology—both in Hanoi, Vietnam. Our thanks to the leadership and staff of these institutions for providing crucial support. Our special thanks to Prof. Hung T. Nguyen for his valuable advice and constant support.
We would also like to thank Prof. Janusz Kacprzyk (Series Editor) and Dr. Thomas Ditzinger (Senior Editor, Engineering/Applied Sciences) for their support and cooperation with this publication.

December 2020

Nguyen Hoang Phuong
Vladik Kreinovich
Contents

Biomedical Applications of Computational Intelligence Techniques

Bilattice CADIAG-II: Theory and Experimental Results
Paolo Baldi, Agata Ciabattoni, and Klaus-Peter Adlassnig

A Combination Model of Robust Principal Component Analysis and Multiple Kernel Learning for Cancer Patient Stratification
Thanh Trung Giang, Thanh-Phuong Nguyen, Quang Trung Pham, and Dang Hung Tran

Attention U-Net with Active Contour Based Hybrid Loss for Brain Tumor Segmentation
Dang-Tien Nguyen, Thi-Thao Tran, and Van-Truong Pham

Refining Skip Connections by Fusing Multi-scaled Context in Neural Network for Cardiac MR Image Segmentation
Nhu-Toan Nguyen, Minh-Nhat Trinh, Thi-Thao Tran, and Van-Truong Pham

End-to-End Hand Rehabilitation System with Single-Shot Gesture Classification for Stroke Patients
Wai Kin Koh, Quang H. Nguyen, Youheng Ou Yang, Tianma Xu, Binh P. Nguyen, and Matthew Chin Heng Chua

Feature Selection Based on Shapley Additive Explanations on Metagenomic Data for Colorectal Cancer Diagnosis
Nguyen Thanh-Hai, Toan Bao Tran, Nhi Yen Kim Phan, Tran Thanh Dien, and Nguyen Thai-Nghe

Clinical Decision Support Systems for Pneumonia Diagnosis Using Gradient-Weighted Class Activation Mapping and Convolutional Neural Networks
Thao Minh Nguyen Phan and Hai Thanh Nguyen

Improving 3D Hand Pose Estimation with Synthetic RGB Image Enhancement Using RetinexNet and Dehazing
Alysa Tan, Bryan Kwek, Kenneth Anthony, Vivian Teh, Yifan Yang, Quang H. Nguyen, Binh P. Nguyen, and Matthew Chin Heng Chua

Imbalance in Learning Chest X-Ray Images for COVID-19 Detection
Dang Xuan Tho and Dao Nam Anh

Deep Learning Based COVID-19 Diagnosis by Joint Classification and Segmentation
Tien-Thanh Tran, Thi-Thao Tran, and Van-Truong Pham

General Computational Intelligence Techniques and Their Applications

Why It Is Sufficient to Have Real-Valued Amplitudes in Quantum Computing
Isaac Bautista, Vladik Kreinovich, Olga Kosheleva, and Hoang Phuong Nguyen

On an Application of Lattice-Valued Integral Transform to Multicriteria Decision Making
Michal Holčapek and Viec Bui Quoc

Fine-Grained Network Traffic Classification Using Machine Learning: Evaluation and Comparison
Tuan Linh Dang and Van Chuong Do

Soil Moisture Monitoring System Based on LoRa Network to Support Agricultural Cultivation in Drought Season
Tien Cao-Hoang, Kim Anh Su, Trong Tinh Pham Van, Viet Truyen Pham, Duy Can Nguyen, and Masaru Mizoguchi

Optimization Under Fuzzy Constraints: Need to Go Beyond Bellman-Zadeh Approach and How It Is Related to Skewed Distributions
Olga Kosheleva, Vladik Kreinovich, and Hoang Phuong Nguyen

Towards Parallel NSGA-II: An Island-Based Approach Using Fitness Redistribution Strategy
Le Huy Hoang, Nguyen Viet Long, Nguyen Ngoc Thu Phuong, Ho Minh Hoang, and Quan Thanh Tho

A Radial Basis Neural Network Approximation with Extended Precision for Solving Partial Differential Equations
Thi Thuy Van Le, Khoa Le-Cao, and Hieu Duc-Tran

Why Some Power Laws Are Possible and Some Are Not
Edgar Daniel Rodriguez Velasquez, Vladik Kreinovich, Olga Kosheleva, and Hoang Phuong Nguyen

How to Estimate the Stiffness of a Multi-layer Road Based on Properties of Layers: Symmetry-Based Explanation for Odemark's Equation
Edgar Daniel Rodriguez Velasquez, Vladik Kreinovich, Olga Kosheleva, and Hoang Phuong Nguyen

Need for Diversity in Elected Decision-Making Bodies: Economics-Related Analysis
Nguyen Ngoc Thach, Olga Kosheleva, and Vladik Kreinovich

Fuzzy Transform for Fuzzy Fredholm Integral Equation
Irina Perfilieva and Pham Thi Minh Tam

Constructing an Intelligent Navigation System for Autonomous Mobile Robot Based on Deep Reinforcement Learning
Nguyen Thi Thanh Van, Ngo Manh Tien, Nguyen Manh Cuong, and Nguyen Duc Duy

One-Class Support Vector Machine and LDA Topic Model Integration—Evidence for AI Patents
Anton Thielmann, Christoph Weisser, and Astrid Krenz

HDBSCAN: Evaluating the Performance of Hierarchical Clustering for Big Data
Tat-Huy Tran, Tuan-Dung Cao, and Thi-Thu-Huyen Tran

Applying Deep Reinforcement Learning in Automated Stock Trading
Hieu Trung Nguyen and Ngoc Hoang Luong

Telecommunications Services Revenue Forecast Using Neural Networks
Quoc-Dinh Truong, Nam Van Nguyen, Thuy Thi Tran, and Hai Thanh Nguyen

Product Recommendation System Using Opinion Mining on Vietnamese Reviews
Quoc-Dinh Truong, Trinh Diem Thi Bui, and Hai Thanh Nguyen
Biomedical Applications of Computational Intelligence Techniques
Bilattice CADIAG-II: Theory and Experimental Results

Paolo Baldi, Agata Ciabattoni, and Klaus-Peter Adlassnig
Abstract CADIAG-II is a functioning experimental fuzzy expert system for computer-assisted differential diagnosis in internal medicine. To overcome the current limitations of the system, we propose an extension based on bilattices. The proposed changes were implemented and reviewed in a retrospective evaluation of 3,131 patients with extended information about the patients' medical histories, physical examinations, laboratory test results, clinical investigations, and—last but not least—clinically confirmed discharge diagnoses.
P. Baldi
Department of Philosophy, University of Milan, Via Festa del Perdono 7, 20122 Milan, Italy
e-mail: [email protected]

A. Ciabattoni
Institute of Logic and Computation, Vienna University of Technology, Favoritenstrasse 9, 1040 Vienna, Austria
e-mail: [email protected]

K.-P. Adlassnig (B)
Section for Artificial Intelligence and Decision Support, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Spitalgasse 23, 1090 Vienna, Austria
Medexter Healthcare, Borschkegasse 7/5, 1090 Vienna, Austria
e-mail: [email protected]; [email protected]

1 Introduction

1.1 Background

Computer-based support of medical diagnosis and treatment has a long tradition. Early approaches were based on statistical methods such as Fisher's discriminant analysis to classify symptom patterns into diseased or non-diseased categories [39]. Others used Bayes' theorem to assign probabilities to the possible presence of diseases [20, 38]. The first medical expert system was MYCIN [34], whose purpose was to give advice for the diagnosis and the treatment of patients with infectious
diseases. Equipped with a well-grounded heuristic rule-based approach to determine diagnostic and therapeutic proposals, MYCIN was extensively tested; its performance was comparable to that of humans [41]. A variety of logical and probabilistic reasoning approaches in medical diagnosis have been compared in [36]. Artificial intelligence methods and systems for medical consultation were discussed, among others, in [11, 25]; an extended threaded bibliography was provided in [30, 40]. CADIAG-I and CADIAG-II were also early approaches to provide differential diagnostic support. Based on logical approaches described in the seminal paper by Ledley and Lusted [26], CADIAG-I was extended in several subsequent versions [4]. CADIAG-II employs fuzzy sets and fuzzy logic, as described in this report, and gave rise to a variety of refined modeling approaches [13–17, 22, 31, 32, 37]. It was extensively tested in various fields of clinical application [3, 6, 27–29]. Recent approaches to clinical decision support for the selection of diagnosis and therapy mainly consist of machine learning and “big” data approaches. Successful applications include image pattern recognition in fields such as radiology [10] and pathology [33]. IBM’s Watson for oncology is one of many recent machine learning system approaches; its aim is to provide recommendations for the treatment of breast cancer. However, its success appears to be limited [35].
1.2 CADIAG Systems

Computer-Assisted DIAGnosis (CADIAG) systems are data-driven, rule-based expert systems for computer-assisted consultation in internal medicine [1, 4, 7, 12]. Their development dates back to the early 1980s at the University of Vienna Medical School (now Medical University of Vienna). The systems provide diagnostic hypotheses as well as confirmed and excluded diagnoses in response to the input of a list of symptoms, signs, laboratory test results, and clinical findings pertaining to a patient. When possible, they also explain the indicated conclusions and propose further useful examinations. The first system of the family—CADIAG-I—dealt with three-valued logical variables (present, unknown, absent) and IF-THEN relationships between given three-valued input on the one hand and diagnoses on the other. Kleene's logic provides all the necessary formal definitions, see also [4]. However, the real-world patient's input (symptoms, interpreted signs, laboratory test results, and clinical findings) is usually inherently (linguistically) vague and necessarily includes borderline cases. Moreover, a large part of the given medical knowledge about definitional, causal, statistical, and heuristic relationships between a patient's input and described diseases is intrinsically uncertain. Measurements are sometimes imprecise, linguistic categories are characterized by fuzzy borderlines, the co-occurrence of symptoms and diseases is stochastically uncertain, and both medical data and medical knowledge are often incomplete. Therefore, computer systems for medical decision-making usually cannot generate clinically accurate results when based on formal systems whose objects can only be either absolutely true, absolutely false, or unknown (as in Kleene's logic).
The successor system CADIAG-II can process both definite and uncertain information. CADIAG-II is based on fuzzy set theory [43] to deal with linguistic medical terms, and on fuzzy logic to define and process weighted IF-THEN rules [14]. Despite this improvement, CADIAG-II has been criticized for its inability to deal with negative evidence [16], and with rules diminishing the certainty of a particular diagnosis apart from complete exclusion.
1.3 Objective The aim of the present report is to introduce an extension of CADIAG-II, which includes negative knowledge, and experimentally evaluate the presented proposal. An earlier attempt confined to theory was published in [17]. Here we introduce Bilattice CADIAG-II, an extension of CADIAG-II based on bilattices, and validate the theoretical results by means of a retrospective evaluation in a newly-programmed CADIAG-II implementation. The data of 3,131 patients, including extended medical histories, physical examinations, and laboratory and clinical test results were analyzed with the original CADIAG-II and with Bilattice CADIAG-II. The results were compared with the corresponding clinically confirmed diagnoses at the time of discharge. All underlying real patient data including the corresponding clinical discharge diagnoses originate from a hospital near Vienna. In addition to preserving all the inference results of CADIAG-II, Bilattice CADIAG-II was able to infer the absence of 679 diseases which could not be inferred by CADIAG-II previously. We believe that creating a new knowledge base explicitly designed for Bilattice CADIAG-II, which would make extensive use of negative rules and counter-evidence for a medical conclusion other than total exclusion, could still further improve the (already very good) performance of Bilattice CADIAG-II. After introducing the backgrounds of the CADIAG-II system, we discuss the basics of bilattices and present Bilattice CADIAG-II. We then provide an overview of its implementation, describe the results of the retrospective evaluation, and discuss the performance of the presented extension.
2 Background—CADIAG-II

2.1 Overall Consultation Process

Inferring a diagnosis from a given set of patient's medical data in all CADIAG implementations is achieved in four steps, which are shown in Fig. 1.

[Fig. 1 The consultation process in CADIAG (from [7], p. 208): central admittance, medical documentation, basic and special laboratory programs, and clinical investigations provide the patient's personal and medical information; a patient data fuzzy interpreter aggregates this information and assigns quantitative test results; the diagnostic consultation system produces confirmed diagnoses, excluded diagnoses, diagnostic hypotheses, explanations, examination proposals, and unexplained findings, which are collected in an extended diagnostic report list.]

Step 1: The physician (or some allied medical personnel) enters personal and medical data about the patient. These usually consist of detailed observational data, such as the medical history, signs from physical examinations, quantitative laboratory test
results, and the outcome of clinical investigations (e.g., X-ray and ultrasonography). CADIAG makes a clear distinction between patient-recounted, physician-reported, and laboratory-measured data and their abstraction as clinical, usually linguistic terms applied in diagnostic discourse. Step 2: A transformation step named data-to-symbol conversion abstracts or aggregates patient information into clinical terms [8]. Aggregation combines one or more documentation items from the electronic health record into an abstract symptom, sign, laboratory or clinical result using logical operators. Here, two-valued Boolean logic is applied. Abstraction is used to transform quantitative test results into abstract medical concepts, and give them a particular evidence value ∈ [0, 1]. An example of an abstracted symptom is ‘elevated serum glucose level’, which is set according to the quantitative result of the glucose test and the definition of elevated. The formal modeling of semantic medical concepts such as ‘elevated’ that considers their inherent unsharpness of boundaries in linguistic concepts, visible in their gradual transition to adjacent medical concepts, is based on fuzzy set theory. Fuzzy sets are defined by membership functions, which assign to every symptom Si a degree of membership μ Si . These degrees express the level of compatibility of the measured concrete value with the semantic concepts under consideration. They range from zero to unity, wherein zero stands for ‘not compatible’ and unity for ‘fully compatible’ (see Fig. 2). Step 3: Starting with the set of medical entities and their corresponding evidence values generated by data-to-symbol conversion, CADIAG infers sets of confirmed diagnoses, diagnostic hypotheses, excluded diagnoses, and unexplained findings. The basic concept CADIAG-II’s inference mechanism relies upon is the compositional rule of fuzzy inference [42], which allows inference under uncertainty. The rules contained in the knowledge base are iteratively applied to the set of medical entities pertaining to the patient until a fixpoint is reached. Step 4: In addition to the diagnostic results, CADIAG proposes a list of useful examinations that will possibly confirm or exclude some of the generated diagnostic hypotheses. The generated diagnostic results are explained in detail by a separate explanatory system.
Fig. 2 Symbolic representation of medical entities using fuzzy sets (from [7], p. 211)
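To make the data-to-symbol conversion step concrete, the sketch below (in Python) shows how an abstracted symptom such as 'elevated serum glucose level' could be assigned an evidence value in [0, 1] by a simple non-decreasing membership function. The function name and the breakpoints are illustrative assumptions, not CADIAG-II's actual definition of 'elevated'.

def ramp(x, a, b):
    """Non-decreasing membership function: 0 for x <= a, 1 for x >= b,
    linear in between (a gradual transition between adjacent concepts)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def elevated_serum_glucose(glucose_mg_dl):
    # Hypothetical breakpoints in mg/dL, chosen only for illustration.
    return ramp(glucose_mg_dl, a=100.0, b=126.0)

print(elevated_serum_glucose(90.0))   # not compatible -> 0.0
print(elevated_serum_glucose(113.0))  # borderline case -> value strictly between 0 and 1
print(elevated_serum_glucose(180.0))  # fully compatible -> 1.0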
2.2 Knowledge Representation Definitional, causal, statistical, or heuristic relationships between single and compound fuzzy logical antecedents (left-hand side) and consequences (right-hand side) are represented as IF-THEN rules. Rules with a single medical entity as antecedent, such as a symptom or an abstracted laboratory test result, express associations between two medical entities. Compound antecedents are represented as combinations of medical entities connected by and, or, and not, as well as the operators at least and at most. They permit the definition of pathophysiological states as well as the incorporation of specific complex, but medically well-known criteria for diagnosing diseases. The associations between the IF- and the THEN-part of the rules are characterized by two kinds of relationships: the frequency of occurrence (FOO) of the antecedent with the consequence, and the strength of confirmation (SOC) of the antecedent for the consequence.
2.2.1 Rating of Medical Entities and Data-to-Symbol Conversion
Reported and measured medical data are always assigned their natural data type, i.e., integers or real numbers for laboratory findings, and one of the Boolean values TRUE or FALSE for binary data. In data-to-symbol conversion, CADIAG-II assigns a real number in [0, 1], a 'strength of evidence', to every symptom by applying the two mechanisms described in Step 2 (abstraction and/or aggregation), wherein a value of 1 means that the corresponding symptom is fully present, while values in ]0, 1[ mean that the symptom is present in the patient to a certain degree. Symptoms that can definitely be excluded are assigned a value of 0. A dedicated value representing 'unknown' is assigned to non-examined medical entities. Since data-to-symbol aggregation rules only operate
on Boolean data items, the operators and, or, not, at least and at most are interpreted and applied in their natural manner.
2.2.2 Interpretation of FOO and SOC and Type of Rules
Relationships between medical entities are represented as rules being attributed with the frequency of occurrence FOO and the strength of confirmation SOC. The interpretation for FOO and SOC as proposed in [7] is the following: given a set of patients P,

FOO = Σ_a min{α(a), β(a)} / Σ_a β(a)

SOC = Σ_a min{α(a), β(a)} / Σ_a α(a)

where α(a) and β(a) are the degrees to which the entities α and β apply to a patient a, and the sums range over all patients a in P. The patient database associated with CADIAG-II did not contain enough patients for calculating all numbers FOO and SOC by the above formulas. But this is true for any patient database, even for those of large hospitals—one does not have enough data to calculate all associations between all symptoms and all diseases! For this reason, most of these values were estimated by clinical experience of physicians and taken from published data in medical text books and scientific medical journals.

Both FOO and SOC are real numbers in [0, 1]. Similar to evidence values, the values 0 and 1 are also specifically interpreted in CADIAG-II. SOC = 1 ensures that the right-hand side of the rule holds, if the IF-part is true. FOO = 1 means that the left-hand side has to occur with the right-hand side; otherwise, the right-hand side is excluded. FOO = 0 and SOC = 0 say that the left-hand side never occurs with the right-hand side in this rule (and vice versa): if the IF-part is true, the right-hand side must be excluded. According to these definitions, rules in CADIAG-II may express the following IF-THEN relationships between two expressions α and β:

1. α implies β to the degree d ∈ (0, 1]
2. α excludes β
3. the exclusion of α implies the exclusion of β.

Thus, a distinction is made between three groups of rules. This classification is based on the following interpretation of FOO and SOC:

cd, representing 'confirming to the degree d' (cd), when 0 < SOC = d ≤ 1 and 0 < FOO < 1

me, representing 'mutually exclusive' (me), when SOC = 0 and FOO = 0
ao, representing 'always occurring' (ao), when 0 < SOC < 1 and FOO = 1

A prototype of a CADIAG-II rule would be:

D77 : SYC7 with FOO = 1.0, SOC = 1.0
SYC7 : (D1 ∧ S602) ∧ ¬((S1001 ∨ S758) ∨ S761),

where D77 is 'seropositive rheumatoid arthritis, stage I', D1 is 'rheumatoid arthritis', S602 is 'Waaler Rose test, positive', S1001 is 'X-ray, joints, symptoms of arthritis, erosions', S758 is 'X-ray, joints, partial dislocation', and S761 is 'X-ray, joints, ankylosis of the peripheral joints'. This rule, which is of the type (cd) with d = 1, is interpreted as follows:

IF rheumatoid arthritis
   AND Waaler Rose test, positive
   AND NOT (X-ray, joints, symptoms of arthritis, erosions
            OR X-ray, joints, partial dislocation
            OR X-ray, joints, ankylosis of the peripheral joints)
THEN seropositive rheumatoid arthritis, stage I
The left-hand side of the rule confirms the right-hand side or may confirm it to a certain degree, while the left-hand side obligatorily occurs with the right-hand side of the rule. Thus, if the IF-part is evaluated to 0, the right-hand side will be excluded.
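Before moving on to inference, the following small sketch shows how the FOO and SOC estimates defined at the beginning of this subsection could be computed from paired membership degrees over a set of patients. The toy data and variable names are assumptions made for this example only.

def foo_soc(alpha, beta):
    """Estimate FOO and SOC from the degrees alpha[a] and beta[a] in [0, 1]
    to which the antecedent and the consequence apply to each patient a."""
    num = sum(min(a, b) for a, b in zip(alpha, beta))
    foo = num / sum(beta)    # frequency of occurrence of the antecedent with the consequence
    soc = num / sum(alpha)   # strength of confirmation of the antecedent for the consequence
    return foo, soc

# toy degrees for five patients (purely illustrative)
alpha = [1.0, 0.8, 0.0, 0.6, 1.0]   # antecedent (e.g., a symptom)
beta  = [1.0, 1.0, 0.0, 0.2, 0.7]   # consequence (e.g., a disease)
print(foo_soc(alpha, beta))          # -> (FOO, SOC), both in [0, 1]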
2.2.3 Inference and Operator Usage
The central concept of CADIAG-II's inference is the compositional rule of fuzzy inference [42]. Using the strength of evidence of medical entities after data-to-symbol conversion and all rules from the knowledge base as starting point, the inference mechanism calculates the degree of evidence μPD for a patient P and a particular disease Dj using the following equations:

For hypotheses generation and confirmation (rules of type (cd)):

μ1PD(P, Dj) = max_Si min{ μPS(P, Si), μSD^SOC(Si, Dj) }

For exclusion by present, but excluding symptoms (rules of type (me)):

μ2PD(P, Dj) = max_Si min{ μPS(P, Si), 1 − μSD^SOC(Si, Dj) }

For exclusion by absent, but obligatory symptoms (rules of type (ao)):

μ3PD(P, Dj) = max_Si min{ 1 − μPS(P, Si), μSD^FOO(Si, Dj) }
Here, μPS denotes the strength of evidence for patient P and a particular symptom Si , μSD denotes the FOO and SOC relationships, resp., between symptom Si and disease D j . For every symptom-disease relationship, i.e., for every rule in the knowledge base, the minimum of μPS and μSD is interpreted as the strength of evidence
implied by a particular symptom. The overall strength of evidence for a particular disease is calculated as the maximum of all evidences from rules indicating this particular disease. There is one exception to this procedure: if at least one rule infers an evidence of 0 (or exclusion), then the evidence of the corresponding disease is always set to 0 and is thus excluded. An additional evidence value ω is used during inference to represent contradictions. If a medical entity has been proven by the inference process, i.e., set to 1, and another rule infers exclusion or evidence of 0, then the evidence of the involved medical concept is set to ω and the inference process is stopped due to this contradiction for the involved entities. All other inferences continue to be processed. Inference steps applying all possible rules to the available evidence are repeated until the change of every evidence value within one inference step is less than a given threshold (e.g., 0.01), i.e., until a fixpoint is reached. Symptom-symptom, symptom combination-disease, and disease-disease relationships also exist; thus μSS(Si, Sj), μSCD(SCi, Dj), and μDD(Di, Dj) are part of the extensive CADIAG-II knowledge base (for details, see [2]). For the evaluation of the truth values of complex antecedents in inference rules, 'and' is calculated as min, 'or' as max, and 'not' as the complement (1 − x); 'at least i of n' uses the i-th smallest of n evidence values, and 'at most i of n' uses the i-th largest of n evidence values.
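The following sketch illustrates the min-max inference described above for a single disease: 'and'/'or'/'not' in compound antecedents evaluated as min/max/complement, evidence from (cd) rules combined by max, and a simplified reading of the exclusion mechanism for (me) and (ao) rules. The rule encoding is an assumption made for this example and is not the actual CADIAG-II knowledge-base format.

# 'and', 'or' and 'not' for compound antecedents (cf. the end of Sect. 2.2.3);
# in this simplified sketch each rule already refers to a single evaluated antecedent
AND = min
OR = max
def NOT(x):
    return 1.0 - x

def infer_disease(rules, evidence):
    """Evidence for one disease from all rules pointing to it.

    rules: list of dicts with keys 'symptom', 'type' ('cd', 'me' or 'ao'),
           'soc' and 'foo'; evidence: dict symptom -> strength in [0, 1].
    Returns the overall strength of evidence; 0.0 means the disease is excluded.
    """
    confirmations = []
    for r in rules:
        mu_ps = evidence.get(r['symptom'])
        if mu_ps is None:                      # non-examined entity: rule not applied
            continue
        if r['type'] == 'cd':                  # hypothesis generation / confirmation
            confirmations.append(min(mu_ps, r['soc']))
        elif r['type'] == 'me' and min(mu_ps, 1.0 - r['soc']) == 1.0:
            return 0.0                         # present, but excluding symptom
        elif r['type'] == 'ao' and min(1.0 - mu_ps, r['foo']) == 1.0:
            return 0.0                         # absent, but obligatory symptom
    return max(confirmations, default=0.0)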
3 Bilattice CADIAG-II CADIAG-II can only express the total exclusion of a medical entity and is unable to provide for so-called negative evidence, i.e., indicating the absence of a particular medical entity not only with certainty but also to a certain degree. Moreover, the syntax of CADIAG-II rules impedes the definition of rules giving graded evidence against a medical entity, and the compositional rule of inference will always prefer the higher rating of an entity over a lower rating (except in case of exclusion). These properties are sometimes listed as weaknesses of the CADIAG-II system [16]. To overcome these limitations, we propose an extension of CADIAG-II which was mainly inspired by peculiar algebraic structures known as bilattices. We provide an introduction to the subject before explaining the proposal and its implementation.
3.1 Algebraic Preliminaries on Bilattices Let us start by briefly recalling the definition of lattices, which are among the most important algebraic structures in logic [9]. Definition 1. Let S be a non-empty set and ≤ an order relation on S. (S, ≤) is known as a lattice if, given any x, y ∈ S, there exist in S both the infimum (the greatest
element smaller than x, y according to the order ≤) and the supremum (the smallest element greater than x, y according to the order ≤) of {x, y} with respect to ≤. The operations ∧ and ∨ defined by x ∧ y = inf{x, y} and x ∨ y = sup{x, y} are known as lattice operations. A unary operation ¬ is named a negation on a lattice if, for all x, y ∈ S:

• ¬¬x = x,
• x ≤ y ⇒ ¬y ≤ ¬x.

Definition 2. A lattice (S, ≤) is considered bounded when the maximum and minimum element exist in (S, ≤), i.e., elements in S which are greater (or smaller) than any other elements of S. A bounded lattice (S, ≤) is complete if, for every non-empty X ⊆ S, inf X and sup X belong to S.

Example 1. A very natural example of a complete lattice with negation is the structure ([0, 1], ≤, ¬, 0, 1) where [0, 1] is the interval of real numbers between 0 and 1; ≤ is the usual ordering of the real numbers, and ¬x = 1 − x. We will refer to this structure as a standard real lattice.

Bilattices were introduced by Ginsberg [21] as a general framework for various applications in artificial intelligence. The underlying concept is to deal with two order relations. The first represents a 'degree of truth' and is, in fact, just a generalization of the usual ordering of truth values in classical and in multi-valued logic. The second ordering is meant to represent the quantity of information obtained for a proposition. Degrees of knowledge permit representation of the difference between 'not knowing if a proposition is true or false' (the proposition is evaluated with the minimum of the knowledge order) and 'knowing that a proposition is false' (the proposition is evaluated with the minimum of the truth order). More formally, we have the following:

Definition 3. Let Bt = (B, ≤t, False, True) and Bk = (B, ≤k, ⊥, ⊤) be complete lattices, where B is a non-empty set, False, True are the minimum and maximum for ≤t, and ⊥ and ⊤ are the minimum and maximum for ≤k. We refer to the structure B = (B, ≤t, ≤k, False, True, ⊥, ⊤) as bilattice. A negation over B is a unary operation ¬ such that:

• ¬¬x = x
• x ≤t y ⇔ ¬y ≤t ¬x
• x ≤k y ⇔ ¬x ≤k ¬y.

As Bt and Bk are lattices, for each one of them we will have two corresponding lattice operations, denoted with ∧t, ∨t and ∧k, ∨k respectively. Note that the intended meaning of the orderings is only revealed by the notion of negation: the truth order ≤t is indeed reverted by negation, while the knowledge order ≤k is preserved. A prominent example of bilattice, which will be used in the sequel, is the so-called product bilattice; see [19]. The elements of this structure are pairs, which are intended to represent reasons for and reasons against the truth of a given proposition.
Example 2. Let L = (L, ≤, 0, 1) be a complete lattice. We refer to the following structure as the product bilattice over L:

B(L) = (L × L, ≤t, ≤k, (0, 1), (1, 0), (0, 0), (1, 1))

where:

• (x, y) ≤t (x′, y′) ⇔ x ≤ x′ and y′ ≤ y
• (x, y) ≤k (x′, y′) ⇔ x ≤ x′ and y ≤ y′
• (0, 1) and (1, 0) are minimum and maximum, respectively, for ≤t
• (0, 0) and (1, 1) are minimum and maximum, respectively, for ≤k.

We may introduce a negation over B(L) by letting ¬(x, y) = (y, x).
Informally, given two elements of a product bilattice (i.e., two pairs of values) a and b, the example above says that "a is less true than b" when for a there are fewer reasons for and more reasons against than for b, while "a is less known than b" when for a there are both, fewer reasons for and fewer reasons against than for b. From the relation between bilattices and lattice orderings, it is easy to establish how bilattice operations in a product bilattice relate to the original lattice ones. Indeed, we have:

• (x, y) ∧t (x′, y′) = (x ∧ x′, y ∨ y′)
• (x, y) ∨t (x′, y′) = (x ∨ x′, y ∧ y′)
• (x, y) ∧k (x′, y′) = (x ∧ x′, y ∧ y′)
• (x, y) ∨k (x′, y′) = (x ∨ x′, y ∨ y′).
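A minimal sketch of the product bilattice over the standard real lattice [0, 1], with elements represented as pairs (reasons for, reasons against), the two orders, the four lattice operations, and the negation from Example 2:

# Product bilattice over the standard real lattice [0, 1]:
# elements are pairs (x, y) = (reasons for, reasons against).

def meet_t(p, q):   # (x, y) ∧t (x', y') = (x ∧ x', y ∨ y')
    return (min(p[0], q[0]), max(p[1], q[1]))

def join_t(p, q):   # (x, y) ∨t (x', y') = (x ∨ x', y ∧ y')
    return (max(p[0], q[0]), min(p[1], q[1]))

def meet_k(p, q):   # (x, y) ∧k (x', y') = (x ∧ x', y ∧ y')
    return (min(p[0], q[0]), min(p[1], q[1]))

def join_k(p, q):   # (x, y) ∨k (x', y') = (x ∨ x', y ∨ y')
    return (max(p[0], q[0]), max(p[1], q[1]))

def neg(p):         # ¬(x, y) = (y, x)
    return (p[1], p[0])

def leq_t(p, q):    # truth order: more reasons for, fewer reasons against
    return p[0] <= q[0] and q[1] <= p[1]

def leq_k(p, q):    # knowledge order: more of both kinds of reasons
    return p[0] <= q[0] and p[1] <= q[1]

FALSE, TRUE, BOTTOM, TOP = (0.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)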
3.2 An Extension of CADIAG-II Based on Bilattices—bCADIAG-II

We have now introduced all prerequisites to describe the proposed extension of CADIAG-II based on bilattices (hence the name Bilattice CADIAG-II or bCADIAG-II, for short). By applying the concept of product bilattices, we simply associate each basic entity with not just a single degree in [0, 1] but a pair, representing reasons for and reasons against the truth of the entity. The interpretation of these values is as follows: a value of 0 means that we do not have evidence (or counter-evidence) of this medical entity, while a value of 1 is interpreted as full evidence (or full exclusion). Intermediate values denote insufficient evidence to either fully confirm or fully exclude the entity in question. Since data-to-symbol conversion in CADIAG-II uses fuzzy sets and rules for more or less evident (borderline) symptoms (and none for more or less excluded symptoms), we do not have any direct initial evaluation of counter-evidence that we could directly use in an extended version of the system. Therefore, whenever the data-to-symbol conversion issues the value c to a particular entity, we associate with that entity the pair of evidence and counter-evidence (c, 1 − c).
From now on, each basic entity of the system will be represented as α (s, t), where α is an atomic formula and (s, t) an element of the product bilattice, where the value s stands for reasons for α, and the value t for reasons against α. Recall that compound rules of CADIAG-II deal with complex logical formulas. Therefore, after an initial evaluation of the entities, i.e., an association of a pair of values to them, compound formulas will be obtained as follows. For any basic entities α, β in CADIAG-II, and (s, t), (u, v) ∈ B(L):

(∧)   α (s, t)    β (u, v)
      --------------------------
      (α ∧ β) (s ∧ u, t ∨ v)

(¬)   α (s, t)
      --------------------------
      ¬α ¬(s, t) = (t, s)
Let us now focus on the IF-THEN rules of the system. We will focus on their role in transmitting knowledge (in the sense of bilattices) rather than merely truth. We will represent each such rule in the format

IF α THEN β (x, y)

where α and β are compound formulas, denoting symptoms and disease, respectively, and (x, y) is a pair of values in the product bilattice. Given a compound formula α evaluated with the pair (u, v) in the product bilattice, and an IF-THEN rule as the one above, we will then compute the value of the conclusion β simply as follows:

(k)   IF α THEN β (s, t)    α (u, v)
      ------------------------------
      β (s, t) ∧k (u, v)
The form of the rule might suggest a sort of ‘knowledge modus ponens’, where the value associated with the antecedent and the one associated with the rule are combined in order to obtain the value of the consequent, by means of the operation ∧k . However, it should be noted that the pair of values attached to IF α THEN β should not be regarded as ‘reasons for’ and ‘reasons against’ an implication α → β, but rather as a measure of how much, from the values of evidence and counter-evidence of the antecedent, we can infer evidence and counter-evidence of the consequent, respectively. The crucial issue is then to associate such pairs of values with each of the IF-THEN rules, since such values are not immediately provided from the dataset. Depending on the type of rules, we proceed as follows: • We represent rules of type cd and ao, as: IF α THEN β (SOC, FOO).
Note that rules of type ao, i.e., those rules where the full exclusion of the premise implies the full exclusion of the conclusion, are just a particular case of rules of type cd with FOO = 1. • We generalize rules of type me as IF α THEN ¬β (1 − SOC, 1 − FOO). Note that the mutually exclusive rule of CADIAG-II is a particular case of the me rule above, with SOC = 0 and FOO = 0. For each of the above rules cd, ao, and me, we may use (k) to combine the pair associated with the premises and the one associated with the IF-THEN rule in order to obtain the relevant pair associated with the conclusion.

The use of FOO in the cd rules is a new feature in our proposal. It expands the inferential power of our system with respect to CADIAG-II: values of FOO different from 1 were indeed present in rules of type cd of CADIAG-II, but were not previously used at all. The use of FOO is justified by the following consideration: (1) FOO is a generalization of the conditional probability of the premise of the rule, given the conclusion; (2) such a value is directly proportional to the conditional probability of the negation of the conclusion, given the negation of the premises; (3) the latter is a measure of how much the exclusion of the premises allows inference of the exclusion of the conclusion. Note that, as a limit case, the rules of type (ao), i.e., those rules where the full exclusion of the premise implies the full exclusion of the conclusion, are actually those with FOO = 1. Finally, the use of the pair (1 − SOC, 1 − FOO) rather than (SOC, FOO) for the rules of type me is justified by the fact that we are using ¬β rather than β as the conclusion of the rule.

So far, we have presented a different way of dealing with rules and inferences for CADIAG-II. Since the bilattice operation ∧k is used for combining the premise in the rule (k), the focus of the inference process will no longer be on truth, but on knowledge order, aiming to maximize the latter. In this spirit, it appears reasonable to require that, for any entity β, all pairs of values produced for β by the system via applications of (k) are then combined through ∨k.

Remark 1. All logical operations, including negation, are monotone with respect to knowledge order, so that a fixpoint can be found for each entity.

Remark 2. The generalization of the rules of type me will only have an effect if negative rules are incorporated into knowledge bases.

Let us recall that, in CADIAG-II, the value 0 (for falsity) was treated in a different way than other values, because it was given preference over higher non-zero results, while the highest value was always chosen for all remaining truth values. This shows that a knowledge order was already implicitly involved there. The value 0, which stands for 'totally false', was indeed taken to provide more knowledge than other intermediary values, namely the full exclusion of a given entity. This is treated in a more elegant and coherent way in our approach.
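Putting the pieces together, the sketch below shows one way the evaluation of a single entity in bCADIAG-II could look: initial pairs (c, 1 - c) from data-to-symbol conversion, cd/ao rules carried as (SOC, FOO) pairs and me rules as (1 - SOC, 1 - FOO) with a negated conclusion, the rule (k) applied via ∧k, and all produced pairs combined via ∨k. It reuses meet_k, join_k and neg from the previous sketch; the rule encoding, and folding the ¬β contribution of an me rule back into β's pair via the bilattice negation, are assumptions made for this illustration rather than the actual MedFrame implementation.

def initial_pair(c):
    """Evidence/counter-evidence pair assigned after data-to-symbol conversion."""
    return (c, 1.0 - c)

def apply_rule_k(rule_pair, antecedent_pair):
    """Rule (k): combine the pair attached to IF alpha THEN beta with alpha's pair via ∧k."""
    return meet_k(rule_pair, antecedent_pair)

def evaluate_entity(rules, pairs):
    """Combine, via ∨k, all pairs produced for one entity beta by the applicable rules.

    rules: list of dicts with keys 'antecedent', 'kind' ('cd', 'ao' or 'me'),
           'soc' and 'foo'; pairs: dict antecedent -> (for, against) pair.
    """
    result = (0.0, 0.0)                              # bottom: no knowledge yet
    for r in rules:
        alpha = pairs[r['antecedent']]
        if r['kind'] in ('cd', 'ao'):                # IF alpha THEN beta (SOC, FOO)
            produced = apply_rule_k((r['soc'], r['foo']), alpha)
        else:                                        # 'me': IF alpha THEN NOT beta (1-SOC, 1-FOO)
            produced = neg(apply_rule_k((1.0 - r['soc'], 1.0 - r['foo']), alpha))
        result = join_k(result, produced)
    return result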
4 Implementation and Experimental Results CADIAG-II was originally developed and run on an IBM host computer system [18] (at the Department of Medical Computer Sciences, University of Vienna Medical School) which is no longer in operation. In order to evaluate the described improvements, both CADIAG-II and bCADIAG-II were implemented within the PC-based medical expert system shell MedFrame [24], and comparatively tested against a set of patients with clinically confirmed discharge diagnoses.
4.1 MedFrame, CADIAG-II, and bCADIAG-II MedFrame [24] is an expert system shell designed especially for implementing medical expert systems using fuzzy concepts. It provides the medical knowledge engineer with • various knowledge representation formalisms to store medical knowledge and reflect adequate inference mechanisms, • concepts for modeling and handling uncertainty in medical terminology and medical relations, with special emphasis on fuzzy methodologies, • mechanisms for storing patient data and history in a standardized manner, • concepts for representing medical knowledge in a standardized way, and • utilities for implementing inference mechanisms easily and rapidly. The rheumatological knowledge base of the original CADIAG-II [5] which currently contains 1,126 symptoms and 170 documented diagnoses was imported into MedFrame’s knowledge database, resulting in 658 fuzzy sets and 2,143 rules for data-to-symbol conversion, as well as 21,470 IF-THEN rules for inference (982 symptom-symptom, 368 disease-disease, 61 symptom-combination-disease, and 20,041 symptom-disease relationships). MedFrame’s utilities for developing inference mechanisms were used to re-implement CADIAG-II based upon the original IBM host implementation. In combination, the transferred knowledge base as well as the newly implemented inference mechanisms entirely comply with all the approaches described in Sect. 2. In addition, the modified inference process of bCADIAG-II described in Sect. 3.2 was implemented. The operators were rendered capable of dealing with unknown entities analogous to CADIAG-II. Therefore, bCADIAG-II incorporates the following improvements: • Inclusion of negative evidence. In addition to the strength of evidence, for every medical concept the strength of counter-evidence is also maintained in form of a product bilattice. • Advanced handling of FOO. In addition to (me) rules, FOO and ∧k are also used in the evaluation of (cd ) rules. • Application of ∨k for calculating the overall evidence of a medical concept.
Table 1 Classification of interpretations

Class      Categorization
Full hit   All discharge diagnoses were contained in the diagnostic results as confirmed or hypothetical
75–99%     Between 75% and 99% of all discharge diagnoses were contained in the diagnostic results as confirmed or hypothetical
50–74%     Between 50% and 74% of all discharge diagnoses were contained in the diagnostic results as confirmed or hypothetical
25–49%     Between 25% and 49% of all discharge diagnoses were contained in the diagnostic results as confirmed or hypothetical
1–24%      Between 1% and 24% of all discharge diagnoses were contained in the diagnostic results as confirmed or hypothetical
No hit     None of the discharge diagnoses were contained in the diagnostic results as confirmed or hypothetical
4.2 Evaluation and Results For evaluating the performance of bCADIAG-II compared to that of CADIAG-II, the data of 3,131 anonymized hospitalized rheumatic patients were imported into MedFrame’s patient database, including extended clinical data from the patients’ histories, physical examinations, laboratory tests, and clinical investigations. Furthermore, all the 8,445 clinically confirmed discharge diagnoses of these patients were also transferred to MedFrame and used as a diagnostic gold standard. The number of discharge diagnoses in the set of available patient data ranged from 0–9 (mean 2.59, median 2). The 3,131 patients were then analyzed by both implementations, CADIAG-II and bCADIAG-II, applying the same knowledge base. This was done in batch mode. Since the patient data cases only contained clinically confirmed existing diseases and no information about definitely absent diseases, the evaluation was focused on opposing the discharge diagnoses to confirmed diagnoses and diagnostic hypotheses. In this context, confirmed is equivalent to a strength of evidence of 1.0, and hypothetical is equivalent to a strength of evidence between 0.4 and 0.99. Moreover, for bCADIAG-II the strength of counter-evidence of a concept was required to be less than 0.4 in order to be considered hypothetical. The interpretation for each patient was assigned to one of six classes, as shown in Table 1. The results of the evaluation are listed in Table 2.
Table 2 Interpretation results

Class                                          CADIAG-II       bCADIAG-II
Full hit                                       937/29.89%      937/29.89%
75–99%                                         209/6.68%       209/6.68%
50–74%                                         1,217/38.87%    1,217/38.87%
25–49%                                         367/11.72%      367/11.72%
1–24%                                          15/0.48%        15/0.48%
No hit                                         387/12.36%      387/12.36%
Number of confirmed diagnoses                  569             569
Number of diagnostic hypotheses                20,777          20,779
Number of excluded diagnoses                   50,789          50,765
Mean/median/maximum of confirmed diagnoses     0.17/0/3        0.17/0/3
Mean/median/maximum of diagnostic hypotheses   6.37/6/22       6.37/6/22
Mean/median/maximum of excluded diagnoses      15.57/16/36     15.57/16/36

4.3 Discussion of Results

The original CADIAG-II/RHEUMA implementation was evaluated several times focusing on confirmation of the correctness and soundness of the generated diagnostic
results in retrospective and prospective studies [23, 28]. In contrast, the evaluation at hand did not check the results for correctness, but concentrated solely on the impact of the undertaken change of the underlying inference process on the outcome. Table 2 clearly shows that bCADIAG-II performs just as well as CADIAG-II. The inferred results are identical, except for differences in the number of generated excluded diagnoses. Since neither of the modifications has an impact on the calculation of positive evidences, it is no wonder that the inference results are identical with respect to confirmed and hypothetical diagnoses. Apart from the 50,765 excluded diagnoses, bCADIAG-II additionally infers 703 hypothetical absent diagnoses, i.e., diagnoses with a strength of evidence less than 0.4 and a strength of counter-evidence more than 0.4. Twenty-four of these cases are the reason for the difference in the number of excluded diagnoses (50,789 in CADIAG-II and 50,765 in bCADIAG-II). In bCADIAG-II, the ‘knowledge modus ponens’ assigns strength of evidence to these concepts, which is an improvement of the inference process. Apart from these 24 cases, an additional 679 hypothetical absent diagnoses for 258 patients are provided by bCADIAG-II. These numbers demonstrate the high potential of using negative evidence, especially in the process of differential diagnosis. A computer-based differential diagnostic system, accordingly equipped, would provide the physician with information about diseases which are most likely not present, and thus direct the physician’s attention to other diseases. Yet, since the CADIAG-II/RHEUMA knowledge base does not utilize these concepts (except for full exclusion) and the improvements in the results are only due to advancements in the inference process, there is a clear need to re-design the respective knowledge base. It should include the use of the concept of negative evidence.
Apart from comparative results, the evaluation confirms the results of previous studies. bCADIAG-II was able to infer at least one of the available discharge diagnoses for 88% of the reference patients, and more than 75% of all discharge diagnoses for more than 36% of them. While the given evaluation employed a threshold of 0.4 for diagnostic hypotheses, it was set to 0.2 in [28]. An evaluation of 3,131 reference patients with bCADIAG-II and a threshold of 0.2 resulted in the detection of at least one of the available discharge diagnoses for 95.5% of the reference patients, and more than 75% of all discharge diagnoses for more than 68% of patients.
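For completeness, here is a sketch of how the interpretation classes of Table 1 and the acceptance thresholds described in Sect. 4.2 could be computed per patient. Function and field names, as well as the exact boundary handling, are assumptions of this sketch, not the evaluation code actually used.

def interpretation_class(discharge_diagnoses, results, threshold=0.4):
    """Assign one of the six classes of Table 1 to a patient.

    discharge_diagnoses: collection of clinically confirmed discharge diagnoses;
    results: dict diagnosis -> (evidence, counter_evidence) inferred by the system.
    A diagnosis counts if it is confirmed (evidence 1.0) or hypothetical
    (evidence >= threshold and, for bCADIAG-II, counter-evidence < threshold).
    """
    def accepted(d):
        evidence, counter = results.get(d, (0.0, 0.0))
        return evidence == 1.0 or (evidence >= threshold and counter < threshold)

    if not discharge_diagnoses:
        return 'no discharge diagnoses available'
    hit_rate = sum(1 for d in discharge_diagnoses if accepted(d)) / len(discharge_diagnoses)
    if hit_rate == 1.0:
        return 'Full hit'
    if hit_rate >= 0.75:
        return '75-99%'
    if hit_rate >= 0.50:
        return '50-74%'
    if hit_rate >= 0.25:
        return '25-49%'
    if hit_rate > 0.0:
        return '1-24%'
    return 'No hit'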
5 Conclusions

After some further steps in the formalization of the CADIAG-II inference process [12], bCADIAG-II should be another measure towards putting CADIAG-II onto an extended solid formal basis. By applying the concept of product bilattices, the prerequisites for including negative evidence (i.e., rules diminishing the certainty of a particular diagnosis) into the CADIAG-II system were established. The experimental results proved the identical behavior of CADIAG-II and bCADIAG-II and confirmed the quality of inference results in comparison with former evaluations. In addition, bCADIAG-II increased the quantity of generated results in the form of indications to absent diseases other than those excluded with the existing knowledge base. Nevertheless, the evaluation clearly showed that significant improvements can only be achieved by re-creating the knowledge base and making extensive use of negative rules and counter-evidence other than total exclusion.

Acknowledgements This research was partly supported by the Vienna Science and Technology Fund (WWTF), Grant no. MA07-016. We are indebted to Andrea Rappelsberger for her extended assistance in formatting and finalizing this report.
References 1. Adlassnig, K.-P.: A fuzzy logical model of computer-assisted medical diagnosis. Methods Inf. Med. 19(3), 141–148 (1980) 2. Adlassnig, K.-P.: Fuzzy set theory in medical diagnosis. IEEE Trans. Syst. Man Cybern. SMC16(2), 260–265 (1986) 3. Adlassnig, K.-P., Akhavan-Heidari, M.: CADIAG-2/GALL: an experimental expert system for the diagnosis of gallbladder and biliary tract diseases. Artif. Intell. Med. 1(2), 71–77 (1989) 4. Adlassnig, K.-P., Grabner, G.: The Viennese computer-assisted diagnostic system. Its principles and values. Automedica 3, 141–150 (1980) 5. Adlassnig, K.-P., Kolarz, G.: CADIAG-2: computer-assisted medical diagnosis using fuzzy subsets. In: Gupta, M.M., Sanchez, E. (eds.) Approximate Reasoning in Decision Analysis, pp. 219–247. North-Holland Publishing Company, Amsterdam (1982) 6. Adlassnig, K.-P., Scheithauer, W.: Performance evaluation of medical expert systems using ROC curves. Comput. Biomed. Res. 22(4), 297–313 (1989)
7. Adlassnig, K.-P., Kolarz, G., Scheithauer, W., Grabner, G.: Approach to a hospital-based application of a medical expert system. Med. Inform. 11(3), 205–223 (1986) 8. Boegl, K., Leitich, H., Kolousek, G., Rothenfluh, T., Adlassnig, K.-P.: Clinical data interpretation in MedFrame/CADIAG-4 using fuzzy sets. Biomed. Eng. Appl. Basis Commun. 8(6), 488–495 (1996) 9. Burris, S., Sankappanavar, H.P.: A Course in Universal Algebra. Graduate Texts in Mathematics. Springer, New York (1981) 10. Choy, G., et al.: Current applications and future impact of machine learning in radiology. Radiology 288(2), 318–328 (2018) 11. Chunyan, A., Shunshan, Y., Hui, D., Quan, Z., Liang, Y.: Application and development of artificial intelligence and intelligent disease diagnosis. Current Pharm. Des. 26(26), 3069– 3075 (2020) 12. Ciabattoni, A., Vetterlein, T.: On the (fuzzy) logical content of CADIAG-2. Fuzzy Sets Syst. 161(14), 1941–1958 (2010) 13. Ciabattoni, A., Vetterlein, T., Adlassnig, K.-P.: A formal framework for Cadiag-2. In: Adlassnig, K.-P., Blobel, B., Mantas, J., Masic, I. (eds.) Medical Informatics in a United and Healthy Europe – Proceedings of MIE 2009 – The XXIInd International Congress of the European Federation for Medical Informatics, Studies in Health Technology and Informatics, vol. 150, pp. 648–652. IOS Press, Amsterdam (2009) 14. Ciabattoni, A., Picado-Muiño, D., Vetterlein, T., El-Zekey, M.: Formal approaches to rulebased systems in medicine: the case of CADIAG-2. Int. J. Approx. Reason. 54(1), 132–148 (2013) 15. Daniel, M.: Remarks on a cyclic inference in the fuzzy expert system CADIAG-IV. In: Phuong, N.H., Ohsato, A. (eds.) VJFuzzy 1998: Vietnam-Japan Bilateral Symposium on Fuzzy Systems and Applications, Halong Bay, Vietnam, 30th September – 2nd October 1998, Proceedings, pp. 619–627, Hanoi (1998) 16. Daniel, M.: Theoretical comparison of inference in CADIAG and MYCIN-like systems. Tatra Mountains Math. Publ. 16(2), 255–272 (1999) 17. Daniel, M., Hájek, P., Nguyen, P.H.: CADIAG-2 and MYCIN-like systems. Artif. Intell. Med. 9(3), 241–259 (1997) 18. Fischler, F.: Die Wissensbasis und der Inferenzprozeß des medizinischen Expertensystems CADIAG-II/E. Diploma thesis, University of Vienna, Vienna (1994) 19. Fitting, M.: Kleene’s logic, generalized. J. Logic Comput. 1(6), 797–810 (1990) 20. Fryback, D.G.: Bayes’ theorem and conditional nonindependence of data in medical diagnosis. Comput. Biomed. Res. 11(5), 423–434 (1978) 21. Ginsberg, M.: Multivalued logics: a uniform approach to inference in artificial intelligence. Comput. Intell. 4(3), 265–316 (1988) 22. Hajek, P., Phuong, N.H.: Möbius transform for CADIAG-2. J. Comput. Sci. Cybern. 13(3), 103–122 (1997) 23. Kolarz, G., Adlassnig, K.-P.: Problems in establishing the medical expert systems CADIAG-1 and CADIAG-2 in rheumatology. J. Med. Syst. 10(4), 395–405 (1986) 24. Kopecky, D., Adlassnig, K.-P.: A framework for clinical decision support in internal medicine. In: Schreier, G., Hayn, D., Ammenwerth, E. (eds.) Tagungsband der eHealth2011 – Health Informatics meets eHealth – von der Wissenschaft zur Anwendung und zurueck, Grenzen ueberwinden – Continuity of Care, 26.–27. Mai 2011, Wien, pp. 253–258. Oesterreichische Computer Gesellschaft, Wien (2011) 25. Kulikowski, C.A.: Artificial intelligence methods and systems for medical consultation. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-2(5), 464–476 (1980) 26. Ledley, R.S., Lusted, L.B.: Reasoning foundations of medical diagnosis. 
Symbolic logic, probability, and value theory aid our understanding of how physicians reason. Science 130(3366), 9–21 (1959) 27. Leitich, H., Adlassnig, K.-P., Kolarz, G.: Development and evaluation of fuzzy criteria for the diagnosis of rheumatoid arthritis. Methods Inf. Medicine 35(4–5), 334–342 (1996)
28. Leitich, H., Kiener, H.P., Kolarz, G., Schuh, C., Graninger, W., Adlassnig, K.-P.: A prospective evaluation of the medical consultation system CADIAG-II/RHEUMA in a rheumatological outpatient clinic. Methods Inf. Medicine 40(3), 213–220 (2001) 29. Leitich, H., Adlassnig, K.-P., Kolarz, G.: Evaluation of two different models of semiautomatic knowledge acquisition for the medical consultant system CADIAG-II/RHEUMA. Artif. Intell. Medicine 25(3), 215–225 (2002) 30. Miller, R.A.: Medical diagnostic decision support systems—Past, present, and future: a threaded bibliography and brief commentary. J. Am. Medical Inform. Assoc. 1(1), 8–27 (1994) 31. Picado Muiño, D., Ciabattoni, A., Vetterlein, T.: (2013) Towards an interpretation of the medical expert system CADIAG 2. In: Seising, R., Tabacchi, M.E. (eds.) Fuzziness and Medicine: Philosophical Reflections and Application Systems in Health Care, Studies in Fuzziness and Soft Computing 302, pp. 323–338. Springer, Berlin (2013) 32. Rusnok, P., Vetterlein, T., Adlassnig, K.-P.: Cadiag-2 and fuzzy probability logics. In: Adlassnig, K.-P., Blobel, B., Mantas, J., Masic, I. (eds.) Medical Informatics in a United and Healthy Europe – Proceedings of MIE 2009 – The XXIInd International Congress of the European Federation for Medical Informatics, Studies in Health Technology and Informatics,vol. 150, p. 773. IOS Press, Amsterdam (2009) 33. Serag, A., et al.: Translational AI and deep learning in diagnostic pathology. Front. Medicine 6, 185 (2019) 34. Shortliffe, E.H.: Computer-Based Medical Consultations: MYCIN. Artificial Intelligence Series 2. Elsevier, New York (1976) 35. Somashekhar, S.P., et al.: Watson for oncology and breast cancer treatment recommendations: agreement with an expert multidisciplinary tumor board. Ann. Oncol. 29(2), 418–423 (2018) 36. Szolovits, P., Pauker, S.G.: Categorical and probabilistic reasoning in medical diagnosis. Artif. Intell. 11(1-2), 115–144 (1978) 37. Vetterlein, T., Adlassnig, K.-P.: The medical expert system CADIAG-2, and the limits of reformulation by means of formal logics. In: Schreier, G., Hayn, D., Ammenwerth, E. (eds.) eHealth2009 – Health Informatics meets eHealth – von der Wissenschaft zur Anwendung und zurück, Tagungsband eHealth2009 & eHealth Benchmarking 2009, pp. 123–128. Österreichische Computer Gesellschaft, Wien (2009) 38. Warner, H.R., Toronto, A.F., Veasey, L.G., Stephenson, R.: A mathematical approach to medical diagnosis. Application to congenital heart disease. J. Am. Medical Assoc. 177(3), 177–183 (1961) 39. Wernecke, K.D.: On the application of discriminant analysis in medical diagnostics. In: Bock, H.H., Lenski, W., Richter, M.M. (eds.) Information Systems and Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 267–279. Springer, Berlin (1994) 40. Yanase, J., Triantaphyllou, E.: A systematic survey of computer-aided diagnosis in medicine: past and present developments. Expert Syst. Appl. 138, 112821 (2019) 41. Yu, V.L., et al.: Antimicrobial selection by a computer. A blinded evaluation by infectious diseases experts. J. Am. Med. Assoc. 242(12), 1279–1282 (1979) 42. Zadeh, L.A.: Outline of a new approach to the analysis of complex systems and decision processes. IEEE Trans. Syst. Man Cybern. SMC-3(1), 28–44 (1973) 43. Zadeh, L.A.: The concept of a linguistic variable and its application to approximate reasoning— I. Inf. Sci. 8(3), 199–249 (1975)
A Combination Model of Robust Principal Component Analysis and Multiple Kernel Learning for Cancer Patient Stratification
Thanh Trung Giang, Thanh-Phuong Nguyen, Quang Trung Pham, and Dang Hung Tran
Abstract In recent years, bioinformatics has contributed significantly to patient stratification, which is crucial for the early detection of cancer. In particular, stratification or classification of patients divides them into subgroups that can be offered effective treatment regimens. However, current methods face two major challenges when analyzing large biomedical datasets to stratify cancer patients. Firstly, the datasets are very large, with a high number of features. Secondly, because the public data are available and heterogeneous, there is a great need to combine multiple data sources into more comprehensive and informative datasets. A variety of methods have been proposed to tackle these challenges, but they usually address one challenge or the other separately, and handling noisy data adds a further difficulty to data integration. In this paper, we propose an efficient model that combines robust principal component analysis-based dimensionality reduction and feature extraction with a classification model based on multiple kernel learning. The proposed method resolves the above-mentioned problems in cancer patient stratification. The model obtained a high accuracy of 92.92% together with significant statistical tests. These results hold great promise for supporting cancer research, diagnosis, and treatment.
T. T. Giang (B) · Q. T. Pham Tay Bac University, Son La, Vietnam e-mail: [email protected] Q. T. Pham e-mail: [email protected] VNU University of Engineering and Technology, Hanoi, Vietnam T.-P. Nguyen Megeno S.A., University of Luxembourg, Luxembourg City, Luxembourg e-mail: [email protected]; [email protected] D. H. Tran Hanoi National University of Education, Hanoi, Vietnam e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_2
1 Introduction
Computational methods have contributed significantly to cancer diagnosis, prognosis, and treatment. Several remarkable works have been proposed to solve various problems in cancer: classifying patients and determining the type or severity of the disease [1, 2], clustering patients and proposing appropriate treatment regimens for cancer patients [3, 4], identifying disease genes that assist in cancer diagnosis and treatment [5, 6], predicting cancer therapeutic drugs [7], predicting the prognosis and survival time of cancer patients [8], etc. In particular, stratification or classification of patients divides them into subgroups that can be offered effective treatment regimens. This crucial step ensures that patients are treated with the right medication at the right time [9–14]. Frohlich et al. proposed a model to predict high or low risk for premenopausal breast cancer patients based on multi-omics analysis [12]. Jang et al. developed a predictive model of gene markers to stratify cancer patients and analyzed survival via The Cancer Genome Atlas (TCGA) [13]. Giang et al. proposed a patient stratification model using fast multiple kernel learning [14].
Research in cancer patient stratification has been performed on diverse molecular biology datasets such as gene expression, DNA methylation, miRNA expression, protein expression, etc. [15]. Integrating those datasets remains challenging. Firstly, the size of the datasets is large; in particular, the number of features (also known as dimensions or variables) may reach thousands or millions, so analysis methods require high computational costs and have high complexity [16]. Secondly, the patient datasets vary in type, each of which provides particularly useful information. Therefore, these types need to be integrated into a unified system that is more consistent and more robust.
Recently, there have been many studies on dimensionality reduction [17–20]. Alshamlan et al. combined minimum redundancy maximum relevance and artificial bee colony methods to identify meaningful genes in cancer classification [18], Taguchi et al. developed a principal component analysis method to predict miRNA/mRNA interactions for six types of cancer [19], and Giang et al. integrated multiple kernel learning with dimensionality reduction to build a fast framework for dimensionality reduction and data integration [20]. Numerous integrative methods have also been proposed to enhance the utilization of multiple data sources: Wang et al. built a similarity network to synthesize data of three types, including gene expression, DNA methylation, and miRNA expression [21], and Liang et al. developed a data integration model based on a deep learning method [22]. Even though these methods solved the two above challenges in some aspects, they have shown disadvantages when applied to molecular biology datasets containing a lot of noise, outliers, or missing data. Besides, each data type uses different formats and measurements; as a result, there are many difficulties in integrating the data.
In this paper, we have proposed a cancer patient stratification model, combining data dimensionality reduction based on robust principal component analysis (RPCA) with a multiple kernel learning classification model based on MKBoost-S2 [25] and weighted multiple kernel learning (wMKL). The model consists of two steps:
(i) apply RPCA to reduce dimensions and extract relevant genes (for the three data types: gene expression, DNA methylation, and miRNA expression); (ii) build three classifiers based on MKBoost-S2 from the three datasets pre-processed by RPCA and integrate them using wMKL. The RPCA method was originally developed from principal component analysis (PCA) and was adjusted to handle datasets with many outliers. We applied the proposed model to four cancer patient datasets: Lung Squamous Cell Carcinoma (LUNG), Glioblastoma Multiforme (GBM), Breast Invasive Carcinoma (BREAST), and Ovarian Serous Cystadenocarcinoma (OV). We then conducted experiments to evaluate the performance of our model. The experimental results show that our method obtained significantly better results than the model running on the original datasets; on the BREAST dataset, accuracy reached 92.92%, and accuracy also increased by 10% on the OV dataset when combining multiple data types.
2 Method
2.1 Robust Principal Component Analysis
The idea of the principal component analysis (PCA) method is to reduce the dimension of a dataset with a large number of interrelated variables while retaining most of the variation in the dataset. The reduction is achieved by transforming the feature representation from the old feature space to a new one. The new feature space consists of a new set of variables (the principal components), which are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables [23]. PCA derives from the idea of decomposing the original data matrix O ∈ R^{m×n} into the sum of two matrices, O = L + N, given a low-rank matrix L (containing most of the information, i.e., the principal components) and a noise matrix N. N can be eliminated with minimal loss of information. The PCA method represents the data in a new space by finding an orthogonal coordinate system. The PCA problem can easily be solved using Singular Value Decomposition (SVD). However, SVD is very sensitive to outliers; hence, PCA is also ineffective on datasets containing corrupted, missing, or outlier data.
The Robust PCA (RPCA) method is likewise derived from the idea of PCA, with the matrix decomposition O = L + S. However, while PCA keeps N small to minimize the loss of information, RPCA defines S as a sparse matrix whose elements may have arbitrarily large values. Thus, RPCA is well suited to noisy datasets [24]. RPCA solves the problem of finding the low-rank matrix L and the sparse matrix S such that O = L + S (Fig. 1). The RPCA problem is transformed into an optimization problem based on the ℓ0-norm (the ℓ0-norm is a pseudo-norm; the ℓ0-norm of a vector is the number of its non-zero components, and it is used in sparsity-constrained problems) as follows:
Fig. 1 Robust principal component analysis illustration
minimize  rank(L) + λ‖S‖₀    subject to  O = L + S                    (1)
where λ > 0 is the Lagrange factor. Problem (1) is NP-hard. The decomposition into low-rank and sparse matrices is relaxed following the Principal Component Pursuit approach [24], and then the Accelerated Proximal Gradient [25] or the Augmented Lagrange Multipliers [26] method is used to solve the RPCA optimization problem. RPCA overcomes the limitation of PCA and has been successfully applied in many areas such as machine vision [24], image alignment [27], subspace recovery [28], and clustering [29]. We applied RPCA to pre-process several molecular biology datasets and then to extract relevant features (see Sect. 2.3).
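For concreteness, the sketch below shows one common way to solve the RPCA problem with the inexact Augmented Lagrange Multipliers method (singular value thresholding for L, soft thresholding for S). It is a minimal illustration rather than the implementation used in this chapter; the parameter choices (λ = 1/√max(m, n), μ, ρ) follow standard practice in the RPCA literature.

```python
import numpy as np

def shrink(M, tau):
    """Element-wise soft-thresholding (shrinkage) operator."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_threshold(M, tau):
    """Singular value thresholding: soft-threshold the singular values of M."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(shrink(sigma, tau)) @ Vt

def rpca_ialm(O, lam=None, tol=1e-7, max_iter=500):
    """Decompose O into a low-rank L and a sparse S with O ~= L + S."""
    m, n = O.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    norm_O = np.linalg.norm(O, 'fro')
    S = np.zeros_like(O)
    Y = O / max(np.linalg.norm(O, 2), np.abs(O).max() / lam)   # dual variable
    mu, rho = 1.25 / np.linalg.norm(O, 2), 1.5
    for _ in range(max_iter):
        L = svd_threshold(O - S + Y / mu, 1.0 / mu)            # low-rank update
        S = shrink(O - L + Y / mu, lam / mu)                   # sparse update
        Z = O - L - S                                          # residual
        Y = Y + mu * Z                                         # dual ascent
        mu = rho * mu
        if np.linalg.norm(Z, 'fro') / norm_O < tol:
            break
    return L, S
```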
2.2 Cancer Patient Stratification Model
Our proposed model combines data dimensionality reduction and multiple kernel learning to extract the most relevant features for classifying cancer patients. Figure 2 illustrates our cancer stratification model, which consists of two steps. Step 1 reduces the data dimensions and extracts relevant features based on RPCA from the original datasets (see Sect. 2.3 for details). Step 2 develops the MKBoost-S2-based classifiers and integrates them into a unified classifier by the wMKL method (see Sect. 2.4 for details). The inputs are the cancer patient datasets, specifically gene expression, DNA methylation, and miRNA expression. The outputs are the different subgroups of the cancer patients. The details of the two steps are presented in Sects. 2.3 and 2.4.
Fig. 2 Cancer patient stratifying model
2.3 Dimensionality Reduction and Feature Extraction Based on RPCA
Most RPCA studies based on the O = L + S matrix decomposition keep the low-rank matrix L, which contains highly similar features in a low dimension, and remove the matrix S. However, in many datasets the sparse matrix S (with the outlier data), which consists of differential features, is better for classifying data than the low-rank matrix [30]. Reducing dimensionality using RPCA is therefore suitable for cancer patient data: although the expression levels of thousands of genes are measured simultaneously, only a small number of genes are relevant to cancer, and these are essential for cancer diagnosis and treatment. We regard the most similar genes as the low-rank matrix L and the differential genes as the sparse matrix S. From this hypothesis, we applied RPCA in the data preprocessing step to obtain two matrices, the low-rank matrix (containing similarly expressed genes) and the sparse matrix (containing differentially expressed genes).
Fig. 3 Gene expression preprocessing model based on RPCA
We later used the processed data in the sparse matrix, rearranging it and removing non-relevant features to create an input data matrix for the next processing steps. The RPCA-based data decomposition model for the gene expression data type (and similarly for DNA methylation and miRNA expression) is described as follows.
- Step 1. Decompose the original data matrix. Figure 3 illustrates the decomposition model for gene expression data based on RPCA. O is the original observation data matrix representing the gene expression dataset, L is the low-rank matrix representing the similar genes, and S is the sparse matrix representing the differential genes. Each row of a matrix corresponds to the transcription level of a gene, and each column is a sample. White and grey blocks denote zero and near-zero values, while black blocks refer to differential values of genes. As shown in Fig. 3, the matrix S of differentially expressed genes (the black blocks) can be recovered from the matrix O of gene expression data by RPCA.
- Step 2. Sort genes based on their values. Each row of the matrix S represents the transcriptional response of a gene across the observed samples, and each column of S represents the expression levels of the m genes in one sample. The matrix S is presented as follows:
S = [ s_{11}  s_{12}  ···  s_{1n}
      s_{21}  s_{22}  ···  s_{2n}
        ⋮       ⋮      ⋱     ⋮
      s_{m1}  s_{m2}  ···  s_{mn} ]
The values of the elements of S can be positive or negative, reflecting whether the expression of the corresponding gene is adjusted up or down. To discover the differentially
expressed genes, only the absolute values of the entries in S are taken into account. The following two steps were performed: (1) calculate the absolute values of the entries in the sparse matrix S; (2) sum each row of the matrix to get the evaluation vector, as shown in the following formula:

E = [ Σ_{i=1}^{n} |s_{1i}|, · · · , Σ_{i=1}^{n} |s_{mi}| ]
Next, the elements of E were arranged in descending order to obtain a new evaluation vector Ê. Without loss of generality, we assume that the first c1 elements of Ê are non-zero:

Ê = [ ê_1, · · · , ê_{c1}, 0, · · · , 0 ],  with m − c1 trailing zeros.

- Step 3. Extract relevant genes. One guiding principle is that if an element of the evaluation vector is 0, deleting that element does not affect the optimality of the remaining variables; even deleting a non-zero element with a small value does not affect the associations with the remaining variables too much. Based on this principle, the larger the values in Ê are, the greater the differences in gene expression are and the more important the corresponding genes are. Therefore, we selected only the first num1 (num1 < c1) genes, as ordered in Ê, as the input for the classification model in the next step.
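The ranking and selection in Steps 2–3 amount to a few array operations. The sketch below is my own illustration with assumed variable names, not the authors' code; it computes the evaluation vector E from the sparse component S and keeps the top-ranked genes.

```python
import numpy as np

def select_relevant_genes(S, num_genes):
    """S: (m genes x n samples) sparse component returned by RPCA."""
    E = np.abs(S).sum(axis=1)        # evaluation vector: one differential score per gene
    order = np.argsort(E)[::-1]      # gene indices sorted by descending score (E-hat)
    return order[:num_genes]         # indices of the num_genes most differential genes
```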
2.4 Cancer Patient Stratification Model Based on Multiple Kernel Learning
Recent studies have shown that models using different data sources produce better results than those using a single dataset. Moreover, multiple kernel learning (MKL) algorithms with different kernel functions have proven their efficiency in data analysis [31]. We propose a model to stratify cancer patients from various data sources by applying MKL in two steps (Step 2 of Fig. 2). The gene expression, DNA methylation, and miRNA expression datasets are preprocessed to extract relevant features with the model proposed in Sect. 2.3. The preprocessed datasets are then the input of the MKL model in the later step. In this model, MKL is applied in two steps as follows. Firstly, for each dataset, MKBoost-S2 [25] is used to create a hybrid kernel corresponding to the classifier achieving the best accuracy from that dataset and a set of kernel functions (details in [25]). Because the polynomial kernel function achieves good global performance and the Gaussian kernel function achieves good local performance, in our study we used 13 base kernels as follows.
• Polynomial kernel function: kPolynomial(x, y) = (xᵀy + 1)^d, where xᵀ is the transpose of x and d is the degree of the polynomial. In this paper, we use three values of d, namely d ∈ {1, 2, 3}, to build three kernel functions for MKBoost-S2.
• Gaussian kernel function: kGaussian(x, y) = exp( −‖x − y‖² / (2σ²) ). We built 10 Gaussian kernel functions with σ ∈ {2⁻⁴, 2⁻³, ..., 2⁴, 2⁵} for MKBoost-S2.
MKBoost-S2 thus uses 13 kernel functions to build a classifier. Since there are three data types, we obtain three classifiers, namely CGE, CDNA, and CRNA. Secondly, we use weighted multiple kernel learning (wMKL) to combine the three classifiers CGE, CDNA, and CRNA into a unified classifier (denoted CC) as follows:

CC = Σ_{i=1}^{3} λ_i C_i
where C_i denotes the classifiers CGE, CDNA, and CRNA, respectively, and λ_i is the corresponding weight of each classifier. In other words, the combined classifier is built as the weighted sum of the three component classifiers. In this study, we used each component classifier's accuracy as its weight.
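As a concrete illustration of the two MKL steps, the sketch below builds the 13 base Gram matrices (three polynomial, ten Gaussian) and combines the decision scores of three fitted per-data-type classifiers with accuracy-derived weights. This is a simplified stand-in: the MKBoost-S2 boosting procedure itself is not reproduced here, and the function and variable names are my own.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

def base_kernels(X, Y=None):
    """Return the 13 base Gram matrices: polynomial with d = 1, 2, 3 and Gaussian
    with sigma = 2^-4 ... 2^5 (rbf_kernel's gamma corresponds to 1 / (2 sigma^2))."""
    grams = [polynomial_kernel(X, Y, degree=d, gamma=1.0, coef0=1.0) for d in (1, 2, 3)]
    grams += [rbf_kernel(X, Y, gamma=1.0 / (2.0 * (2.0 ** p) ** 2)) for p in range(-4, 6)]
    return grams

def wmkl_predict(classifiers, accuracies, test_sets):
    """Weighted combination C_C = sum_i lambda_i C_i, with lambda_i taken from each
    component classifier's accuracy; classifiers must expose decision_function."""
    lam = np.asarray(accuracies, dtype=float)
    lam = lam / lam.sum()
    score = sum(l * clf.decision_function(X) for l, clf, X
                in zip(lam, classifiers, test_sets))
    return (score > 0).astype(int)
```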
3 Materials and Experiment
3.1 Materials
We investigated different cancer datasets from The Cancer Genome Atlas—TCGA 2020.¹ The classification attribute was the patients' Dead state. The datasets were Lung Squamous Cell Carcinoma (LUNG), Glioblastoma Multiforme (GBM), Breast Invasive Carcinoma (BREAST), and Ovarian Serous Cystadenocarcinoma (OV). The LUNG dataset includes 106 patients (42 dead), the GBM dataset 275 patients (73 dead), the BREAST dataset 435 patients (75 dead), and the OV dataset 541 patients (258 dead).
1 https://www.cancer.gov.
Table 1 The Cancer dataset

                                       # features
Cancer     # samples   Alive/Dead      Gene expression   DNA methylation   miRNA expression
LUNG       106         42/64           12,042            23,074            352
GBM        275         202/73          12,042            22,896            534
BREAST     435         360/75          12,042            24,978            354
OV         541         283/258         12,042            21,825            799
For each cancer patient dataset, we used three related data types, including gene expression, DNA methylation, and miRNA expression. Details of the data are shown in Table 1.
3.2 Experiment
Firstly, each dataset (gene expression, DNA methylation, miRNA expression) was represented as a matrix O, in which each row corresponds to a feature and each column to an observed sample (a patient). We applied the RPCA-based feature extraction model to each dataset and obtained three preprocessed data matrices. Next, we applied MKBoost-S2 to create the CGE, CDNA, and CRNA classifiers from the corresponding datasets, using the 13 base kernel functions for each dataset to increase the performance of the MKBoost-S2 method. wMKL was then employed to combine the three classifiers CGE, CDNA, and CRNA into a unified classifier CC (denoted the 3-combination classifier). To evaluate the performance of the integrative model, we also carried out experiments on the different combinations of two classifiers, specifically the CGE-DNA, CGE-RNA, and CDNA-RNA classifiers (denoted 2-combination classifiers).
To evaluate the effectiveness of our proposed model, we assessed the performance of the classifiers in terms of accuracy and the ROC curve. We compared the performance of our RPCA-based model with that of the models running on the original datasets (without any dimensionality reduction) and with the fMKL-DR method [14]. We performed two comparisons to evaluate the accuracy of our proposed model. Firstly, the accuracy of the classifier on the dataset preprocessed by RPCA was compared with the accuracy obtained on the original dataset. Secondly, we compared the accuracy and AUC of the proposed model using only 2 of the 3 datasets with those obtained on all 3 datasets. We ran the experiment 20 times; at each run, we randomly took 2/3 of the dataset to train the model and used the remaining 1/3 for testing. The average accuracy over the 20 runs is the final accuracy of the classification model.
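The evaluation protocol can be summarized in a few lines; the sketch below (assumed names, with any classifier standing in for the full RPCA + MKL pipeline) averages the test accuracy over 20 random 2/3–1/3 splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def repeated_split_accuracy(X, y, fit_predict, n_runs=20, seed=0):
    """fit_predict(X_train, y_train, X_test) -> predicted labels for X_test."""
    accs = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1/3, random_state=seed + run)
        y_hat = fit_predict(X_tr, y_tr, X_te)
        accs.append(np.mean(np.asarray(y_hat) == y_te))
    return float(np.mean(accs))       # final reported accuracy
```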
Table 2 Accuracy of classifier based on the original dataset and the pre-processing dataset by RPCA

                        Accuracy (%)
                        Original dataset                                       Pre-processing dataset by RPCA
Cancer      # samples   Gene expression   DNA methylation   miRNA expression   CGE      CDNA     CRNA
LUNG        106         61.88             64.85             70.94              64.68    67.81    71.41
GBM         275         74.22             75.28             76.39              80.83    76.39    80.88
BREAST      435         88.10             88.03             91.48              90.14    90.1     91.51
OV          541         59.22             58.22             54.92              68.72    67.61    66.50
Table 3 Accuracy of classifiers based on two or three component classifiers

                        Accuracy (%)
Cancer      # samples   CGE-DNA   CGE-RNA   CDNA-RNA   CC       C∗C
LUNG        106         69.22     72.66     72.35      77.35    76.65
GBM         275         81.67     82.72     81.61      85.23    84.80
BREAST      435         91.44     91.43     92.17      92.92    92.73
OV          541         69.25     68.70     67.64      69.80    69.56
4 Results and Discussion
Table 2 shows the accuracy of the classifiers on the original datasets and on the datasets preprocessed by the proposed method. On most of the cancer patient datasets, the accuracy of the proposed method was significantly higher than that on the original dataset. In particular, for the GBM and OV diseases, the accuracy on all three data types was considerably higher, with the largest increase from 54.92% to 66.5% on the miRNA expression dataset for the OV disease. This shows that our RPCA-based dimensionality reduction and relevant feature extraction model stratified cancer patients better than all of the classifiers on the original datasets.
Table 3 shows the accuracy of the classifiers built on two or three data types with wMKL. The results show that the 2-combination classifiers achieved better accuracy than the classifiers on a single data type. For each type of cancer, the combinations of classifiers have different impacts; for example, for the LUNG disease, the combination of CGE and CRNA or of CDNA and CRNA has better accuracy (up to 72.66%) compared with 69.22% for the combination of CGE and CDNA, whereas for the OV disease, the combination of CGE and CDNA produced the best accuracy with 69.25%. Although the 2-combination classifiers are better than the single ones, the best results are obtained by the 3-combination classifiers for all types of cancer. Noticeably, for the GBM and BREAST diseases, accuracy increased to 85.23% and 84.80%, respectively. These results are better than the results of fMKL-DR [14] on the GBM (81.11%) and OV (62.22%) datasets. The C∗C column in Table 3 presents the statistical hypothesis test with a confidence level of 95%; the findings show that our classification model is statistically significant.
Table 4 AUC of classifiers

                        AUC
Cancer      # samples   CGE-DNA   CGE-RNA   CDNA-RNA   CC
LUNG        106         0.7324    0.7093    0.7225     0.8135
GBM         275         0.7383    0.7066    0.7251     0.7683
BREAST      435         0.7241    0.7624    0.7498     0.7925
OV          541         0.6217    0.6255    0.6132     0.6746
Fig. 4 The ROC curve of classifiers on four cancer diseases
Table 4 shows the AUC of the classification models, and Fig. 4 illustrates the ROC curves. The results show that the 3-combination classifier CC returned better AUC than the 2-combination classifiers. In particular, for the LUNG disease, the AUC of CC is 0.8135, which is much higher than that of the other classifiers, while the largest value among the
2-combination classifiers is 0.7324. For the other cancers, similarly, the largest AUC is achieved when combining the three data types. Compared to fMKL-DR [14], our model achieved significantly better AUC values on all four datasets. This result shows that all three data types are important and that the integrative analysis produces better results when they are combined.
5 Conclusion
In this paper, we have proposed a cancer patient stratification model that consists of dimensionality reduction and feature extraction based on RPCA, and multiple kernel learning using a combination of MKBoost-S2 and wMKL. Experimental results showed that our proposed model efficiently pre-processed the cancer patient data. Moreover, our method can easily be reproduced for other cancer datasets and other disease datasets. Integrating multiple classifiers returned higher accuracy and AUC. Our method is beneficial for stratifying cancer patients.
Acknowledgments The results published here are in whole or part based upon data generated by the TCGA Research Network (https://www.cancer.gov/tcga). This research was supported by the Vietnam Ministry of Education and Training, Project No. B2021-SPH-01.
References 1. Soh, K.P., Szczurek, E., Sakoparnig, T., Beerenwinkel, N.: Predicting cancer type from tumour DNA signatures. Genome Med. 9(1), 1–11 (2017) 2. Couture, H.D., et al.: Image analysis with deep learning to predict breast cancer grade, ER status, histologic subtype, and intrinsic subtype. NPJ Breast Cancer 4(1), 1–8 (2018) 3. Pekic, S., et al.: Familial cancer clustering in patients with prolactinoma. Hormones Cancer 10(1), 45–50 (2019) 4. Hussain, F., Saeed, U., Muhammad, G., Islam, N., Sheikh, G.S.: Classifying cancer patients based on DNA sequences using machine learning. J. Med. Imag. Health Inform. 9(3), 436–443 (2019) 5. Gkountela, S., et al.: Circulating tumor cell clustering shapes DNA methylation to enable metastasis seeding. Cell 176(1–2), 98–112 (2019) 6. Speicher, N.K., Pfeifer, N.: Integrating different data types by regularized unsupervised multiple kernel learning with application to cancer subtype discovery. Bioinformatics 31(12), i268–i275 (2015) 7. Li, K., Du, Y., Li, L., Wei, D.Q.: Bioinformatics approaches for anti-cancer drug discovery. Curr. Drug Targets 21(1), 3–17 (2020) 8. Bashiri, A., Ghazisaeedi, M., Safdari, R., Shahmoradi, L., Ehtesham, H.: Improving the prediction of survival in cancer patients by using machine learning techniques: experience of gene expression data: a narrative review. Iran. J. Public Health 46(2), 165–172 (2017) 9. Sorbye, H., Köhne, C.H., Sargent, D.J., Glimelius, B.: Patient characteristics and stratification in medical treatment studies for metastatic colorectal cancer: a proposal for standardization of patient characteristic reporting and stratification. Ann. Oncol. 18(10), 1666–1672 (2007)
10. Chand, M., et al.: Novel biomarkers for patient stratification in colorectal cancer: a review of definitions, emerging concepts, and data. World J. Gastrointestinal Oncol. 10(7), 145–158 (2018) 11. Kalinin, A.A., et al.: Deep learning in pharmacogenomics: from gene regulation to patient stratification. Pharmacogenomics 19(7), 629–650 (2018) 12. Fröhlich, H., Patjoshi, S., Yeghiazaryan, K., Kehrer, C., Kuhn, W., Golubnitschaja, O.: Premenopausal breast cancer: potential clinical utility of a multi-omics based machine learning approach for patient stratification. EPMA J. 9(2), 175–186 (2018) 13. Jang, Y., Seo, J., Jang, I., Lee, B., Kim, S., Lee, S.: CaPSSA: visual evaluation of cancer biomarker genes for patient stratification and survival analysis using mutation and expression data. Bioinformatics 35(24), 5341–5343 (2019) 14. Giang, T.T., Nguyen, T.P., Tran, D.H.: Stratifying patients using fast multiple kernel learning framework: case studies of Alzheimer’s disease and cancers. BMC Med. Inform. Decis. Mak. 20(1), 1–15 (2020) 15. Pavlopoulou, A., Spandidos, D.A., Michalopoulos, I.: Human cancer databases. Oncol. Rep. 33(1), 3–18 (2014) 16. Fan, J., Li, R.: Statistical challenges with high dimensionality: feature selection in knowledge discovery, vol. 2006, Article ID 0602133, pp. 1–27, arXiv preprint math/0602133 (2006) 17. Hira, Z.M., Gillies, D.F.: A review of feature selection and feature extraction methods applied on microarray data. Adv. Bioinform. 2015, 1–13 (2015). Article ID 198363 18. Alshamlan, H., Badr, G., Alohali, Y.: mRMR-ABC: a hybrid gene selection algorithm for cancer classification using microarray gene expression profiling. Biomed. Res. Int. 2015, 1–15 (2015). Article ID 604910 19. Taguchi, Y.H.: Identification of more feasible microRNA-mRNA interactions within multiple cancers using principal component analysis based unsupervised feature extraction. Int. J. Mol. Sci. 17(5), 1–12 (2016) 20. Giang, T.T., Nguyen, T.P., Nguyen, T.Q.V., Tran, D.H.: fMKL-DR: a fast multiple kernel learning framework with dimensionality reduction. In: International Symposium on Integrated Uncertainty in Knowledge Modelling and Decision Making, vol. 10758, pp. 153–165. Springer, Cham (2018) 21. Wang, B., et al.: Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11(3), 333–337 (2014) 22. Liang, M., Li, Z., Chen, T., Zeng, J.: Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach. IEEE/ACM Trans. Comput. Biol. Bioinf. 12(4), 928–937 (2014) 23. Jolliffe, I.: Principal Component Analysis. Springer, New York (1986) 24. Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3), 11–48 (2011) 25. Xia, H., Hoi, S.C.: Mkboost: a framework of multiple kernel boosting. IEEE Trans. Knowl. Data Eng. 25(7), 1574–1586 (2013) 26. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006) 27. Peng, Y., Ganesh, A., Wright, J., Xu, W., Ma, Y.: RASL: Robust alignment by sparse and lowrank decomposition for linearly correlated images. IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2233–2246 (2012) 28. Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., Ma, Y.: Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 171–184 (2013) 29. Shahid, N., Kalofolias, V., Bresson, X., Bronstein, M., Vandergheynst, P.: Robust principal component analysis on graphs. In: Proceedings of the ICCV, pp. 2812–2820 (2015) 30. 
Chen, M., Ganesh, A., Lin, Z., Ma, Y., Wright, J., Wu, L.: Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix. Coordinated Science Laboratory Report, No. UILU-ENG-09-2214, pp. 1–18 (2009) 31. Gönen, M., Alpaydın, E.: Multiple kernel learning algorithms. J. Mach. Learn. Res. 12, 2211– 2268 (2011)
Attention U-Net with Active Contour Based Hybrid Loss for Brain Tumor Segmentation Dang-Tien Nguyen, Thi-Thao Tran, and Van-Truong Pham
Abstract Brain tumor (BT) segmentation from brain magnetic resonance imaging (MRI) plays an important role in diagnosis and treatment planning for patients. In this study, we propose a new approach for brain tumor segmentation based on deep neural networks. The paper proposes to use the Attention U-Net architecture, which can handle shape variability thanks to its attention gates, for brain tumor segmentation from MRI images. In particular, instead of using the cross-entropy loss function, the Dice coefficient loss function, or both, we propose a new loss function based on an active contour loss, which is known to overcome the limitation of pixel-wise fitting of the segmentation map in previously used loss functions, to train the network. We evaluated and compared our approach and other approaches on a dataset of nearly 4000 brain MRI scans. Experiments demonstrate that the proposed method outperforms state-of-the-art methods in terms of the Dice coefficient and Jaccard index. Keywords Brain tumor segmentation · Active contour model · Attention U-Net · Attention gate · U-Net
1 Introduction
Brain tumor is one of the most fatal cancers [1]. Even under treatment, patient survival does not extend beyond 14 months after diagnosis [2]. The World Health Organization (WHO) classifies brain tumors into four grades, I–IV, with increasing aggressiveness [3]. Grade I and II tumors may be considered semi-malignant tumors that carry a better prognosis, while grade III and IV tumors are malignant tumors that almost certainly lead to a patient's death [4]. Therefore, segmentation of the tumor is a crucial step in determining survival and treatment plans.
D.-T. Nguyen · T.-T. Tran · V.-T. Pham (B) School of Electrical Engineering, Hanoi University of Science and Technology, No. 1 Dai Co Viet, Hanoi, Vietnam e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_3
MRI is a non-invasive technique that provides good soft-tissue contrast and is the standard technique for brain tumor diagnosis. Segmenting brain tumors from MRI images plays an important role in monitoring tumor growth or shrinkage during therapy and in measuring tumor volume. It also plays a vital role in surgical planning [5] and radiotherapy planning [4]. The most reliable way to segment tumors is manual segmentation; however, it has certain limitations, such as being time-consuming and yielding results with high inter-rater variance [2]. Therefore, automatic BT segmentation from MRI might help doctors improve their efficiency in diagnosis and treatment planning.
Nowadays, there are various automatic methods for BT segmentation, such as level sets [6], active contours [7, 8], and machine learning-based methods [9–11]. Among them, machine learning-based methods are a promising approach because they do not require prior knowledge of anatomy; they use imaging features extracted from MRI rather than the original MRI data for segmentation, and the extraction and selection of brain tumor features are automated during model training. In recent years, with the development of deep learning in general and convolutional neural networks in particular, automatic brain segmentation has matured to a level that approaches the performance of a skilled radiologist [12]. In image segmentation in particular, the Fully Convolutional Network (FCN) developed by Long et al. [13] has attracted many researchers in medical image segmentation [14, 15]. One of the methods that improves on the elegant FCN architecture is U-Net, proposed by Ronneberger et al. [16]. The U-Net model uses the concept of deconvolution introduced in [17] and is one of the most well-known structures for medical image segmentation; it has also been used for the segmentation of brain tumors [11, 18].
In this study, we propose another approach for brain tumor segmentation based on the Attention U-Net [19]. The Attention U-Net architecture is based on U-Net but is improved through the use of attention gates to handle shape variation, which is an issue in brain tumors. In addition, we also propose a new loss function based on the active contour loss [20]. The advantage of this loss function is that it can overcome the limitation of pixel-wise fitting of the segmentation map in previously used loss functions, such as the cross-entropy and Dice loss functions [20].
The remainder of this paper is organized as follows: In Sect. 2, the proposed approach is described in detail. In Sect. 3, some experimental results are presented, including a comparison with state-of-the-art methods. Finally, we conclude this work and discuss future applications in Sect. 4.
2 Materials and Methods
The Attention U-Net proposed by Oktay et al. [19] is presented in Fig. 1. The Attention U-Net is an improved version of the U-Net architecture, which has been shown to be applicable to multiple medical image segmentation problems. The Attention U-Net architecture includes three parts: the encoder, the decoder, and the attention mechanism. The encoder part is composed of convolutional and max-pooling layers.
Fig. 1 Basic structure of the U-Net with Attention Gate (AG), adapted from [18]
Fig. 2 Schematic of the Attention Gate [18]
The decoder part consists of the aggregation of the intermediate encoder features, upsampling, and convolutional layers. In addition, skip connections are used to recover fine-grained features that may be lost in the downsampling process. The attention mechanism aims at finding the most important information before propagating it to the decoder path; thus, it can help improve the performance of the network. The structure of the attention gate is shown in Fig. 2. Denote by x^l the feature map at the output of layer l and by g the gating signal vector, and let α ∈ [0, 1] be the attention coefficient. The output of the attention gate structure is the element-wise multiplication of x^l and α, expressed as

x^l_out = x^l · α                                                        (1)
The attention coefficient is computed based on parameters including the linear transformations W_x, W_g, ψ and the biases b_g, b_ψ as:

α_i = σ2( ψᵀ σ1( W_xᵀ x^l + W_gᵀ g_i + b_g ) + b_ψ )                    (2)
where σ1 is the Rectified Linear Unit activation function, and σ2 is the sigmoid activation function, defined as:

σ1(x) = max(0, x);   σ2(x) = 1 / (1 + e^{−x})
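A compact way to read Eqs. (1)–(2) is as a small gating sub-network. The Keras sketch below is my own illustration (not the authors' code) and assumes, for simplicity, that the gating signal g has already been resized to the same spatial resolution as the skip feature map x^l.

```python
import tensorflow as tf
from tensorflow.keras import layers

def attention_gate(x, g, inter_channels):
    """x: skip feature map x^l; g: gating signal (same H x W as x is assumed)."""
    theta_x = layers.Conv2D(inter_channels, 1, use_bias=False)(x)   # W_x^T x^l
    phi_g = layers.Conv2D(inter_channels, 1, use_bias=True)(g)      # W_g^T g + b_g
    f = layers.Activation('relu')(layers.Add()([theta_x, phi_g]))   # sigma_1(.)
    psi = layers.Conv2D(1, 1, use_bias=True)(f)                     # psi^T(.) + b_psi
    alpha = layers.Activation('sigmoid')(psi)                       # sigma_2(.) -> attention map
    return x * alpha   # Eq. (1): x^l_out = x^l * alpha, alpha broadcast over channels
```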
3 The Proposed Approach
3.1 The Pipeline of the Proposed Approach
In this study, we propose an approach for brain tumor segmentation from MRI images. The pipeline of the proposed approach is shown in Fig. 3. The training data, including the training and validation sets, are fed to the Attention U-Net architecture. During the training process, the images in the validation set are predicted and evaluated against their references (ground truths), with the Dice score used as the evaluation metric.
Fig. 3 General pipeline of the proposed approach for brain tumor segmentation
After the training process, the weights with the best performance on the validation set are chosen for the prediction of the test images. The test set is then predicted using the chosen weights, and the output predictions are compared with their ground truths to evaluate the segmentation approach.
3.2 Loss Function
For image segmentation using neural networks, the binary cross-entropy (BCE) loss or its combination with the Dice loss is commonly used. In this study, to take into consideration the contour length as well as region fitting information, we propose to incorporate an active contour (AC) loss term into the BCE function. The AC loss is inspired by the active contour model [21], which evolves a contour towards the desired object boundary. Let u ∈ [0,1] be the prediction and v ∈ [0,1] the reference segmentation (ground truth). To train the network, we propose the following loss function:

Loss = γBCE · LBCE + γAC · LAC                                          (3)

where LBCE is the binary cross-entropy loss, LAC is the active contour loss, and γBCE and γAC are respectively the weighting parameters for the binary cross-entropy and active contour loss terms. The BCE loss term is defined as:

LBCE = −(1 / (M × N)) Σ_{i=1}^{M} Σ_{j=1}^{N} [ v_{i,j} log u_{i,j} + (1 − v_{i,j}) log(1 − u_{i,j}) ]      (4)
where M and N are respectively the width and height of the image. The active contour loss in this study is modified from the approach in [20] as

LAC = (1 / (M × N)) (μ · Length + λ · Region)                           (5)

where μ and λ are the weighting parameters of the Length and Region terms of the AC loss, defined as follows. The Length term is defined as:

Length = Σ_{i=1}^{M} Σ_{j=1}^{N} sqrt( (∇u^x_{i,j})² + (∇u^y_{i,j})² + ε )        (6)

where x and y denote the horizontal and vertical directions, respectively, ∇u^x_{i,j} and ∇u^y_{i,j} are the gradients of the prediction u at pixel (i, j) in those directions, and ε > 0 is a small parameter that keeps the argument of the square root away from zero.
The Region term is calculated as follows:

Region = Σ_{i=1}^{M} Σ_{j=1}^{N} u_{i,j} (v_{i,j} − c1)² + Σ_{i=1}^{M} Σ_{j=1}^{N} (1 − u_{i,j}) (v_{i,j} − c2)²        (7)

where c1 and c2 are the means of the regions inside and outside the segmenting contour C, defined as:

c1 = ( Σ_{i=1}^{M} Σ_{j=1}^{N} v_{i,j} u_{i,j} ) / ( Σ_{i=1}^{M} Σ_{j=1}^{N} u_{i,j} ),
c2 = ( Σ_{i=1}^{M} Σ_{j=1}^{N} v_{i,j} (1 − u_{i,j}) ) / ( Σ_{i=1}^{M} Σ_{j=1}^{N} (1 − u_{i,j}) )        (8)
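The sketch below is one possible TensorFlow implementation of the hybrid loss in Eqs. (3)–(8), written from the formulas above rather than taken from the authors' code; the tensor layout (batch, height, width, 1) and the finite-difference approximation of the gradients are my assumptions.

```python
import tensorflow as tf

def active_contour_loss(v_true, u_pred, mu=1.0, lam=3.0, eps=1e-8):
    """Eqs. (5)-(8): u_pred, v_true are (batch, H, W, 1) tensors with values in [0, 1]."""
    # Length term (Eq. 6): finite differences of u as the gradient components
    du_x = u_pred[:, 1:, :-1, :] - u_pred[:, :-1, :-1, :]
    du_y = u_pred[:, :-1, 1:, :] - u_pred[:, :-1, :-1, :]
    length = tf.reduce_sum(tf.sqrt(du_x ** 2 + du_y ** 2 + eps))
    # Region term (Eqs. 7-8): c1, c2 are means of v inside/outside the predicted region
    c1 = tf.reduce_sum(v_true * u_pred) / (tf.reduce_sum(u_pred) + eps)
    c2 = tf.reduce_sum(v_true * (1.0 - u_pred)) / (tf.reduce_sum(1.0 - u_pred) + eps)
    region = (tf.reduce_sum(u_pred * (v_true - c1) ** 2)
              + tf.reduce_sum((1.0 - u_pred) * (v_true - c2) ** 2))
    n_pixels = tf.cast(tf.size(u_pred), u_pred.dtype)
    return (mu * length + lam * region) / n_pixels          # Eq. (5)

def hybrid_loss(v_true, u_pred, gamma_bce=0.5, gamma_ac=0.5):
    """Eq. (3): weighted sum of binary cross-entropy and the active contour loss."""
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(v_true, u_pred))
    return gamma_bce * bce + gamma_ac * active_contour_loss(v_true, u_pred)
```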
3.3 Data Augmentation and Training
We augment the data for the training process by performing affine transformations such as rotation and vertical and horizontal flipping. Data augmentation can be used to increase the size of the training set and reduce overfitting [22]. In particular, we used random rotation from 0 to 20 degrees, and width shift, height shift, shear, and zoom of up to 5% of the original image. The optimization algorithm used is Adam [23] with hyper-parameters β1 = 0.9 and β2 = 0.999 to find the optimal point of the loss function. The weights of the network are initialized according to the Xavier initialization. For the loss function, the parameters are set as γBCE = 0.5, γAC = 0.5, λ = 3, and μ = 1 for all experiments. The network is trained with a maximum of 200 epochs. During the training process, the network weights at each epoch are evaluated on the validation data, and the weights with the best performance on the validation set are chosen to predict the segmentation maps of the test set.
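In Keras, the augmentation and optimizer settings described above can be expressed roughly as follows; this is a hedged configuration sketch (the authors do not state which framework they used, and the learning rate is left at the library default).

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam

# Affine augmentation: rotation up to 20 degrees; shift, shear and zoom up to 5%; flips
augmenter = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.05,
    height_shift_range=0.05,
    shear_range=0.05,
    zoom_range=0.05,
    horizontal_flip=True,
    vertical_flip=True,
)

# Adam with beta_1 = 0.9, beta_2 = 0.999; Keras's default kernel initializer
# ('glorot_uniform') already corresponds to Xavier initialization.
optimizer = Adam(beta_1=0.9, beta_2=0.999)
```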
4 Evaluations and Results
4.1 Dataset
The proposed approach has been evaluated on a dataset collected from 110 lower-grade glioma (LGG) patients. The dataset is obtained from The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA) [24] and includes 50 Grade II patients, 58 Grade III patients, and 2 patients with unknown tumor grades.
The imaging data are given for T1 pre-contrast, FLAIR, and T1 post-contrast MRI sequences. The manual segmentations and the corresponding tumor masks were generated by Buda et al. [11] using the FLAIR MRI images and are given as Tiff images. The number of slices varies among patients from 20 to 88, with a size of 256 × 256 pixels. The MRI images are RGB images with corresponding grayscale masks. The total number of images from the 110 LGG patients is 3929, of which 1373 contain a tumor and 2556 do not. This dataset was published by Buda et al. [11] and is available for download at [25]. In the current study, we split the dataset in an 80–20 ratio, using the smaller part as the test set and the remainder as the training set. During training, 10 percent of the training data is used as the validation set. All images and masks are resized to a resolution of 128 × 128 pixels for computational efficiency.
4.2 Results
To demonstrate the segmentation results of our approach, we show some representative samples of the results on the test set in Fig. 4. To evaluate the quantitative accuracy of the segmentation results, we compare the automatic segmentations with the ground truths (manual segmentations) using the Dice similarity coefficient (DSC) and the Jaccard similarity coefficient (JAC). The DSC measures the similarity between the automatic and manual segmentations and is calculated as [26]:

DSC = 2 Sam / (Sa + Sm)                                                 (9)
where Sa, Sm, and Sam are, respectively, the automatically segmented region, the manually segmented region, and the intersection between the two regions. The Jaccard similarity coefficient is used to measure the similarity between two sets, defined as:

JAC = Sam / (Sa + Sm − Sam)                                             (10)
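Both metrics reduce to simple set-overlap counts on binary masks; the short sketch below (my own helper, with assumed names) computes them exactly as in Eqs. (9) and (10).

```python
import numpy as np

def dice_and_jaccard(pred, truth):
    """pred, truth: binary masks of the same shape. Returns (DSC, JAC)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    s_am = np.logical_and(pred, truth).sum()     # overlap S_am
    s_a, s_m = pred.sum(), truth.sum()           # automatic S_a and manual S_m areas
    dsc = 2.0 * s_am / (s_a + s_m)
    jac = s_am / (s_a + s_m - s_am)
    return dsc, jac
```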
As can be observed from Fig. 4, there is a good agreement between the results by our approach and the ground truths.
4.3 Evaluation on Other Networks
We now compare the performance of the proposed model with that of other models. To this end, we reproduced SegNet [27] and U-Net [16] to segment all images of the dataset using the proposed loss function.
Fig. 4 Representative segmentation by the proposed approach for the test data. First column: Input images; Second column: Ground truths (References); Third column: Predictions; Last column: The overlap presented by contours between the Ground truths (in green) and predictions (in red)
We then compared the results with those of the Attention U-Net. Representative segmentations by the above networks using the proposed loss function are shown in Fig. 5; by visual inspection, the tumor regions segmented by the Attention U-Net are closest to the ground truths. For a quantitative comparison, Table 1 reports the Dice and Jaccard similarity coefficients between the prediction results of the networks and the ground truths. All networks were trained on the same training set and tested on the same test set. In addition, the training times (in minutes) of the networks are also given in this table. From this table, we observe that the Attention U-Net gives the highest scores for both the DSC and JAC metrics.
4.4 Performance of the Proposed Loss Function
To evaluate the performance of the proposed hybrid loss function, we trained the three networks, SegNet, U-Net, and Attention U-Net, with other common losses: the binary cross-entropy (BCE) and the combination of BCE with the Dice loss (BCE-Dice).
Fig. 5 Representative segmentation for the test data by comparative networks. First column: Input images; Second column: Ground truths; Third column: Predictions by SegNet; Fourth column: Predictions by U-Net; Fifth column: Predictions by Attention U-Net
Table 1 The mean of Dice similarity coefficient (DSC) and Jaccard similarity coefficient (JAC) between the prediction results by networks with ground truths on the test data by the proposed loss function

Method             Training time (min)   DSC     JAC
SegNet             133                   0.862   0.778
U-Net              71.7                  0.882   0.803
Attention U-Net    56.2                  0.890   0.811
The quantitative results, including the DSC and JAC scores obtained with the different loss functions, are given in Table 2. From this table, we can see that, for all networks, training with the proposed loss function yields the highest DSC and JAC scores.
Table 2 The mean of obtained Dice similarity coefficient (DSC) and Jaccard similarity coefficient (JAC) between the prediction results by networks with ground truths on the test data with different loss functions

Method             Loss function   DSC     JAC
SegNet             BCE             0.854   0.760
                   BCE-Dice        0.859   0.762
                   Proposed        0.862   0.778
U-Net              BCE             0.871   0.780
                   BCE-Dice        0.873   0.786
                   Proposed        0.882   0.803
Attention U-Net    BCE             0.873   0.781
                   BCE-Dice        0.875   0.792
                   Proposed        0.890   0.811
5 Conclusion
This study proposed an approach for the automatic segmentation of brain tumors from MRI images. The paper introduced a hybrid loss function inspired by the active contour model. Recent neural networks were trained with the proposed loss function and showed favorable performance compared with other common loss functions. We also found that the Attention U-Net architecture gives better segmentations than other state-of-the-art networks, especially when trained with the proposed loss function. Experiments showed that the proposed approach achieves high segmentation performance on the benchmark MRI brain tumor dataset.
Acknowledgements This research is funded by the Hanoi University of Science and Technology (HUST) under project number T2020-PC-017.
References 1. De Angelis, L.M.: Brain Tumors. N. Engl. J. Med. 344(2), 114–123 (2001) 2. Menze, B.H., et al.: The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans. Med. Imaging 34(10), 1993–2024 (2015) 3. Kleihues, P., Burger, P., Scheithauer, B.: The new WHO classification of brain tumours. Brain Pathol. 3(3), 255–268 (1993) 4. Bauer, S., Wiest, R., Nolte, L.P., Reyes, M.: A survey of MRI-based medical image analysis for brain tumor studies. Phys. Med. Biol. 58(13), R97–129 (2013) 5. Khan, A., Perez, J., Wells, C., Fuentes, O.: Computer vision evidence supporting craniometric alignment of rat brain atlases to streamline expert-guided, first-order migration of hypothalamic spatial datasets related to behavioral control. Front. Syst. Neurosci. 12, 1–29 (2018) 6. Taheri, S., Ong, S.H., Chong, V.F.H.: Level-set segmentation of brain tumors using a threshold-based speed function. Image Vis. Comput. 28(1), 26–37 (2010) 7. Sachdeva, J., Kumar, V., Gupta, I., Khandelwal, N., Ahuja, C.K.: A novel content-based active contour model for brain tumor segmentation. Magn. Reson. Imaging 30(5), 694–715 (2012)
8. Shyu, K.K., Pham, V.T., Tran, T.T., Lee, P.L.: Unsupervised active contours driven by density distance and local fitting energy with applications to medical image segmentation. Mach. Vis. Appl. 23(6), 1159–1175 (2012) 9. Bauer, S., Nolte, L.P., Reyes, M.: fully automatic segmentation of brain tumor images using support vector machine classification in combination with hierarchical conditional random field regularization. In: International Conference on Medical Image Computing and ComputerAssisted Intervention (MICCAI), vol. 20101, pp. 354–361 (2011) 10. Havaei, M., Guizard, N., Larochelle, H., Jodoin, P.: Deep learning trends for focal brain pathology segmentation in MRI. In: Machine Learning for Health Informatics, vol. 125–148 (2016) 11. Buda, M., Saha, A., A. Mazurowski, M.: Association of genomic subtypes of lower-grade gliomas with shape features automatically extracted by a deep learning algorithm. Comp. Bio. Med. 109, 218–225 (2019) 12. Havaei, M., et al.: Brain tumor segmentation with Deep Neural Networks. Med. Image Anal. 35, 18–31 (2017) 13. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440 (2015) 14. Pham, V.T., Tran, T.T., Wang, P.C., Lo, M.T.: Tympanic membrane segmentation in otoscopic images based on fully convolutional network with active contour loss. Signal Image Video Process. (2020). https://doi.org/10.1007/s11760-020-01772-7 15. Ninh, Q.C., Tran, T.T., Tran, T.T., Tran, T.A.X., Pham, V.T.: Skin lesion segmentation based on modification of segnet neural networks. In: Proceedings of the 2019 6th NAFOSTED Conference on Information and Computer Science (NICS), Hanoi, , pp. 575–578 (2020) 16. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention, pp. 234–241 (2015) 17. Zeiler, D.M., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, pp. 818–833 (2014) 18. Chen, W., Liu, B., Peng. S., Sun, J., Qiao, X.: S3D-UNet: separable 3D U-Net for brain tumor segmentation. In: International MICCAI Brainlesion Workshop, pp. 358–368 (2018) 19. Oktay, O., et al.: Attention U-Net: learning where to look for the pancreas. In: Proceedings of the 1st Conference on Medical Imaging with Deep Learning (2018). https://arxiv.org/abs/ 1804.03999 20. Chen, X.,Williams, B. M.,Vallabhaneni, S. R., Czanner, G., Williams, R., Zheng, Y.: Learning active contour models for medical image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11623–11640 (2019) 21. Chan, T., Vese, L.: Active contours without edges. IEEE Trans. Image Process. 10(2), 266–277 (2001) 22. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Neural Information Processing Systems (NIPS) (2012) 23. Kingma, D., Ba, J.: ADAM: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015) 24. TCGA-LGG - the cancer imaging archive (TCIA) public access -cancer imaging. https://wiki. cancerimagingarchive.net/display/Public/TCGA-LGG. Acessed 29 Aug 2020 25. Brain-segmentation-pytorch Kaggle. www.kaggle.com/mateuszbuda/brain-segmentation-pyt orch/data. Accessed 29 Aug 2020 26. 
Lynch, M., Ghita, O., Whelan, P.F.: Segmentation of the left ventricle of the heart in 3-D+t MRI data using an optimized nonrigid temporal model. IEEE Trans. Med. Imaging 27(2), 195–203 (2008) 27. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481– 2495 (2017)
Refining Skip Connections by Fusing Multi-scaled Context in Neural Network for Cardiac MR Image Segmentation Nhu-Toan Nguyen, Minh-Nhat Trinh, Thi-Thao Tran, and Van-Truong Pham
Abstract Applying convolutional neural networks (CNNs) to medical image segmentation has been well established for several years. The classical approach is normally based on an encoder-decoder architecture. The main drawback of the encoder-decoder architecture is that long-range feature dependencies are not preserved as the model goes deeper. One way to overcome this problem is to use complementary layers, called skip layers, from the contracting path. However, using only skip layers that have a similar shape in the contracting path seems to be insufficient and inefficient. Therefore, several skip methods have been released to boost performance, such as UNet++, Mask R-CNN++, etc. In this study, we concentrate on improving the skip-layer method by applying an attention mechanism and multi-scaled context fusion. This approach is able to associate local features with global dependencies and to weight the information between layers, thus reducing unnecessary and noisy information while simultaneously highlighting features that are important for segmentation. We evaluate our proposed method on the 2017 ACDAC dataset. The results show that our model achieves remarkable performance in terms of the Dice coefficient and Jaccard index, which demonstrates the efficiency of our approach in precisely segmenting the target regions in medical images. Keywords Convolutional neural network · Medical image segmentation · U-net · Skip layer · Attention box
1 Introduction
In recent years, the task of segmenting medical images has gained more and more attention [1–3]. Although manual annotation is regarded as the most reliable and authentic method, it requires professional knowledge
N.-T. Nguyen · M.-N. Trinh · T.-T. Tran · V.-T. Pham (B) School of Electrical Engineering, Hanoi University of Science and Technology, No. 1 Dai Co Viet, Hanoi, Vietnam e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_4
and is obviously time-consuming. Therefore, automatic segmentation methods that can handle this task efficiently and reliably are in great demand. Fortunately, the rapid development and powerful performance of convolutional neural networks (CNNs) have made this possible and applicable. Thanks to their capability of extracting non-linear features and improving from data, a great number of studies on medical image segmentation using CNNs have been published and have achieved noticeable performance [1, 4–6]. Among them, the encoder-decoder architecture is normally selected as the optimal method. The basic idea behind the encoder-decoder architecture is that the input image is compressed into a set of high-level features along the contracting path; then, along the expanding path, these features are upsampled to reconstruct a pixel-wise segmentation mask that matches the ground truth. During the expanding path, the upsampled layers are normally combined with some fine-grained layers from the contracting path, called skip layers, which have a similar shape to the respective upsampled layers, to retrieve the fine-grained features of the target objects. Several studies have demonstrated that aggregating skip layers from the contracting path results in better performance, with higher accuracy and less training time [7–9].
However, the compressed high-level features often contain redundant information and lose the global dependencies between pixels as the model goes deeper. In the contracting path, the model tries to highlight the important features of the input image, such as edges or boundaries. By contrast, when upsampling, localization information for the high-level features needs to be provided in order to re-arrange these features correctly. In addition, the usual approach for skip connections in encoder-decoder networks is to aggregate encoder and decoder feature maps of the same scale; however, this does not ensure that these same-scale feature maps work well together and contain sufficient information for upsampling. Moreover, attention mechanisms have been demonstrated to efficiently integrate local and global information in many computer vision tasks, such as object detection [10] and image classification [11, 12]. They help the model pay more attention to the important and relevant features instead of redundant and noisy ones by weighting each layer with appropriate parameters. In segmentation tasks, attention modules have been exploited in several studies and have produced promising results [13–15].
In this study, we leverage the attention mechanism combined with multi-scaled context features to build the skip module, which gathers all the skip layers from different scales to reproduce more informative and representative skip layers for the decoder. Firstly, the input image is fed to the encoder blocks, which contain squeeze-and-excitation (SE) blocks [11] that work like attention blocks to regulate the role of every layer, and which simultaneously dispense the skip layers. These layers then pass through our skip module. Inside this module, the skip layers are upsampled to a size similar to the largest skip layer; they are then concatenated and fed into a small convolution block to refine the number of features. After that, the result is scaled back to the previous sizes and multiplied with the original skip layers. Finally, these fine-tuned skip layers are concatenated with the upsampled layers from the decoder to produce a segmented image matching the ground truth.
The remainder of this paper is organized as follows: In the following section, the proposed approach is described in detail. Next, some experimental results are presented, including a comparison with state-of-the-art methods. Finally, we conclude this work and discuss its future applications.
2 Materials and Methods

2.1 Network Architecture

Fig. 1 General structure of the proposed model (encoder blocks 1–5, the skip module, and decoder blocks 1–5 producing a segmentation that is compared with the ground truth)

The structure of the proposed model is given in Fig. 1. As illustrated in Fig. 1, our proposed model is based on the U-Net architecture with two paths: encoders and decoders. The primary mission of the encoder is to extract informative features from the input image and to dispense the necessary skip layers for the decoder. Initially, the input image $I \in \mathbb{R}^{H \times W \times 1}$ is standardized by removing the mean and scaling to unit variance along the height and width dimensions, and is then fed into the encoder. The encoder is divided into five small downsample blocks, each of which includes a 2D convolution layer followed by a batch normalization layer and a Swish activation [16]. We use the Swish activation, shown in Fig. 2, for several reasons:

Fig. 2 Swish function

• First, it is bounded below, so Swish benefits from sparsity similar to the ReLU activation [17]: very negative inputs are simply zeroed out.
• Second, it is unbounded above, which means that the outputs do not saturate to a maximum value for large inputs.
• Third, it is a smooth curve, so its output landscape is smooth, which directly correlates with the error landscape. A smoother error surface is easier to traverse when searching for the minima.
• Finally, Swish is non-monotonic, meaning that its derivative does not keep a single positive (or negative) sign throughout the entire function.
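For illustration, a minimal NumPy sketch of the Swish activation itself, Swish(x) = x · sigmoid(x); the function name and the sample inputs are ours, not part of the paper:

```python
import numpy as np

def swish(x):
    """Swish activation: x * sigmoid(x). Smooth, non-monotonic,
    bounded below and unbounded above, as described above."""
    return x / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 7)
print(swish(x))   # very negative inputs are pushed towards zero
```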
2.2 The Proposed Skip Module

The inputs of the skip module are the four skip layers dispensed from the four aforementioned downsample blocks, called $S_1, S_2, S_3, S_4$, with $S_i \in \mathbb{R}^{H_i \times W_i \times C_{s_i}}$, $H_i = 2 \times H_{i+1}$, $W_i = 2 \times W_{i+1}$ for $i = 1, \dots, 4$, where $H_i$ and $W_i$ are the height and width of the feature maps. First, $S_1, S_2, S_3, S_4$ are fed into a convolution block to reduce the number of channels to 512, 256, 128, and 64, respectively. After that, $S_2, S_3, S_4$ are upsampled to the same shape as $S_1$; all of these layers then go through a convolution block again to reduce the number of channels to 64 and are concatenated into a "blob" with $4 \times 64$ feature channels. The reason behind this is to add global as well as local information to every skip layer. Once the "blob" has been created, it goes through a sequential series of convolution blocks to produce $B_1, B_2, B_3, B_4$, with $B_i \in \mathbb{R}^{H_i \times W_i \times C_{b_i}}$, $i = 1, \dots, 4$, which have the same sizes as $S_1, S_2, S_3, S_4$. Finally, each pair $S_x, B_x$, with $x = 1, \dots, 4$, is fed into an attention box to form the final skip layers for the decoder, $S_1, S_2, S_3, S_4$, with $S_i \in \mathbb{R}^{H_i \times W_i \times C_{p_i}}$, $i = 1, \dots, 4$. The overall view of the skip module is shown in Fig. 3.
Fig. 3 Skip module
2.3 The Proposed Attention Box

The general structure of the attention box is described in Fig. 4. Applying an attention module before the concatenation allows the network to put more weight on the features of the skip layers that are relevant. It allows the direct connection to focus on a particular part of the input, rather than feeding in every feature. Therefore, the attention distribution is multiplied by the skip-connection feature map to keep only the important parts.
Fig. 4 Attention box: $S_x$ and $B_x$ each pass through a Conv + BatchNorm layer, the results are added and activated with Swish, then passed through another Conv + BatchNorm layer and a Sigmoid, and the resulting attention map multiplies $S_x$
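The following is a minimal Keras sketch of such an attention box, written from the description above and from Fig. 4; the 1×1 kernels, the number of intermediate channels, and the toy input sizes are our assumptions, not values taken from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, Input

def attention_box(s_x, b_x, inter_channels=64):
    """Sketch of the attention box in Fig. 4: weight the skip layer s_x with the context b_x."""
    # Project both inputs with Conv + BatchNorm (1x1 kernels are an assumption).
    theta = layers.BatchNormalization()(layers.Conv2D(inter_channels, 1)(s_x))
    phi = layers.BatchNormalization()(layers.Conv2D(inter_channels, 1)(b_x))
    # Add the projections and apply the Swish activation.
    f = layers.Activation(tf.nn.swish)(layers.Add()([theta, phi]))
    # A further Conv + BatchNorm and a sigmoid give the attention map,
    # with one weight per channel of s_x so the shapes match for the multiply.
    att = layers.BatchNormalization()(layers.Conv2D(int(s_x.shape[-1]), 1)(f))
    att = layers.Activation("sigmoid")(att)
    # The attention map re-weights the original skip layer.
    return layers.Multiply()([s_x, att])

# Toy usage: a 64x64 skip layer S_x and a same-sized context tensor B_x
s_in, b_in = Input((64, 64, 64)), Input((64, 64, 64))
model = Model([s_in, b_in], attention_box(s_in, b_in))
model.summary()
```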
Regarding the decoder path, it includes four upsample blocks whose outputs are $D_1, D_2, D_3, D_4$, with $D_i \in \mathbb{R}^{H_i \times W_i \times C_{d_i}}$, $i = 1, \dots, 4$; the output compared with the ground truth is $O \in \mathbb{R}^{H \times W \times 1}$. $D_i$ and $S_i$ are concatenated into $[D_i, S_i]$, $i = 1, \dots, 4$, and then go through each upsample block to form $D_{i+1}$. Each upsample block contains an SE block followed by a soft residual block, which is modified from the original residual block in ResNet [8] (by substituting the convolution with a depthwise convolution) to lower the number of model parameters and speed up training.
2.4 Performance Metrics

We used the Dice Similarity Coefficient (DSC) and the Jaccard Coefficient (JAC) to evaluate the performance of the network. Let $g_{ic} \in \{0, 1\}$ and $p_{ic} \in [0, 1]$ denote the ground truth and the predicted labels, respectively. The total number of pixels in an image is denoted by $N$. The two-class DSC and JAC variants for class $c$ are expressed in the following equations:

$$DSC_c = \frac{\sum_{i=1}^{N} p_{ic}\, g_{ic} + \varepsilon}{\sum_{i=1}^{N} p_{ic} + \sum_{i=1}^{N} g_{ic} + \varepsilon} \qquad (1)$$

$$JAC_c = \frac{\sum_{i=1}^{N} p_{ic}\, g_{ic} + \varepsilon}{\sum_{i=1}^{N} p_{ic} + \sum_{i=1}^{N} g_{ic} - \sum_{i=1}^{N} p_{ic}\, g_{ic} + \varepsilon} \qquad (2)$$

where $p_{ic}$ is the probability that pixel $i$ belongs to the lesion class $c$ and $p_{i\bar{c}}$ is the probability that pixel $i$ belongs to the non-lesion class $\bar{c}$; the same holds for $g_{ic}$ and $g_{i\bar{c}}$, respectively. $\varepsilon$ is a smoothing factor, often chosen as 1, that ensures numerical stability and prevents division by zero.
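As an illustration, a minimal NumPy sketch of these two metrics; the function and variable names and the example data are ours, not the authors':

```python
import numpy as np

def dsc(p, g, eps=1.0):
    """Soft Dice similarity coefficient, Eq. (1); p and g are flattened arrays."""
    inter = np.sum(p * g)
    return (inter + eps) / (np.sum(p) + np.sum(g) + eps)

def jac(p, g, eps=1.0):
    """Soft Jaccard coefficient, Eq. (2)."""
    inter = np.sum(p * g)
    union = np.sum(p) + np.sum(g) - inter
    return (inter + eps) / (union + eps)

# Example: predicted probabilities vs. a binary ground-truth mask
p = np.array([0.9, 0.8, 0.1, 0.2])
g = np.array([1, 1, 0, 0])
print(dsc(p, g), jac(p, g))
```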
2.5 Training

In order to formulate a loss function that can be minimized, we simply use the loss function [18] known as the soft Dice loss, because we directly use the predicted probabilities instead of thresholding and converting them into a binary mask:

$$DSC_L = 1 - DSC_c \qquad (3)$$
where $DSC_c$ is the Dice coefficient defined in Eq. (1). We use Nadam [19] as the optimizer with an initial learning rate of 1e−3. Additionally, the learning rate is reduced by 50% if the validation loss does not improve for 20 epochs (minimum learning rate 1e−5). Training was limited to 500 epochs but stopped early if the validation loss did not improve for more than 50 epochs.
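A hedged Keras sketch of this training setup; the tiny stand-in model and the random data are placeholders of our own, and only the optimizer, loss, and callback settings come from the text above:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks

def soft_dice_loss(y_true, y_pred, eps=1.0):
    # DSC_L = 1 - DSC_c, computed on predicted probabilities (Eq. 3)
    inter = tf.reduce_sum(y_true * y_pred)
    dsc = (inter + eps) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)
    return 1.0 - dsc

# A minimal stand-in model and random data so the snippet runs end to end;
# the real model is the proposed encoder-decoder network.
model = models.Sequential([
    layers.Conv2D(8, 3, padding="same", activation="relu", input_shape=(64, 64, 1)),
    layers.Conv2D(1, 1, activation="sigmoid"),
])
x = np.random.rand(16, 64, 64, 1).astype("float32")
y = (np.random.rand(16, 64, 64, 1) > 0.5).astype("float32")

model.compile(optimizer=optimizers.Nadam(learning_rate=1e-3), loss=soft_dice_loss)
cbs = [
    # halve the learning rate after 20 epochs without improvement, floor at 1e-5
    callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=20, min_lr=1e-5),
    # stop if the validation loss has not improved for more than 50 epochs
    callbacks.EarlyStopping(monitor="val_loss", patience=50),
]
model.fit(x, y, validation_split=0.2, epochs=500, callbacks=cbs, verbose=0)
```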
3 Results

The evaluation framework was tested on the database released by the "Automatic Cardiac Diagnosis Challenge (ACDC)" workshop held in conjunction with the 20th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), on September 10, 2017, in Quebec City, Canada [20]. The publicly available training dataset includes 100 patient scans, each consisting of a short-axis cine-MRI acquired on 1.5 T and 3 T systems with resolutions ranging from 0.70 × 0.70 mm to 1.92 × 1.92 mm in-plane and 5 mm to 10 mm through-plane. Furthermore, segmentation masks for the myocardium (Myo), the left ventricle (LV), and the right ventricle (RV) are available for the end-diastolic (ED) and end-systolic (ES) phases of each patient. The training database with the provided ground truths is then split into training and test sets with a ratio of 8:2 to evaluate the image segmentation models.
Fig. 5 The learning curves by the proposed approach when segmenting images in the ACDCA database for endocardium (left) and epicardium (right). (a) the loss vs. epochs. (b) the DSC score vs. epochs
The learning curves showing the evolution of the loss and of the segmentation performance on the training and validation data of the proposed network are shown in Fig. 5. The network converges quickly on the ACDCA dataset (after about 25 epochs). The validation DSC score is variable over the training process. The reason is that the validation set contains some images quite different from those in the training set; therefore, during the first learning iterations the model has difficulty segmenting those images. The segmentation results of our proposed method on some test images of the ACDCA database are shown in Fig. 6. The representative images are taken from the basal, mid-cavity, and apical slices. As can be seen from this figure, the endocardium and epicardium contours obtained by the proposed approach are in good agreement with the ground truths, even in the case of the apical slice in the third row of Fig. 6.
4 Comparison with Other Methods

To quantitatively evaluate the performance of the proposed model, we re-implemented several state-of-the-art FCN-based image segmentation architectures, including FCN [1], SegNet [5], and U-Net [4]. We present the average Dice Similarity Coefficient and Jaccard coefficient of each method when segmenting all test images from the database in Table 1. From this table, comparing quantitatively, we can see that the proposed model produces more accurate results than the others.
Fig. 6 Representative segmentation by the proposed approach for the ACDCA data. The endocardial contours are in red, and the epicardial contours are in blue
Table 1 The mean Dice similarity coefficient and Jaccard index obtained by other state-of-the-art models and the proposed model on the ACDCA dataset for both endocardium (Endo) and epicardium (Epi) regions

Method               Dice coefficient        Jaccard index
                     Endo      Epi           Endo      Epi
FCN [1]              0.89      0.92          0.83      0.89
SegNet [5]           0.82      0.89          0.75      0.83
U-Net [4]            0.88      0.92          0.82      0.87
Proposed approach    0.90      0.93          0.85      0.90
5 Conclusion

This paper demonstrates the benefits of combining the proposed attention mechanism and multi-scale context, integrated into the U-Net architecture, for the segmentation problem in cardiac magnetic resonance imaging. The experimental results on the MICCAI Challenge dataset show a substantial gain in semantic segmentation relative to state-of-the-art alternatives. As a general framework, our proposed model is not limited to medical image segmentation applications. In our future work, we plan to investigate the potential of the proposed model for general semantic segmentation tasks.
Acknowledgements This research is funded by the Hanoi University of Science and Technology (HUST) under project number T2020-PC-017.
References

1. Tran, P.V.: A fully convolutional neural network for cardiac segmentation in short-axis MRI. arXiv preprint arXiv:1604.00494 (2016)
2. Shyu, K.K., Pham, V.T., Tran, T.T., Lee, P.L.: Unsupervised active contours driven by density distance and local fitting energy with applications to medical image segmentation. Mach. Vis. Appl. 23(6), 1159–1175 (2012)
3. Tran, T.T., Pham, V.T., Lin, C., Yang, H.W., Wang, Y.H., Shyu, K.K., Tseng, W.Y., Su, M.Y., Lin, L.Y., Lo, M.T.: Empirical mode decomposition and monogenic signal based approach for quantification of myocardial infarction from MR images. IEEE J. Biomed. Health Inform. 23(2), 731–743 (2019)
4. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer (2015)
5. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)
6. Ninh, Q.C., Tran, T.T., Tran, T.T., Tran, T.A.X., Pham, V.T.: Skin lesion segmentation based on modification of SegNet neural networks. In: Proceedings 2019 6th NAFOSTED Conference on Information and Computer Science (NICS), Hanoi, pp. 575–578 (2020)
7. Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C.: The importance of skip connections in biomedical image segmentation. In: Deep Learning and Data Labeling for Medical Applications, pp. 179–187. Springer (2016)
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
9. Zhang, H., Patel, V.M.: Densely connected pyramid dehazing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3194–3203 (2018)
10. Chen, S., Tan, X., Wang, B., Hu, X.: Reverse attention for salient object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 234–250 (2018)
11. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
12. Woo, S., Park, J., Lee, J.-Y., So Kweon, I.: CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
13. Li, H., Xiong, P., An, J., Wang, L.: Pyramid attention network for semantic segmentation. arXiv preprint arXiv:1805.10180 (2018)
14. Li, C., Tong, Q., Liao, X., Si, W., Sun, Y., Wang, Q., Heng, P.-A.: Attention based hierarchical aggregation network for 3D left atrial segmentation. In: International Workshop on Statistical Atlases and Computational Models of the Heart, pp. 255–264. Springer (2018)
15. Wang, Y., Deng, Z., Hu, X., Zhu, L., Yang, X., Xu, X., Heng, P.-A., Ni, D.: Deep attentional features for prostate segmentation in ultrasound. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 523–530. Springer (2018)
16. Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946 (2019)
17. Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853 (2015)
18. Sudre, C.H., Li, W., Vercauteren, T., Ourselin, S., Cardoso, M.J.: Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 240–248. Springer (2017)
19. Tato, A., Nkambou, R.: Improving adam optimizer (2018)
20. Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.-A., Cetin, I., Lekadir, K., Camara, O., Ballester, M.A.G.: Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Trans. Med. Imaging 37(11), 2514–2525 (2018)
End-to-End Hand Rehabilitation System with Single-Shot Gesture Classification for Stroke Patients

Wai Kin Koh, Quang H. Nguyen, Youheng Ou Yang, Tianma Xu, Binh P. Nguyen, and Matthew Chin Heng Chua
Abstract Rehabilitation of the hands is crucial for stroke survivors to regain their ability to perform activities of daily living. Various technologies have been explored and found to be immature, expensive, or uncomfortable. Existing devices to assist rehabilitation are typically costly, bulky, and difficult to set up. Our proposed solution aims to provide an end-to-end hand rehabilitation system that can be produced at low cost with greater ease of use. It incorporates gamification to motivate stroke survivors to perform physical rehabilitation through an infra-red depth camera and a computer system. MediaPipe was employed for hand detection and hand landmark extraction. A single-shot neural network model is proposed for hand gesture detection with an accuracy rate of 98%. Lastly, a visually interactive game was developed to promote engagement of the user during the performance of rehabilitation.
W. K. Koh · M. C. H. Chua Institute of Systems Science, National University of Singapore, Singapore 119620, Singapore e-mail: [email protected] M. C. H. Chua e-mail: [email protected] Q. H. Nguyen (B) School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi 100000, Vietnam e-mail: [email protected] Y. O. Yang · T. Xu Department of Orthopaedics, Singapore General Hospital, Singapore 169608, Singapore e-mail: [email protected] B. P. Nguyen School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_5
1 Introduction

Stroke is the leading cause of long-term physical disability worldwide [1]. Stroke-related physical disabilities cause severe impairments in activities of daily living and reduce quality of life [2]. Rehabilitation provides an opportunity for stroke survivors to regain the ability to perform activities of daily living [3]. Stroke rehabilitation is an intensive and gradual process that requires several months of treatment, which results in low participation and compliance [4]. Advanced technologies have been applied to hand rehabilitation [5] with positive results in improving the motivation of stroke survivors during participation in physical rehabilitation [6]. Various hand gesture sensing methods have been proposed utilizing ultrasonic waves, Wi-Fi, audio, and sound waves; however, these approaches are still at an early stage of development. Exoskeleton-based robotic systems [7] and soft-robotic gloves [8] have been extensively studied for hand rehabilitation; however, they are limited due to their contact-based nature, which can lead to user discomfort, skin allergy complications, and dermatitis. These robotic systems also require calibration and are often expensive and bulky. Virtual Reality (VR) is an emerging technology in hand rehabilitation [9]; nonetheless, transient dizziness, headache, and pain are frequently reported adverse effects of VR [10]. Infra-red depth camera hand tracking systems address the issues highlighted above. The Microsoft Kinect, the Leap Motion Controller, and the Intel RealSense camera are widely used examples in hand tracking [12–15]. Our team chose to work with the Intel RealSense camera because it is able to perform shoulder detection, in contrast to the Leap Motion Controller, and it has a longer service lifespan than the Microsoft Kinect, which was discontinued in 2018.

In this work, an end-to-end hand rehabilitation system for stroke survivors is proposed. The proposed system can perform hand segmentation, hand landmark extraction, and gesture classification, as well as encourage user motivation in hand rehabilitation exercises via gamification. The architecture, framework, and machine learning techniques used in the system are discussed in Sect. 2. Implementation and discussion of the work are presented in Sect. 3. An exploration of future works as well as the conclusion of the paper can be found in Sect. 4.
2 Methodology

The architecture of the proposed solution is shown in Fig. 1. The input to the system was sourced from an Intel RealSense camera via a standard plug-and-play universal serial bus (USB) interface. The procedure for processing the input is divided into four parts: (1) hand detection using a neural network via MediaPipe, (2) finger joint landmark detection, (3) hand gesture recognition, and (4) gamification.
Fig. 1 Architecture of the proposed system: input from sensor → MediaPipe hand detection using a neural network → 21-point 3D hand-knuckle extraction → DIP, PIP, MCP extraction → gesture recognition using a single-shot neural network → gamification
2.1 MediaPipe

MediaPipe is an open-source framework developed by Google for cross-platform machine learning applications [16]. Customizable perception pipelines have been implemented in MediaPipe to help in rapid prototyping with inference models and other reusable components. MediaPipe was employed for hand detection and hand landmark extraction in this system. MediaPipe employs BlazePalm, a single-shot detector model with an average accuracy of 95.7% for palm detection. After locating the palm, the coordinates of the 21 3D hand-knuckle landmarks are identified via regression. The labelling of the landmarks is illustrated in Fig. 2. To generalise the landmark extraction model, a dataset consisting of real-world hand images, rendered synthetic hand images, and a mixture of these two was used for the training.
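The authors customise MediaPipe's C++ calculator graph (see Sect. 3); purely as an illustration of the same landmark extraction, here is a minimal sketch using MediaPipe's Python Hands solution, which is an assumption of ours and not the setup used in the paper:

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# One static image; in the proposed system frames come from an Intel RealSense camera.
image = cv2.imread("hand.jpg")  # placeholder file name

with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        landmarks = results.multi_hand_landmarks[0].landmark
        # 21 hand-knuckle landmarks with normalised x, y and relative depth z
        for idx, lm in enumerate(landmarks):
            print(idx, lm.x, lm.y, lm.z)
```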
2.2 Derivation of the Degrees of Freedom of Each Finger's Joint

The range of motion of the finger joints with respect to flexion and extension is critical for evaluating rehabilitation outcomes [17]. The positions of the distal interphalangeal (DIP), proximal interphalangeal (PIP), and metacarpophalangeal (MCP) joints are shown in Fig. 2. The angles of the DIP, PIP, and MCP joints are derived from the 21 3D hand-knuckle landmarks obtained from MediaPipe.
Fig. 2 Labels of the 21 points 3D hand-knuckle landmarks (left), and positions of the DIP, PIP and MCP joints (right).
2.3 Gesture Recognition

Gesture recognition methods can be categorised into static and dynamic gesture classification, where static gesture classification involves single frames with no temporal information and dynamic gesture classification is mainly used for time-series data. Additionally, static gesture classification methods can be further divided into supervised and unsupervised, with supervised methods generally being more effective than unsupervised ones. In the proposed system, the gesture is processed frame by frame, hence supervised static gesture classification was selected. Among supervised static gesture classification methods, the Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Artificial Neural Network (ANN) have been widely employed for gesture recognition, with ANN offering better accuracy than SVM and k-NN. Thus, a single-shot neural network model is proposed to classify the hand gesture based on the 21 3D hand-knuckle landmarks. MSRA15, a publicly available hand dataset with 17 types of categorised gestures [18], was selected to verify the performance of ANN, SVM, and k-NN; the details of the comparison are discussed in Sect. 3. MSRA15 is a discrete dataset without any time information. It captures real hand gestures in depth images, and the coordinates of the 21 hand joints are also provided. MSRA15 is a large, diversified dataset with 76,375 hand gestures collected from 9 subjects.
2.4 Gamification

Various studies indicate that serious gaming is currently attracting a lot of attention in the healthcare community. Moreover, a significant improvement compared with
conventional therapy has been shown [19]. Therefore, to increase the engagement of stroke survivors in rehabilitation, an interactive and fun hand-grasping game is proposed.
3 Implementation and Results

MediaPipe was used for hand detection and hand landmark extraction. MediaPipe is designed to allow the developer to prototype a pipeline incrementally. A pipeline, as shown in Fig. 3, is defined as a directed graph of components where each component is a Calculator. The graph is specified using a GraphConfig protocol buffer and then run using a Graph object. Each Calculator is built in the C++ programming language. A few customisations were made to the HandLandmark Calculator to extract the hand landmarks and process them to derive the angles of the DIP, PIP, and MCP joints. The Renderer Calculator was also modified to remove the background. With the coordinates extracted from the 21 3D hand-knuckles, the angles of the DIP, PIP, and MCP joints were derived using Eqs. (1), (2), and (3), respectively. The subscript $c$ in the equations denotes the current landmark, while $c+1$, $c-1$, and $0$ denote the next, previous, and index-0 landmarks. Figure 4 shows the DIP, PIP, and MCP angles derived from the 21 landmarks. These data were processed in each frame.

$$\theta_{DIP} = \operatorname{arctan2}(Y_c - Y_{c+1}, X_c - X_{c+1}) - \operatorname{arctan2}(Y_c - Y_{c-1}, X_c - X_{c-1}) \qquad (1)$$

$$\theta_{PIP} = \operatorname{arctan2}(Y_c - Y_{c+1}, X_c - X_{c+1}) - \operatorname{arctan2}(Y_c - Y_{c-1}, X_c - X_{c-1}) \qquad (2)$$

$$\theta_{MCP} = \operatorname{arctan2}(Y_c - Y_{c+2}, X_c - X_{c+2}) - \operatorname{arctan2}(Y_c - Y_0, X_c - X_0) \qquad (3)$$

where

$$\operatorname{arctan2}(y, x) =
\begin{cases}
\arctan\frac{y}{x} & \text{if } x > 0,\\
\frac{\pi}{2} - \arctan\frac{x}{y} & \text{if } y > 0,\\
-\frac{\pi}{2} - \arctan\frac{x}{y} & \text{if } y < 0,\\
\arctan\frac{y}{x} \pm \pi & \text{if } x < 0,\\
\text{undefined} & \text{if } x = 0 \text{ and } y = 0.
\end{cases} \qquad (4)$$

A single-shot neural network model for hand gesture classification was built in a Jupyter notebook using the Python programming language. The Rectified Linear Unit (ReLU) was used as the activation function and the Softmax function was used at the output layer of the neural network.
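For illustration, a small NumPy sketch of the same angle computation; the helper name and the toy coordinates are ours, and the landmark indices follow MediaPipe's 21-point numbering:

```python
import numpy as np

def joint_angle(landmarks, c, prev_idx, next_idx):
    """Angle at landmark c, following Eqs. (1)-(3): the difference of the
    arctan2 directions towards the next and previous landmarks."""
    xc, yc = landmarks[c]
    xn, yn = landmarks[next_idx]
    xp, yp = landmarks[prev_idx]
    angle = np.arctan2(yc - yn, xc - xn) - np.arctan2(yc - yp, xc - xp)
    return np.degrees(angle)

# Toy 2D coordinates for index-finger landmarks (5: MCP, 6: PIP, 7: DIP, 8: tip) and the wrist (0)
landmarks = {0: (0.0, 0.0), 5: (1.0, 0.0), 6: (1.5, 0.5), 7: (1.8, 1.0), 8: (1.9, 1.5)}

pip_angle = joint_angle(landmarks, c=6, prev_idx=5, next_idx=7)   # Eq. (2)
dip_angle = joint_angle(landmarks, c=7, prev_idx=6, next_idx=8)   # Eq. (1)
mcp_angle = joint_angle(landmarks, c=5, prev_idx=0, next_idx=7)   # Eq. (3): uses c+2 and landmark 0
print(pip_angle, dip_angle, mcp_angle)
```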
Fig. 3 Hand tracking pipeline in MediaPipe: a calculator graph running from input_frames_gpu through the ImageTransformation, Gate, HandDetection, HandLandmark, PreviousLoopback, Merge, and Renderer calculators to output_frames_gpu
The model was trained with the MSRA15 dataset. Four hand gestures (one, two, five, and T) were selected from the MSRA15 dataset. Using Python, the raw dataset was arranged into four rows containing the label and the x, y, and z positions of each landmark, from which the DIP, PIP, and MCP values were then derived. The dataset was then split into training and testing sets with a ratio of 80:20 and saved in separate files for training and testing purposes.
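A minimal Keras sketch of such a single-shot classifier; the number of hidden units and layers and the random stand-in data are assumptions of ours, while the ReLU activations, the softmax output, the four gesture classes, and the 80:20 split come from the text:

```python
import numpy as np
from tensorflow.keras import layers, models

NUM_JOINT_FEATURES = 15   # assumed: DIP, PIP, MCP angles for five fingers
NUM_GESTURES = 4          # the four selected MSRA15 gestures

model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(NUM_JOINT_FEATURES,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(NUM_GESTURES, activation="softmax"),   # one probability per gesture
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Random stand-in data with an 80:20 train/test split, mirroring the described setup
x = np.random.rand(1000, NUM_JOINT_FEATURES).astype("float32")
y = np.random.randint(0, NUM_GESTURES, size=1000)
split = int(0.8 * len(x))
model.fit(x[:split], y[:split], epochs=5, verbose=0)
print(model.evaluate(x[split:], y[split:], verbose=0))
```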
Fig. 4 Angle of DIP, PIP and MCP based on the 21 3D hand-knuckles

Table 1 Performance of the single-shot Neural Network (NN), SVM and k-NN models

Gesture    Precision                 Recall                    F1-score
           NN     SVM    k-NN       NN     SVM    k-NN       NN     SVM    k-NN
0          0.99   0.93   0.76       1.00   0.99   0.92       0.99   0.96   0.83
1          1.00   0.99   0.96       1.00   1.00   1.00       1.00   0.99   0.98
2          0.93   1.00   0.87       1.00   0.82   0.91       0.97   0.90   0.89
3          1.00   0.84   0.94       0.92   0.93   0.68       0.96   0.88   0.79
Average                                                      0.98   0.93   0.88
The single-shot neural network model was further compared with SVM and k-NN models using the same MSRA15 dataset. The classification reports of the models are shown in Table 1. Based on these results, the single-shot neural network achieved a 98% F1-score, better than the 93% and 88% obtained by SVM and k-NN, respectively. The proposed single-shot neural network was shown to have better predictive ability and was therefore a good candidate model to be integrated into the system for hand gesture detection. The Unity game engine was employed to build the gamified exercise. To make the interaction with the stroke survivor more engaging, Brave Bird, a game in which a flying bird tries to avoid bombs running towards it, was developed. Using the gesture recognized by the single-shot neural network model, the bird flies up when the hand is detected in a clenched state and flies down when the hand is in an open state. The user interface of the developed game is illustrated in Fig. 5.
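The comparison in Table 1 can be reproduced in spirit with scikit-learn; a hedged sketch in which the hyperparameters and the random stand-in data are placeholders, not the authors' settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Stand-in joint-angle features and gesture labels for the four selected gestures
x = np.random.rand(1000, 15)
y = np.random.randint(0, 4, size=1000)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

for name, clf in [("SVM", SVC()), ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
    clf.fit(x_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(x_test)))  # precision, recall, F1 per gesture
```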
Fig. 5 Brave Bird, a gamified hand rehabilitation exercise
4 Discussion and Conclusion

At present, the system is only able to track a single hand and is unable to process partial occlusions. Simultaneous two-hand tracking may be developed in future work. The single-shot neural network model could be further improved to accommodate more gestures and even include hand gestures whilst manipulating tools. This would be a significant advancement in post-stroke hand rehabilitation, especially in evaluating functional activities such as holding a fork or spoon for self-feeding. Finally, the output from the hand gesture model could be enhanced by using the MQTT protocol, which would greatly improve the gamification performance. The proposed end-to-end hand rehabilitation system for stroke patients was inexpensive and easy to use. A complete algorithm from hand detection to hand gesture recognition was developed. The ranges of motion of the MCP, PIP, and DIP joints were derived to allow the therapist to easily monitor the progress of the client. An interactive game was deployed to increase the patient's motivation via gamification of hand rehabilitation tasks.

Acknowledgment This project is funded by Tote Board Enabling Lives Initiative Grant and supported by SG Enable.
References

1. Langhorne, P., et al.: Stroke rehabilitation. Lancet 377(9778), 1693–1702 (2011)
2. Hreha, K., et al.: The impact of stroke on psychological and physical function outcomes in people with long-term physical disability. Disabil. Health J. 100–919 (2020)
3. Veisi-Pirkoohi, S., et al.: Efficacy of RehaCom cognitive rehabilitation software in activities of daily living, attention and response control in chronic stroke patients. J. Clin. Neurosci. 71, 101–107 (2020)
4. Dowling, A.V., et al.: An adaptive home-use robotic rehabilitation system for the upper body. IEEE J. Transl. Eng. Heal. Med. 2, 1–10 (2014)
5. Levanon, Y.: The advantages and disadvantages of using high technology in hand rehabilitation. J. Hand Ther. 26(2), 179–183 (2013)
6. Gorsic, M., et al.: Competitive and cooperative arm rehabilitation games played by a patient and unimpaired person: effects on motivation and exercise intensity. J. Neuroeng. Rehabil. 14(1), 1–18 (2017)
7. Iqbal, J., et al.: A novel exoskeleton robotic system for hand rehabilitation - conceptualization to prototyping. Biocybern. Biomed. Eng. 34(2), 79–89 (2014)
8. Polygerinos, P., et al.: Soft robotic glove for combined assistance and at-home rehabilitation. Rob. Auton. Syst. 73, 135–143 (2015)
9. Liao, Y., et al.: A review of computational approaches for evaluation of rehabilitation exercises. Comput. Biol. Med. 119, 1–29 (2020)
10. Laver, K.E., et al.: Virtual reality for stroke rehabilitation. Cochrane Database Syst. Rev. 2017(11) (2017)
11. Suarez, J., Murphy, R.R.: Hand gesture recognition with depth images: a review. In: Proceeding of IEEE International Working Robot Human Interaction Communication, pp. 411–417. IEEE, Paris, France (2012)
12. Guzsvinecz, T., et al.: Suitability of the Kinect sensor and leap motion controller - a literature review. Sensors 19(5), 1072 (2019)
13. Wen, R., et al.: Hand gesture guided robot-assisted surgery based on a direct augmented reality interface. Comput. Methods Programs Biomed. 116(2), 68–80 (2014)
14. Nguyen, B.P., et al.: Robust biometric recognition from palm depth images for gloved hands. IEEE Trans. Hum. Mach. Syst. 45(6), 799–804 (2015)
15. Wen, R., et al.: In situ spatial AR surgical planning using Projector-Kinect system. In: Proceeding of 4th Symposium Information and Communication Technology (SoICT 2013), pp. 164–171. ACM, Hanoi, Vietnam (2010)
16. Lugaresi, C., et al.: MediaPipe: a framework for perceiving and augmenting reality (2019)
17. Moreira, A.H.J., et al.: Real-time hand tracking for rehabilitation and character animation. In: Proceeding of IEEE 3rd International Conference on Serious Games Application, pp. 1–8. IEEE, Rio de Janeiro, Brazil (2014)
18. Rusydi, M.I., et al.: Recognition of sign language hand gestures using leap motion sensor based on threshold and ANN models. Bull. Electr. Eng. Inform. 9(2), 473–483 (2020)
19. Bonnechere, B., et al.: The use of commercial video games in rehabilitation: a systematic review. Int. J. Rehabil. Res. 39(4), 277–290 (2016)
Feature Selection Based on Shapley Additive Explanations on Metagenomic Data for Colorectal Cancer Diagnosis

Nguyen Thanh-Hai, Toan Bao Tran, Nhi Yen Kim Phan, Tran Thanh Dien, and Nguyen Thai-Nghe
Abstract Personalized medicine is one of the hottest current approaches to taking care of and improving human health. Scientists working on personalized medicine usually consider metagenomic data a valuable data source for developing and proposing methods for disease treatment. Processing metagenomic data is challenging because of its high dimensionality and complexity. Numerous studies have attempted to find biomarkers, i.e., medical signs that are significantly related to diseases. In this study, we propose an approach based on Shapley Additive Explanations, a model explainability method, to select valuable features from metagenomic data and improve disease prediction tasks. The proposed feature selection method is evaluated on more than 500 colorectal cancer samples coming from various geographic regions such as France, China, the United States, Austria, and Germany. The set of 10 features selected with Shapley Additive Explanations achieves significant improvements compared to the feature selection method based on the Pearson coefficient and obtains performance comparable to the original set of approximately 2000 features.
N. Thanh-Hai (B) · N. Y. K. Phan · T. T. Dien · N. Thai-Nghe College of Information and Communication Technology, Can Tho University, Can Tho 900100, Vietnam e-mail: [email protected] T. T. Dien e-mail: [email protected] N. Thai-Nghe e-mail: [email protected] T. B. Tran Center of Software Engineering, Duy Tan University, Da Nang 550000, Vietnam e-mail: [email protected] Institute of Research and Development, Duy Tan University, Da Nang 550000, Vietnam © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_6
1 Introduction

Medicine is one of the fields that has received considerable attention, and many researchers have made great contributions to it. Personalized medicine is an approach that is being researched and applied to improve the effectiveness of disease diagnosis and treatment. This approach uses the genetic information of each patient to build a model that divides patients into different groups and treats them with medical interventions suited to each patient. The genome plays an important role in the human body, and based on an individual patient's genetic map, researchers have applied sophisticated techniques and in-depth studies to improve disease diagnosis performance. Metagenomics is the science of studying genetic material that can be obtained directly, without the need for culturing. Metagenomics research has achieved many remarkable results. Many studies have shown that the microbiota in the human body provides much important information about human health and also greatly affects the diseases we suffer from. The vast majority of bacteria are still undetected and the links between them remain largely unexplored, so the challenge is to explore the remaining bacteria and to perform in-depth studies on metagenomic datasets, whose information is still very complicated.

In this study, we propose feature selection methods based on Pearson correlation and Shapley Additive Explanations (SHAP) [1] for choosing features from four metagenomic datasets, and we investigate the performance through several classification tasks with a machine learning model. Our contributions include:

• We consider the Random Forest model to distinguish colorectal cancer samples and select the most important features from the high-dimensional original datasets.
• We leverage the advantages of the Pearson and SHAP methods to identify the significant features.
• We conduct several classification tasks to investigate the performance of the proposed method on the four considered datasets. The performance on the original datasets and on the selected important features is also compared.

The rest of this study is organized as follows. Some state-of-the-art work related to ours is presented in Sect. 2. The detailed information on the metagenomic datasets is presented in Sect. 3. We introduce the implementation of the Random Forest model and the feature selection methods in Sect. 4. The performance measured by Accuracy, Area Under the Curve (AUC), and Matthews correlation coefficient (MCC) is presented in Sect. 5. Finally, Sect. 6 contains our conclusions.
2 Related Work

In recent years, several studies have focused on the development of metagenomic sequencing due to its advantages. One issue is that sequenced microbial genomes are typically analysed with gene homology-based methods, which are not efficient at detecting previously undiscovered viral sequences. The study [2] presented an approach for recognizing viral sequences in metagenomic data via DeepVirFinder, and the proposed approach outperformed the state-of-the-art method. Furthermore, extending the training data by appending purified viral sequences can increase the performance. The authors also applied the proposed approach to real human gut metagenomic samples: they identified 51,138 viral sequences in patients with colorectal carcinoma (CRC) and revealed that viruses can play important roles in CRC. Using k-mer sequence signatures to discriminate the viral from the bacterial signal is the purpose of the VirFinder tool [3]. The proposed method can discover unknown viruses, and the authors investigated the performance of using ecosystem-focused models on aquatic metagenomic data. They reported the performance on the training set and the limitation of retrieving low-abundance viral sequences in metagenomes. Furthermore, they also discussed potential biases for viral detection in datasets and suggested solutions for increasing the performance. The approach in study [4] addressed the identification of new viruses that are not similar to previously sequenced ones. The authors developed software, namely Host Taxon Predictor (HTP), for classifying between phages and eukaryotic viruses. The HTP performance was investigated on newly identified viral genomes and genome fragments. The diversity and contribution of the microbial community to the natural environment were explored and introduced in research [5, 6]. Based on RNA sequencing, the researchers in [7] extracted the first genome of SARS-CoV-2. Using Random Forests and Gradient Boosting Trees models, the authors in [8] extracted features from the Amazon Fine Food Reviews dataset.
Fig. 1 The SHAP score visualization of ten important species on the Feng dataset
Table 1 Additional information on the datasets of Feng [21], Vogtmann [22], Yu [23], and Zeller [24]

Dataset     Factors            Healthy          Patients         Total samples    Features
Feng        No. of samples     63               46               109              1981
            Gender             Male: 37         Male: 28         Male: 65
                               Female: 26       Female: 18       Female: 44
Vogtmann    No. of samples     52               48               100              1976
            Gender             Male: 37         Male: 35         Male: 72
                               Female: 15       Female: 13       Female: 28
Yu          No. of samples     92               73               165              1932
            Gender             Male: 51         Male: 47         Male: 98
                               Female: 41       Female: 26       Female: 67
Zeller      No. of samples     64               88               152              1980
            Gender             Male: 32         Male: 53         Male: 85
                               Female: 32       Female: 35       Female: 67
Studies [9–11] have demonstrated the discovery of new microbial communities by investigating 16S rRNA sequences from such undefined microorganisms. The work in [12] proposed virMine, an approach for recovering viral genomes from metagenomes representative of viral, or of mixed viral and bacterial, communities. virMine removes non-viral sequences and repeats this process to find viral genomes, so that new viruses and species can be discovered more easily. In addition, the authors also evaluated the performance on microbial communities from three different environments. Through studies [13–16], we can see a close relationship between individualized medicine and the microbiota; at the same time, this biological system also plays a very important role in human health. Feature selection is the process of reducing the input data dimension to improve performance and reduce computation time. Pearson correlation is a method that is used quite often to select features in several studies: by ranking the correlations, the most significant features can be selected for further work. The authors in [17] used the Pearson chi-squared method in biomedical data analysis and found that it showed encouraging results compared to FCBF, CorrSF, ReliefF, and ConnSF. The Pearson correlation coefficient is also applied to identify daily activities in a smart home in [18]. Furthermore, studies [19] and [20] used Pearson's method to choose features, and both report positive results.
3 Datasets for Colorectal Cancer Prediction Tasks

Fig. 2 The SHAP score visualization of 10 important species on the Vogtmann dataset

As mentioned above, we used four metagenomic datasets in this study. Specifically, the considered datasets are related to colorectal cancer and come from four cohorts, namely Feng [21], Vogtmann [22], Yu [23], and Zeller [24]. The data were collected over two years, from 2014 to 2016, on 255 patients and 271 healthy individuals. The Feng dataset includes samples from 46 patients and 63 healthy individuals, for a total of 109; the Vogtmann dataset comprises 48 patients and 52 healthy individuals, whereas the Yu and Zeller datasets consist of 73 and 88 CRC patients and 92 and 64 non-CRC individuals, respectively. Besides, the number of features in each dataset is huge; in other words, the considered datasets are high-dimensional, with the smallest dataset including 1932 features and the largest containing 1981 features. Additional information on the four metagenomic datasets is presented in Table 1. Each feature represents a species abundance, i.e., the ratio of a bacterial species in the human gut of a sample. The total abundance of all features in the same sample sums up to 1, as expressed by the following formula (Eq. 1):

$$\sum_{i=1}^{k} f_i = 1 \qquad (1)$$
where:
• $k$ is the number of features of a sample.
• $f_i$ is the value of the $i$-th feature.
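For illustration, a short pandas sketch of this normalization; the random stand-in counts and the column names are our placeholders, not the actual data:

```python
import numpy as np
import pandas as pd

# Stand-in raw counts: rows are samples, columns are bacterial species
rng = np.random.default_rng(1)
counts = pd.DataFrame(rng.integers(0, 100, size=(5, 8)),
                      columns=[f"species_{i}" for i in range(8)])

# Normalize each sample so that its feature values sum to 1, as in Eq. (1)
abundance = counts.div(counts.sum(axis=1), axis=0)
print(abundance.sum(axis=1))   # every row now sums to 1
```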
4 Learning Model and Feature Selection Methods Based on Pearson and SHAP

Fig. 3 The SHAP score visualization of 10 important species on the Yu dataset

4.1 Learning Model

We implemented a Random Forest with 500 decision trees and a maximum tree depth of 4. The quality of the tree classifier is evaluated by the Gini impurity, which measures how frequently a randomly chosen element from the annotated dataset would be labelled incorrectly according to the distribution of labels in the subset. Furthermore, during training, we can compute the contribution of each feature to the decrease of the weighted impurity and thus obtain the feature importance. We evaluated the efficiency of the proposed method on the classification task by computing the Accuracy, AUC, and MCC. The formulas of Accuracy and MCC are presented in Eqs. (2) and (3):

$$ACC = \frac{TP + TN}{TP + FN + TN + FP} \qquad (2)$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TN + FP)(TP + FN)(TN + FN)}} \qquad (3)$$
where
• FP stands for False Positive.
• FN stands for False Negative.
• TP stands for True Positive.
• TN stands for True Negative.
Feature Selection Using Shapley Additive Explanations ...
75
Fig. 4 The SHAP score visualization of 10 important species on the Zeller dataset
Fig. 5 Different SHAP, Pearson and Original feature selection method Comparison by Random Forest model, through ACC measure
whereas +1 is a total positive correlation, 0 is non-correlation, and −1 is a total negative correlation. The Pearson correlation coefficient can be computed as the formula in Eq. 4. We applied the Pearson to four considered datasets and selected the ten most positive correlation of the features on each dataset and leveraged those features for classifying the colorectal cancer diseases. n x y − ( x)( y) (4) r= [n x 2 − ( x)2 ][n y 2 − ( y)2 ] where • r = Pearson Coefficient
76
N. Thanh-Hai et al.
Fig. 6 Different SHAP, Pearson and Original feature selection method Comparison by Random Forest model, through AUC measure
• • • • • •
n = number of the pairs of the stock x y = sum of products of the paired stocks x = sum of the x scores y 2= sum of the y scores x 2 = sum of the squared x scores y = sum of the squared y scores
Shapley Additive Explanations (SHAP) is an efficient approach to understanding the learning model. Specifically, the SHAP process will in turn for each feature to randomly combine with other features together and generate the output for one case, each feature will be gained several points depending on the contributions to the output. Finally, the SHAP score will be calculated for a given feature, which is the average result of all changes in the predicted output. We trained four datasets with the Random Forest model and explained the model with the SHAP method. Then, we extracted the most crucial features from the datasets by computing the SHAP score of each feature. The Shapley value explanation is represented as an additive feature attribution method and specifies the explanation as to the Eq. 5. We also presented the SHAP score of ten crucial features on Feng, Vogtmann, Yu, and Zeller dataset in Figs. 1, 2, 3 and 4 respectively. g(z ) = φ0 +
M
φj
(5)
j=1
where g is the explanation machine learning model, z is the coalition vector M is the maximum coalition size and φ j is the feature attribution for feature j.
Feature Selection Using Shapley Additive Explanations ...
77
Table 2 The mean performance of SHAP, Pearson feature selection methods, and the original dataset by Random Forest model on 10-fold cross-validation. The standard deviation is presented in the round bracket Dataset Features Accuracy AUC MCC Feng Vogtmann Yu Zeller Feng Vogtmann Yu Zeller Feng Vogtmann Yu Zeller
SHAP Top 10 features
Pearson Top 10 features
Original set of features
0.8091 (0.1250) 0.7000 (0.0894) 0.7154 (0.1107) 0.7896 (0.0974) 0.6891 (0.1071) 0.5900 (0.0943) 0.6062 (0.1323) 0.7162 (0.0996) 0.7073 (0.1043) 0.6000 (0.1897) 0.6375 (0.1405) 0.7763 (0.1190)
0.8362 (0.1647) 0.7918 (0.1145) 0.7219 (0.1336) 0.8571 (0.1292) 0.7627 (0.1606) 0.6283 (0.1218) 0.6468 (0.1899) 0.7110 (0.0714) 0.7757 (0.0867) 0.6647 (0.1799) 0.7111 (0.1172) 0.8126 (0.1572)
0.6145 (0.2537) 0.4269 (0.1772) 0.4339 (0.2238) 0.5750 (0.1956) 0.3618 (0.2418) 0.2575 (0.2305) 0.2681 (0.2835) 0.4135 (0.2039) 0.4012 (0.2325) 0.1930 (0.3826) 0.2631 (0.2977) 0.5526 (0.2484)
5 Experimental Results We investigated the performance of the proposed method on 10-fold cross-validation with the Random Forest model. Table 2 presents the performance of SHAP, Pearson selection methods, and the original dataset. The standard deviation of the Accuracy, AUC, and MCC are also presented by the value in the round bracket along with the performance. The highlighted values in Table 2 present the best Accuracy, AUC, or MCC on the Feng, Vogtmann, Yu, and Zeller dataset. The performance of the feature selection method by SHAP outperforms the Pearson method and even the original set of features on four considered datasets. More specifically, by SHAP explanation, the Accuracy and MCC obtained the highest on the Feng dataset with 0.8091 and 0.6145 respectively, whereas the AUC of 0.8571 reached the top on the Zeller dataset. Furthermore, the Pearson feature selection method obtained the lowest performance in comparison with SHAP and the original set of features on the whole considered datasets. WE also visualized the comparison between the Accuracy, AUC, and MCC on four datasets in Figs. 5, 6 and 7 respectively. The Accuracy comparison in Fig. 5, the SHAP feature selection method obtained 0.8091, 0.7, 0.7154, and 0.7896 of Accuracy on the Feng, Vogtmann, Yu, and Zeller dataset respectively. It also outperforms the Person feature selection method and the original set of features. The most significant difference is on the Feng dataset. The Pearson method gained 0.6891, the original set of features obtained 0.7073 whereas the SHAP method reached 0.8091 of Accuracy. The AUC of 0.8571 on the Zeller dataset by the SHAP method is better in comparison with the rest but the AUC on the Feng, Vogtmann, and Yu dataset is also close. The MCC retrieved 0.6145, 0.4269, 0.4339, and 0.575 on the Feng, Vogtmann, Yu, and Zeller respectively. As observed
Fig. 7 Comparison of the SHAP, Pearson, and original feature selection methods with the Random Forest model, in terms of the MCC measure
from the results, we obtain significant results in AUC on 2 datasets (using SHAP) including Vogtmann, Zeller with p-values of 0.0089, 0.0101, respectively.
6 Conclusion In this study, we present a method of feature selection which obtains promising performance in prediction tasks in Accuracy, AUC, and MCC. As proven from the experiments, SHAP reveals as an efficient approach for feature selection. The selected features with SHAP outperform the features which are chosen by the Pearson coefficient and reach comparative results in Colorectal Cancer prediction. We also observe differences in disease prediction performances among Colorectal Cancer samples from various regions. This requires further research to investigate and explore more about the results to obtain appropriate explanations. An optimal set of selected features should be taken into account to enhance the performance. Acknowledgement Tran Bao Toan was funded by Vingroup Joint Stock Company and supported by the Domestic Master/Ph.D. Scholarship of Vingroup Innovation Foundation (VINIF), Vingroup Big Data Institute (VINBIGDATA), code VINIF.2020.ThS63.
References 1. Lundberg, S.M., Lee, S.-I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems, pp. 4765–4774 (2017)
Feature Selection Using Shapley Additive Explanations ...
79
2. Ren, J., Song, K., Deng, C., Ahlgren, N.A., Fuhrman, J.A., Li, Y., Xie, X., Poplin, R., Sun, F.: Identifying viruses from metagenomic data using deep learning. Quant. Biol. 8, 64–77 (2020) 3. Ponsero, A.J., Hurwitz, B.L.: The promises and pitfalls of machine learning for detecting viruses in aquatic metagenomes. Front. Microbiol. 10, 806 (2019) 4. Gałan, W., et al.: Host taxon predictor - a tool for predicting taxon of the host of a newly discovered virus. Sci. Rep. 9, 1–13 (2019) 5. Chroneos, Z.C.: Metagenomics: theory, methods, and applications. Hum. Genomics 4(4), 282– 283 (2010). https://doi.org/10.1186/1479-7364-4-4-28211 6. Ponsero, A.J., Hurwitz, B.L.: The promises and pitfalls of machine learning for detecting viruses in aquatic metagenomes. Front. Microbiol. 10, 806 (2019) 7. Udugama, B., et al.: DiagnosingCOVID-19: the disease and tools for detection. ACS Nano 14(4), 3822–3835 (2020) 8. Tran, P.Q., et al.: Effective opinion words extraction for food reviews classification. Int. J. Adv. Comput. Sci. Appl. (IJACSA), 11(7) (2020). http://dx.doi.org/10.14569/IJACSA.2020. 0110755 9. Jang, S.J., Ho, P.T., Jun, S.Y., Kim, D., Won, Y.J.: Dataset supporting description of the new mussel species of genus Gigantidas (Bivalvia: Mytilidae) and metagenomic data of bacterial community in the host mussel gill tissue. Data Brief 30, 105651 (2020). https://doi.org/10. 1016/j.dib.2020.105651 10. Ma, B., France, M., Ravel, J.: Meta-Pangenome: at the crossroad of pangenomics and metagenomics (2020). https://doi.org/10.1007/978-3-030-38281-09 11. Handelsman, J.: Metagenomics: application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev. 68(4), 669–685 (2004). https://doi.org/10.1128/MMBR.68.4.669-685. 20046 12. Garretto, A., Hatzopoulos, T., Putonti, C.: virMine: automated detection of viral sequences from complex metagenomic samples. PeerJ 7, e6695 (2019). https://doi.org/10.7717/peerj. 6695 13. Petrosino, J.F.: The microbiome in precision medicine: the way forward. Genome Med. 10, 12 (2018). https://doi.org/10.1186/s13073-018-0525-6 14. Behrouzi, A., et al.: The significance of microbiome in personalized medicine. Clin. Transl. Med. 8(1), 16 (2019). https://doi.org/10.1186/s40169-019-0232-y 15. Gilbert, J.A., Quinn, R.A., Debelius, J., et al.: Microbiome-wide association studies link dynamic microbial consortia to disease. Nature 535(7610), 94–103 (2016). https://doi.org/ 10.1038/nature188504 16. Kashyap, P.C., et al.: Microbiome at the frontier of personalized medicine. Mayo Clin. Proc. 92(12), 1855–1864 (2017). https://doi.org/10.1016/j.mayocp.2017.10.0043 17. Biesiada, J., Duch, W.: Feature selection for high-dimensional data — a pearson redundancy based filter. In: Kurzynski, M., Puchala, E., Wozniak, M., Zolnierek, A. (eds.) Computer Recognition Systems 2. Advances in Soft Computing, vol. 45. Springer, Berlin, Heidelberg (2007). https://doi.org/10.1007/978-3-540-75175-5_30 18. Liu, Y., Mu, Y., Chen, K., et al.: Daily activity feature selection in smart homes based on pearson correlation coefficient. Neural Process. Lett. 51, 1771–1787 (2020). https://doi.org/ 10.1007/s11063-019-10185-8 19. Risqiwati, D., Wibawa, A.D., Pane, E.S., Islamiyah, W.R., Tyas, A.E., Purnomo, M.H.: Feature selection for EEG-based fatigue analysis using pearson correlation. In: 2020 International Seminar on Intelligent Technology and Its Applications (ISITIA), Surabaya, Indonesia, pp. 164–169 (2020). https://doi.org/10.1109/ISITIA49792.2020.9163760 20. 
Kalaiselvi, B., Thangamani, M.: An efficient Pearson correlation based improved random forest classification for protein structure prediction techniques. Measurement 162 (2020). https://doi. org/10.1016/j.measurement.2020.107885 21. Feng, Q., et al.: Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat. Commun. 6, 6528 (2015). https://doi.org/10.1038/ncomms7528
22. Vogtmann, E., Hua, X., Zeller, G., Sunagawa, S., Voigt, A.Y., Hercog, R., Goedert, J.J., Shi, J., Bork, P., Sinha, R.: Colorectal cancer and the human gut microbiome: reproducibility with whole-genome shotgun sequencing. PLoS ONE 11(5), e0155362 (2016). https://doi.org/10.1371/journal.pone.0155362
23. Yu, J., et al.: Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66(1), 70–78 (2017). https://doi.org/10.1136/gutjnl-2015-309800
24. Zeller, G., Tap, J., Voigt, A.Y., Sunagawa, S., Kultima, J.R., Costea, P.I., Amiot, A., Böhm, J., Brunetti, F., Habermann, N., Hercog, R., Koch, M., Luciani, A., Mende, D.R., Schneider, M.A., Schrotz-King, P., Tournigand, C., Tran Van Nhieu, J., Yamada, T., Zimmermann, J., Benes, V., Kloor, M., Ulrich, C.M., von Knebel Doeberitz, M., Sobhani, I., Bork, P.: Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10(11), 766 (2014). https://doi.org/10.15252/msb.20145645
25. South, J., Blass, B.: The Future of Modern Genomics. Blackwell, London (2001)
Clinical Decision Support Systems for Pneumonia Diagnosis Using Gradient-Weighted Class Activation Mapping and Convolutional Neural Networks

Thao Minh Nguyen Phan and Hai Thanh Nguyen

Abstract In recent years, Deep Learning (DL) has gained great achievements in medicine. More specifically, DL techniques have had unprecedented success when applied to Chest X-Ray (CXR) images for disease diagnosis. Numerous scientists have attempted to develop efficient image-based diagnosis methods using DL algorithms. Their proposed methods can yield very reasonable performance on prediction tasks, but it is very hard to interpret the generated output from such deep learning algorithms. In this study, we propose a Convolutional Neural Network (CNN) architecture combining the Gradient-weighted Class Activation Mapping (Grad-CAM) algorithm to discriminate between pneumonia patients and healthy controls as well as provide explanations for the results generated by the proposed CNN architecture. The explanations include regions of interest that can be signs of the considered disease. As shown by the results, the proposed method has achieved promising performance and is expected to help radiologists and doctors in the diagnosis process.
1 Introduction

Pneumonia is one of the most infectious diseases causing death worldwide. According to the World Health Organization (WHO), by September 2020, there had been up to 31.7 million confirmed cases of COVID-19, including approximately 1 million deaths globally [1]. Pneumonia is dangerous not only to babies and young children but also to people who are over 65 years of age and/or have health problems or weak immune systems. Pneumonia is an infection of the lung parenchyma, including inflammation of the alveoli, alveolar tubules, and sacs, primary bronchiolitis, or interstitial inflammation of the lungs. Community-acquired pneumonia includes pulmonary infections occurring outside of the hospital, manifested as lobar pneumonia, spot pneumonia, or atypical pneumonia caused by bacteria, viruses, fungi, and some other agents.

T. M. N. Phan · H. T. Nguyen (B) College of Information and Communication Technology, Can Tho University, Can Tho, Vietnam e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_7
Nowadays, doctors can use numerous methods to diagnose pneumonia. Pneumonia diagnosis is usually based on the patient's history or is conducted by performing a physical exam, blood tests, and sputum culture. CXR images provide an overview of the patient's chest: they exhibit not only the lungs but also other nearby structures. Such CXR images therefore play an important and very meaningful role, and they are used for assessing and detecting the disease at an early stage to prevent complications, including death. In particular, Coronavirus disease (COVID-19), which often manifests as pneumonia, has caused a large number of deaths and has spread explosively worldwide since the beginning of 2020. Due to the rapidly increasing number of new and suspected COVID-19 cases, applying Artificial Intelligence (AI) in the diagnosis process should be promoted. Medical image analysis for decision making and pneumonia diagnosis is one of the most promising research areas. Although CXR images from patients can contain anomalous areas that can be signs for detecting the disease, it is difficult to find such infected areas with traditional methods. In recent years, the complexity of medical data sources has caused difficulties in analyzing and diagnosing the disease. Therefore, researchers have deployed deep learning algorithms and image processing-based systems to rapidly process hundreds of X-rays and accelerate the diagnosis of pneumonia. Moreover, the improvements in image classification bring strong motivation for the development of medical image-based diagnosis. However, the process of annotating medical images relies on medical professional knowledge, health industry standards, and the health system. It is necessary to consider optimal solutions in which the fusion of deep learning classifiers and medical images can provide fast and accurate detection of pneumonia. Deep learning approaches are applied to develop diagnosis models in medicine. However, these models still operate as black boxes whose generated output is hard to explain. Therefore, we construct a CNN architecture designed from scratch and then apply the Grad-CAM method to visualize explanations of the output of the proposed CNN. Besides, we compare our proposed method to several pre-trained models as well as a variety of existing models for detecting pneumonia using CXR images. This study provides the following contributions: • We propose a CNN architecture suitable for discriminating between healthy individuals and pneumonia samples. In order to improve the prediction performance, we have deployed some data pre-processing and data augmentation techniques. • We also construct a website-based decision support system that is expected to assist radiologists in diagnosing pneumonia using chest X-ray images. • We compare the proposed CNN architecture to several pre-trained models including ResNet50 [2], VGG16 [3], VGG19 [3], and MobileNetV2 [4]. Our proposed method has achieved better prediction performance on various metrics. • We present a way to use Gradient-weighted Class Activation Mapping (Grad-CAM) [6] to detect the affected lung regions that can be signs for pneumonia diagnosis.
In the remainder of this study, we present some state-of-the-art methods related to our work in Sect. 2 and introduce the dataset for the pneumonia diagnosis experiments in Sect. 3. Next, the methods which we compare and choose for the experiments are presented in Sect. 4. In Sect. 5, the experimental results of our proposed method are exhibited. Finally, Sect. 6 summarizes some important remarks of the research.
2 Related Work

The issue of classifying CXR images has been significantly investigated in the medical diagnosis domain. In this section, we present a brief review of some related contributions from the existing literature. Stephen et al. [7] implemented a CNN model from scratch with a total of 19 layers for classifying and detecting pneumonia from CXR images. Data were processed with a data augmentation method in 7 different ways to avoid overfitting and to reduce the generalization error caused by the small size of the dataset. The authors in [8] presented performance achievements on hyperspectral image classification tasks based on an approach of embedding a deep feature manifold. The authors in [5] proposed two CNN models, one with a dropout layer and one without. A series of convolution and max-pooling layers performed as a feature extractor. The testing accuracies for the four scenarios were 90.68, 89.3, 79.8, and 74.9%, respectively. Apart from these studies, the authors in [9] applied transfer learning with the Xception [10], VGG16 [3], and VGG19 [3] models. These studies achieved accuracies of 82, 87, and 92%, respectively. Many researchers have solved the problem of image classification with high accuracy.
3 Chest X-Ray Images for Pneumonia Classification Experiments

The original CXR image dataset [7], 1.16 GB in size, was collected from Kaggle and includes 5856 JPEG images. The images are divided into 3 folders: Train, Test, and Val. Each folder is divided into 2 sub-folders, namely Pneumonia and Normal. The Pneumonia sub-folder contains samples from pneumonia patients, while Normal includes healthy controls. The considered chest X-ray images were classified manually by specialists. CXR images of one-to-five-year-old patients were selected from Guangzhou Women and Children's Medical Center, Guangzhou. We rearranged the entire dataset into a training set and a validation set. A total of 5271 images were allocated to the training set and 585 images were assigned to the validation set. Table 1 shows information on the considered images.
Table 1 Chest X-ray images dataset description (unit: the number of images)

Classes   | Training | Testing | Total
Normal    | 1425     | 158     | 1583
Pneumonia | 3846     | 427     | 4273
Fig. 1 The proposed architecture for the Pneumonia Diagnosis Support System
4 Method

The method begins with the original images collected from the X-ray machine. After that, they are resized to 150 × 150 before performing other tasks. The Keras library with TensorFlow was used to implement the proposed CNN architecture. Then, we deploy the Grad-CAM method for the explanation visualization tasks. We also experiment with various well-known pre-trained models to compare with the proposed CNN model. The proposed method is shown in Fig. 1.
4.1 Data Augmentation and Transfer Learning

4.1.1 Data Augmentation
The data augmentation method is explored with the following parameters: shifting images horizontally and vertically by 10% of the width and height, zooming by 20% on some training images, flipping images horizontally, and randomly rotating several training images by angles of up to 30°. A minimal sketch of this configuration is given below.
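As an illustration, the following minimal sketch shows how such an augmentation policy could be configured with the Keras ImageDataGenerator; the exact generator settings used by the authors are not listed in the text, so the parameter values below simply mirror the description above, and the directory path is hypothetical.

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation policy mirroring the description: 10% shifts, 20% zoom,
# horizontal flips and random rotations of up to 30 degrees.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,          # assumption: pixel values scaled to [0, 1]
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
    rotation_range=30,
)

# "train_dir" is a hypothetical path to the Train folder described in Sect. 3.
train_generator = train_datagen.flow_from_directory(
    "train_dir",
    target_size=(150, 150),
    batch_size=32,              # assumption: batch size is not stated in the text
    class_mode="categorical",
)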
Fig. 2 The proposed Convolutional Neural Network for Pneumonia Classification
4.1.2 Transfer Learning
Transfer learning provides an approach to improve the performance of deep learning algorithms, especially CNNs. We reused knowledge acquired for one task to solve related ones. There are three different transfer learning strategies for CNNs: feature extraction, fine-tuning, and pre-trained models. We employed four pre-trained models, namely MobileNetV2 [4], VGG16 [3], VGG19 [3], and ResNet50 [2], which were already trained on the ImageNet dataset. A sketch of this setup follows.
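The chapter does not give the exact fine-tuning code, so the following is only a hedged sketch, assuming a VGG16 backbone from Keras Applications with ImageNet weights, a frozen convolutional base, and a small classification head (whose size is an assumption) for the two classes Normal and Pneumonia.

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Load the ImageNet-pretrained convolutional base without its classifier head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(150, 150, 3))
base.trainable = False  # assumption: the base is frozen before fine-tuning

# Small classification head for the binary Normal/Pneumonia task.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),   # assumption: head size not stated
    layers.Dropout(0.2),
    layers.Dense(2, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])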
4.2 The Proposed Convolutional Neural Network Architecture

The architecture of the proposed CNN classifier is presented in Fig. 2. Initially, we employ a simple classification model receiving a 150 × 150 image as the input. The image is convolved with 32 filters at the first convolutional layer (followed by a ReLU activation function). The architecture has 5 convolutional layers to better extract detailed features, and 2 × 2 max-pooling layers are used after each convolutional layer. There are two dense layers: the first has 128 units with ReLU activation, and the output layer uses the softmax function. A dropout rate of 0.2 is applied. The learning rate of the model is reduced to 0.000001 to reduce overfitting. We use the Adam optimizer with categorical cross-entropy as the cost function. The number of epochs is set to 20 based on training and testing of several CNN models. The proposed algorithm focuses on binary classification of the CXR images for fast and accurate detection of pneumonia. A sketch of this architecture is given below.
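Since only the first layer's filter count, the pooling size, the dense/dropout settings and the optimizer are stated, the sketch below fills in the remaining details (the deeper layers' filter counts and kernel sizes) with assumed values and should be read as an approximation of Fig. 2 rather than the exact model.

from tensorflow.keras import layers, models, optimizers

def build_proposed_cnn(input_shape=(150, 150, 3), n_classes=2):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Five convolution + max-pooling blocks; only the first block's 32 filters
    # are given in the text, the later filter counts (64, 64, 128, 128) are assumed.
    for filters in (32, 64, 64, 128, 128):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(n_classes, activation="softmax"))

    model.compile(optimizer=optimizers.Adam(learning_rate=1e-6),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_proposed_cnn()
model.summary()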
Fig. 3 An explanatory structure of the lung [19, 20]
4.3 Explainable Deep Learning Using Grad-CAM

In CXR images from pneumonia patients (Fig. 3), abnormalities are usually exhibited either as areas of increased density (opacities) or areas of decreased density. The disease signs often begin within the alveoli and spread from one alveolus to another area, with signs including triangular opacities on the side of the hilum or at the lower outer lung field, cloud-shaped opacities, and a possibly blurred diaphragmatic angle. The signs on the X-ray are ill-defined homogeneous opacity obscuring vessels, the silhouette sign including loss of the lung or soft-tissue interface, and air bronchograms extending to the pleura or fissure but not crossing it, without volume loss [12]. Although deep learning techniques are widely adopted, they still work mostly as black boxes. Therefore, understanding the reasons behind the predictions is quite important in assessing the reliability of a model. Class Activation Maps (CAM) help visualize and interpret the predictions of a model [11]. The output of the global average pooling (GAP) layer is fed to the dense layer to identify the discriminative Region of Interest (ROI) localized to classify the inputs into their related classes. For visualizing the explanations, Selvaraju et al. [6] proposed the Gradient-weighted Class Activation Mapping (Grad-CAM) technique, which highlights the regions of interest to provide explanations for the predictions. Grad-CAM is a generalization of CAM. It provides a visual explanation for the connected neural network while performing detection. It uses the gradient information of the expected class, flowing back into the deepest convolutional layer, to generate explanations. Grad-CAM can be applied to any of the convolutional layers while the predicted label is calculated using the complete model; the last convolutional layer is most commonly used. Figure 4 presents Grad-CAM highlighting the highly localized regions of interest for the pneumonia-positive class within the X-ray images when using the proposed model. Grad-CAM can provide pieces of evidence and signs to support the diagnosis. A hedged sketch of this computation follows.
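The chapter does not include the Grad-CAM implementation itself; the following is a minimal sketch of the standard Grad-CAM computation in TensorFlow/Keras, assuming a trained Keras model and the (hypothetical) name of its last convolutional layer.

import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index=None):
    # Build a model that maps the input to the last conv feature maps and the predictions.
    grad_model = tf.keras.models.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]
    # Gradients of the class score w.r.t. the last convolutional feature maps.
    grads = tape.gradient(class_score, conv_out)
    # Channel-wise importance weights (global average of the gradients).
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    # Weighted combination of the feature maps, followed by ReLU and normalization.
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)
    cam = tf.nn.relu(cam)
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()   # heatmap in [0, 1], to be resized and overlaid on the CXR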
Fig. 4 An explanatory diagram of Grad-CAM. Features that are used for pneumonia detection get highlighted in the class activation map

Table 2 A brief summary of our experimental results

Method                                         | Accuracy (%) | AUC (%) | Time (seconds)
VGG16 [3]                                      | 82.37        | –       | 3127
VGG19 [3]                                      | 84.78        | –       | 2508
ResNet50 [2]                                   | 69.87        | –       | 3050
MobileNetV2 [4]                                | 63.14        | –       | 1519
Our proposed method with data augmentation     | 94.19        | 97.77   | 2698
Our proposed method without data augmentation  | 96.94        | 99.18   | 2633
5 Results

5.1 Model Evaluation

A 10-fold cross-validation is performed to estimate the feasibility of using CXR images to diagnose pneumonia with the proposed method. The performance is assessed by the average accuracy and AUC over the 10 folds. A sketch of such an evaluation loop is shown below.
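As an illustration of this protocol, the sketch below shows a stratified 10-fold loop with scikit-learn that averages accuracy and AUC; the fold construction details used by the authors are not given, so this is only an assumed setup with hypothetical arrays "images" and "labels".

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score

def cross_validate(build_model, images, labels, n_splits=10, epochs=20):
    # Average accuracy and AUC over a stratified 10-fold split.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    accs, aucs = [], []
    for train_idx, test_idx in skf.split(images, labels):
        model = build_model()                      # e.g. build_proposed_cnn() above
        y_train = np.eye(2)[labels[train_idx]]     # one-hot labels for categorical loss
        model.fit(images[train_idx], y_train, epochs=epochs, verbose=0)
        probs = model.predict(images[test_idx])
        preds = probs.argmax(axis=1)
        accs.append(accuracy_score(labels[test_idx], preds))
        aucs.append(roc_auc_score(labels[test_idx], probs[:, 1]))
    return np.mean(accs), np.mean(aucs)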
5.2 Experimental Results

The experiments (with the results shown in Table 2) were run on an Ubuntu 18.04 server with approximately 20 CPU cores and 64 GB of RAM. The pre-trained architectures were first loaded and then fine-tuned for this task.
Table 3 Comparison of our proposed method with other methods

Research                  | Accuracy (%) | AUC (%) | Time (seconds)
Ayan E. et al. [9]        | 84.50        | 87.00   | 4980
Liang G. et al. [16]      | 90.05        | –       | –
Sharma H. et al. [5]      | 90.68        | –       | –
Sirish V.K. et al. [13]   | 92.31        | –       | –
Stephen O. et al. [7]     | 93.73        | –       | –
Raheel S. et al. [15]     | 94.30        | –       | –
Saraiva A. et al. [14]    | 94.40        | 94.50   | –
Rajaraman S. et al. [17]  | 96.20        | 99.00   | –
Chouhan V. et al. [18]    | 96.39        | 99.34   | –
Our proposed method       | 96.94        | 99.18   | 2633
Fig. 5 Activation maps for chest X-rays of two cases (normal and pneumonia) corresponding to the best and worst models
The experimental results of the proposed CNN architecture, with 32 filters in the first layer and the Adam optimizer, reach an average accuracy of 96.94% and an average AUC of 99.18%, with an average running time of 2606 s for the experiment. Each run of the experiment uses 20 epochs. Our proposed method achieves better results compared to various existing methods on the same considered dataset [7]. All the results mentioned are reported by the respective authors in their studies (as summarized in Table 3).
5.3 Comparison of Explanations of the Results Using Grad-CAM

Using the Grad-CAM method, representative examples of attention heatmaps were generated for community-acquired pneumonia and non-pneumonia cases.
Fig. 6 The proposed method's performance through epochs
The heatmaps are standardized and overlaid on the original image. The red color highlights the activation area associated with the predicted class. Figure 5 exhibits the activation maps comparing the best architecture (our proposed architecture, which gained the highest accuracy as shown in Table 2) and the worst model (MobileNetV2, as shown in Table 2). These activation maps help localize the regions in the image that are most indicative of pneumonia. The activation maps are computed from the output of the last convolutional layer. In the case of the best model, the presence of pneumonia in the abnormal lung is predicted more correctly than with the worst model, even when it manifests as a more diffuse interstitial pattern in both lungs. The right CXR image of the patient exhibits bilateral patchy ground-glass opacities around the right mid to lower lung zone, as well as multi-focal patchy consolidations. With respect to the worst model, however, one representative example of a normal case highlights incorrect regions near the esophagus, and in another misclassified pneumonia case the highlighted signs are located under the arm. Therefore, the activation maps of our proposed model can be used to monitor affected lung areas during disease progression.
5.4 Performance Analysis

The performance is measured with standard metrics that evaluate the training efficiency of the above model, such as Accuracy (ACC), Area Under the Curve (AUC), and Loss, which are reported in Fig. 6. Accuracy is the ratio of the number of correct predictions to the total number of predictions. A True Positive (TP) occurs when a pneumonia-infected person is detected as having pneumonia. A True Negative (TN) occurs when
Fig. 7 Confusion matrix visualization of our proposed method
Fig. 8 The web interface is designed to be a simple tool that provides diagnostic information as well as a visual explanation of the prediction
a person is correctly detected as a normal control. A False Positive (FP) represents an incorrect detection where a normal person is detected as positive for pneumonia, and a False Negative (FN) represents an incorrect detection where a person infected with pneumonia is detected as a normal one. The resulting confusion matrix is shown in Fig. 7, and the sketch below illustrates how these quantities can be derived from it.
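As a small worked illustration (not code from the chapter), the snippet below derives ACC and the related rates from a confusion matrix with scikit-learn; the label convention (0 = normal, 1 = pneumonia) and the toy label vectors are assumptions.

from sklearn.metrics import confusion_matrix

# y_true / y_pred are hypothetical label vectors (0 = normal, 1 = pneumonia).
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall for the pneumonia class
specificity = tn / (tn + fp)   # recall for the normal class

print(f"ACC={accuracy:.3f}, sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")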
5.5 The Chest X-Ray-Based Pneumonia Diagnosis System

We present the primary interface of the CXR-based pneumonia diagnosis system, whose function is to diagnose pneumonia automatically from CXR images, as shown in Fig. 8. To produce an automated pneumonia diagnosis, the user may select the pneumonia diagnosis function. This function allows users to upload a CXR image; after a few seconds it displays the result on screen, including an explanation that marks the notable positions in the image as well as the prediction confidence. The system stores the history of the user's health records to support future searches.
6 Conclusion

In this work, we propose a CNN architecture combined with a model interpretation technique to detect pneumonia. Some augmentation-based techniques are also deployed to improve the performance. The proposed CNN architecture has achieved a promising result with an average accuracy of 96.94% and an average AUC of 99.18%, outperforming several famous pre-trained models. The trained CNN model was then fed into Grad-CAM to provide visual explanations which can be used to support doctors in making diagnostic decisions. In an environment with limited equipment, inexperienced physicians can leverage the proposed system to support diagnostic decision making. The system is expected to improve diagnostic accuracy and speed up the diagnosis process. Due to the issue of imbalanced classes in the collected data, some oversampling techniques should be explored to improve the performance. Further research may develop methodologies to integrate additional data to improve pneumonia diagnosis accuracy based on medical images, as well as to make predictions from the patient's history.
References
1. World Health Organization. Coronavirus disease Situation Report–150. https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports. September 2020
2. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016). https://doi.org/10.1109/cvpr.2016.90
3. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of International Conference on Learning Representations (ICLR 2015) (2015). http://arxiv.org/abs/1409.1556
4. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018). https://doi.org/10.1109/cvpr.2018.00474
5. Sharma, H., Jain, J.S., Bansal, P., Gupta, S.: Feature extraction and classification of chest X-Ray images using CNN to detect pneumonia. In: Proceedings of the International Conference on Cloud Computing, Data Science & Engineering, pp. 227–231 (2020). https://doi.org/10.1109/confluence47617.2020.9057809
6. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128, 336–359 (2016), ISSN: 2380-7504. https://doi.org/10.1007/s11263-019-01228-7
7. Stephen, O., Sain, M., Maduh, U.J., Jeong, D.: An efficient deep learning approach to pneumonia classification in healthcare. J. Healthc. Eng. 2019, 1–7 (2019), ISSN: 2040-2295. https://doi.org/10.1155/2019/4180949
8. Liu, J., Yang, S., Huang, H., Li, Z., Shi, G.: A deep feature manifold embedding method for hyperspectral image classification. Remote Sens. Lett. 11, 620–629 (2020), ISSN: 2150-7058. https://doi.org/10.1080/2150704X.2020.1746855
9. Ayan, E., Unver, H.M.: Diagnosis of pneumonia from chest X-Ray images using deep learning. In: Proceedings of the Scientific Meeting on Electrical-Electronics and Biomedical Engineering and Computer Science, pp. 1–5 (2019). https://doi.org/10.1109/EBBT.2019.8741582
10. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1800–1807 (2017). https://doi.org/10.1109/CVPR.2017.195
11. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929 (2016). https://doi.org/10.1109/CVPR.2016.319
12. Kasper, D.L., Fauci, A., Hauser, S., Longo, D., Larry Jameson, J., Loscalzo, J.: Harrison's principles of internal medicine (19th edition). McGraw Hill Professional (2015). ISBN 978-007-180216-1
13. Kaushik, V.S., Nayyar, A., Kataria, G., Jain, R.: Pneumonia detection using convolutional neural networks. In: Proceedings of First International Conference on Computing, Communications, and Cyber-Security, Lecture Notes in Networks and Systems, vol. 121, pp. 471–483 (2019), ISSN: 2367-3370. https://doi.org/10.1007/978-981-15-3369-3_36
14. Saraiva, A.A., Santos, D.B.S., Costa, N.J.C., Sousa, J.V.M., Ferreira, N.M., Valente, A., Soares, S.: Models of learning to classify X-ray images for the detection of pneumonia using neural networks. Int. Conf. Bioimaging 2, 76–83 (2019). https://doi.org/10.5220/0007346600760083
15. Raheel, S.: Automated pneumonia diagnosis using a customized sequential convolutional neural network. In: Proceedings of the International Conference on Deep Learning Technologies, pp. 64–70 (2019). https://doi.org/10.1145/3342999.3343001
16. Liang, G., Zheng, L.: A transfer learning method with deep residual network for pediatric pneumonia diagnosis. Comput. Methods Programs Biomed. 187 (2020), ISSN: 0169-2607. https://doi.org/10.1016/j.cmpb.2019.06.023
17. Rajaraman, S., Candemir, S., Kim, I., Thoma, G., Antani, S.: Visualization and interpretation of convolutional neural network predictions in detecting pneumonia in pediatric chest radiographs. Appl. Sci. 8 (2018), ISSN: 2076-3417. https://doi.org/10.3390/app8101715
18. Chouhan, V., Singh, S.K., Khamparia, A., Gupta, D., Tiwari, P., Moreira, C., Damasevicius, R., Albuquerque, V.H.C.: A novel transfer learning based approach for pneumonia detection in chest X-ray images. Appl. Sci. 10 (2020), ISSN: 2076-3417. https://doi.org/10.3390/app10020559
19. https://www.stwhospice.org/breathlessness-management. Accessed 1 Sep 2020
20. https://undergradimaging.pressbooks.com/chapter/approach-to-the-chest-x-ray-cxr/. Accessed 1 Sep 2020
Improving 3D Hand Pose Estimation with Synthetic RGB Image Enhancement Using RetinexNet and Dehazing Alysa Tan, Bryan Kwek, Kenneth Anthony, Vivian Teh, Yifan Yang, Quang H. Nguyen, Binh P. Nguyen, and Matthew Chin Heng Chua
Abstract Hand pose estimation has recently attracted increasing research interest, especially with the advance of deep learning. Despite many successes, the current state of research still presents opportunities for improving estimation accuracy. This paper presents several image enhancement techniques to embed with existing deep learning architectures to improve the performance of hand pose estimation. In particular, we propose a preprocessing approach for image data using a low-light illuminance model or a dehazing algorithm before passing the image data to a hand pose estimation model. Both preprocessing methods are evaluated on a rendered hand-pose dataset using different evaluation metrics. The experimental results show success in boosting the performance of hand pose estimation for both 2D and 3D image data.
1 Introduction

In recent years, hand pose estimation has grown in popularity as a research topic, given its widespread applications in areas such as sign language recognition and augmented/virtual/mixed reality (AR/VR/MR) systems [1–3]. With the success of convolutional neural network (CNN) and generative adversarial network (GAN) implementations, as well as access to more powerful computational resources, researchers are beginning to tackle the hand pose estimation problem using deep learning (DL) approaches, transitioning from traditional lookup and tree-based algorithms that have less effective learning capabilities [4]. The use of deep learning algorithms also renders specialized devices such as stereo and depth cameras and/or wired gloves unnecessary [5], thereby broadening the application domain. Zimmermann and Brox [6] pioneered the application of DL to estimate the 3D position of a hand pose from single 2D RGB images. Recognizing the challenge posed by the lack of depth information in RGB (red-green-blue) data, as well as infinite camera angles and positions, the authors introduced a Rendered Hand-pose Dataset (RHD) along with their research, in which synthetic images of 3D human poses are rendered as Blender (https://www.blender.org/) character models onto 2D images, with randomised backgrounds sourced from Flickr (https://www.flickr.com/). Their research was pivotal in the hand pose estimation domain, and this dataset has been used by several studies for benchmarking thereafter. Utilizing the RHD dataset, Zimmermann and Brox [6] trained and developed a 3-stage DL pipeline to predict and estimate 3D hand poses. The stages are HandSegNet, the first stage, which segments the hand from a larger full-body image; PoseNet, the second stage, which predicts 21 key points in 2D space; and PosePriorNet, the third and final stage, which lifts the 21 key points into 3D space. Performance metrics were reported in several ways, namely Mean End Point Error (EPE), Median EPE and AUCs. EPE estimates the difference between the ground truth and the model's prediction for the 21 hand key points. This is done at different stages of the entire pipeline. The paper [6] published its results based on training, testing and fine-tuning on the well-established Stereo Benchmark Dataset (STB), after initial development and testing with RHD. This research is benchmarked with results obtained with RHD. The authors were able to achieve AUCs of 66.3% for (HandSegNet + PoseNet), 54.9% for (PosePriorNet) and 60.3% for (HandSegNet + PoseNet + PosePriorNet). These values, which were documented partially in their paper [6] as well as in their published codebase [7], are identified as the benchmarks against which our research is matched. Detailed evaluations of the AUC and EPE metrics are available in Sect. 3. The focus of this paper is to experiment with various image enhancement and preprocessing techniques and propose a suitable model architecture that yields the best improvement in hand pose estimation performance. The enhancements are primarily based on low-light image illuminance and image dehazing or denoising.
Experiments were conducted iteratively to improve performance and support the hypotheses. It is important to note the testing strategy employed in the original research. The authors elected to use a subject split [6] as opposed to a conventional random split of 80/20 or 70/30. This means that of the 20 characters that were rendered, 16 were used for training and 4 for testing. This prevents the models from overfitting to feature attributes that correspond to particular subjects, which would possibly yield poorer results on real-world data. It is a far more robust strategy, as it allows the pose estimation to be subject-agnostic and operable in an actual deployment setting. For a fair comparison, the same split methodology is adopted in this research; a sketch of such a split is shown below.
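The following minimal sketch illustrates a subject-level split of the kind described (16 training characters, 4 test characters); the mapping from sample index to character ID is hypothetical, since RHD's metadata layout is not detailed in the text.

import numpy as np

def subject_split(subject_ids, train_subjects=16, seed=0):
    # Split sample indices so that no character appears in both sets.
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    rng.shuffle(subjects)
    train_set = set(subjects[:train_subjects])
    train_idx = [i for i, s in enumerate(subject_ids) if s in train_set]
    test_idx = [i for i, s in enumerate(subject_ids) if s not in train_set]
    return train_idx, test_idx

# Hypothetical example: 20 rendered characters, several samples per character.
subject_ids = np.repeat(np.arange(20), 5)
train_idx, test_idx = subject_split(subject_ids)
print(len(train_idx), "training samples,", len(test_idx), "test samples")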
2 Approach

Synthetic images may not be as realistic as real-life images due to insufficient grey levels and the lack of subtle colour differences such as natural skin tones. Random backgrounds also introduce sharp contrasts between artificial characters and real-world backgrounds. Furthermore, 3D Blender-rendered characters may exhibit unnatural boundaries between skin (hand/fingers) and backgrounds. This may degrade the transferability of models built on synthetic images to real-world inference. A hypothesis was devised to refine these synthetic images using ideas applied to enhancing real-world photographs: if these images are enhanced similarly, model estimation may improve. Before embarking on image enhancements, the raw datasets were put through the original models to establish baseline performance and provide initial comparisons. Results were replicable for the RHD dataset. By visually examining the RHD dataset, it was evident that many image samples had low brightness and/or low contrast. Hence, some algorithmic image techniques were implemented, such as various contrast enhancement algorithms. Next, experiments were conducted with image aesthetic and resolution enhancement GANs [8–10]. Initial results from these attempts showed modest improvements, albeit small. However, it is important to note that while these enhancements did improve the 2D stage of the pipeline, they performed worse for 3D lifting. After the initial experimentation, the focus was shifted to image pre-processing to increase illuminance and possibly also denoise. Figure 1 illustrates a sample from the RHD training data, which exhibits both low contrast and low illuminance.
2.1 Low Light Enhancement (RetinexNet)

A recent study by Wei et al. [11] has suggested that the visibility of images can be significantly degraded due to insufficient lighting during image capture and that the lost details and low contrast will result in poorer performance of many computer
Fig. 1 RHD RGB training sample no. 116. (a) Original; (b) RetinexNet output (original); (c) RetinexNet output (1st retrain); (d) RetinexNet output (2nd retrain)
Fig. 2 RetinexNet original training sample. (a) Normal light; (b) Low light
vision systems designed for normal-light images. As a result, [11] developed a deep learning model known as RetinexNet, based on Retinex vision theory, to illuminate dark photos. According to [12], Retinex theory describes the human physiology of colour perception based on the intensity of surrounding colour casts. A retina-and-cortex system (retinex) may perceive a colour as a code reported from the retina, which correlates with the reflectance of objects but is independent of the flux of radiant energy [12]. In RetinexNet [11], a CNN model was trained with a large dataset of paired photographs (low and normal light; see Fig. 2 for an example). Those photos mainly depict indoor and outdoor environments such as bedrooms, courtyards and streets. The entire pipeline of the model consists of three stages: Decom-Net, Enhance-Net, and Reconstruction. The first stage decomposes an image into illuminance and reflectance. The second stage enhances the illuminance output using an autoencoder network; a denoising operation is also conducted on the reflectance output. Lastly, the third stage reconstructs the image into a final output [11]. Some recent and similar research includes Retinex-GAN [13], which achieved output similar to [11], and another CNN model from Chen et al. [14] which illuminates low-light photos based on raw camera image data. Utilizing the original model weights in [11], the RHD images, such as the one depicted in Fig. 1a, were enhanced. These pre-processed images are then used to retrain the entire 3D hand pose pipeline while maintaining the initial parameters and the training and evaluation sizes. Figure 1b illustrates an output of RetinexNet for the same image.
Fig. 3 New training pair sample for RetinexNet. (a) Original; (b) Random reduction in brightness and contrast
Initial results showed a slight improvement and a reduction in EPEs. Improvements of 3–4% were observed in the 2D stages of the pipeline, while 3D lifting only achieved an increase of 0.1–0.2%. However, this is a relative improvement over the initial experimentation with resolution enhancement GANs. The results from the above experimental approaches are all tabulated and visualised in the Results section. The next hypothesis was that if RetinexNet were trained with relevant hand pose images, the contrast of hand texture against the background might allow the 3D model to better estimate the hand poses (2D and 3D). In order to test this hypothesis, the Large-scale Multiview 3D Hand Pose dataset [16] is used. However, this dataset did not contain any low-light samples. After analysing the variety of images available, 600 images were selected and corresponding synthetic pairs were generated with random brightness and a constant 10% reduction in contrast, as sketched below. Figure 3 shows a paired example. After retraining RetinexNet with this new set of 1200 images (600 × 2), new weights were obtained and utilized to process the RHD dataset once again. Figure 1c illustrates the output of RetinexNet after retraining. Evaluations indicate worse results on the 2D metrics (a decrease of 1–2%) than those previously processed with the original weights. Interestingly, a reduction in the 3D EPE metrics was observed, and the AUC scores for 3D somewhat increased. The improvement margins were not large enough to warrant any significance. However, this does suggest that retraining with relevant hand pose images may help in lifting 3D estimations despite suffering losses in 2D. The final hypothesis was to retrain RetinexNet with hand-segmented images against real-world backgrounds. The idea was to focus the contrast information solely on the hand area that is relevant to the final inference and estimation. Fortunately, the same dataset [16] provided these augmented synthetic images. Once again 600 images were selected. To reduce variability in experimentation, the same corresponding images were selected, though their backgrounds were randomised. Figure 4 illustrates the same pose as in Fig. 3, segmented and layered over another random background.
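The exact image operations used to synthesize the low-light halves of the training pairs are not specified; the sketch below is one plausible OpenCV implementation, assuming a linear gain/offset model in which contrast is scaled by 0.9 (a constant 10% reduction) and brightness is reduced by a random offset.

import cv2
import numpy as np

def make_low_light_pair(image_path, rng=np.random.default_rng(0)):
    # Return (original, degraded) images for RetinexNet retraining.
    original = cv2.imread(image_path)
    # alpha scales contrast (constant 10% reduction), beta shifts brightness
    # (random reduction); both values are assumptions, not taken from the paper.
    alpha = 0.9
    beta = -float(rng.integers(20, 80))
    degraded = cv2.convertScaleAbs(original, alpha=alpha, beta=beta)
    return original, degraded

orig, dark = make_low_light_pair("hand_sample.jpg")  # hypothetical file name
cv2.imwrite("pair_normal.jpg", orig)
cv2.imwrite("pair_low.jpg", dark)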
Fig. 4 New training pair sample for RetinexNet (segmented). (a) Original; (b) Random reduction in brightness and contrast
After retraining RetinexNet a second time with these new synthetic augmented images, the RHD dataset was processed again. Figure 1d illustrates the output of RetinexNet after this retraining. The final evaluations were encouraging. EPE estimation did not suffer a significant loss, with no more than approximately a 1 mm increase in error compared to RetinexNet's original weights, and remained consistently lower than the benchmark. There was a 1–2% increase in AUC scores compared to the previous retraining (Retrain 1). AUC scores increased across the board compared to the benchmark initially identified: 70.8% for (HandSegNet + PoseNet), 57.7% for (PosePriorNet) and 62.1% for (HandSegNet + PoseNet + PosePriorNet). These results confirm that this image preprocessing approach (illuminance enhancement with RetinexNet trained on hand pose data) does improve 3D hand pose estimation performance for the identified paper [6]. The results from the above approaches are all tabulated and visualised in the next section. As a graphical summary, Fig. 5 illustrates the architecture pipeline adopted for this approach.
2.2 Dark Channel Prior Dehazing

It is suggested in [17] that the irradiance the camera receives from a scene point decays along the line of sight, and that the incoming light is blended with ambient light by atmospheric particles. Consequently, images degrade in contrast and fidelity, and the degradation is spatially variant given the correlation between the amount of scattering and the distance from the camera to the scene point. This explains why haze removal (or dehazing) is highly desired in consumer and computational photography and computer vision applications, as it significantly increases the visibility of the scene and corrects the airlight-caused colour shift. A sketch of the dark channel computation underlying this method is shown below.
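As an illustration of the core idea in [17] (not the authors' own code), the sketch below computes the dark channel and a crude atmospheric-light estimate with OpenCV/NumPy; the 15-pixel patch size and the top-0.1% selection are common choices in the dehazing literature and are assumptions here, as is the file name.

import cv2
import numpy as np

def dark_channel(image_bgr, patch_size=15):
    # Per-pixel minimum over colour channels, then a local minimum filter.
    min_channel = np.min(image_bgr, axis=2)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch_size, patch_size))
    return cv2.erode(min_channel, kernel)

def estimate_atmospheric_light(image_bgr, dark):
    # Average the brightest 0.1% of pixels in the dark channel.
    flat_dark = dark.ravel()
    n_top = max(1, flat_dark.size // 1000)
    idx = np.argsort(flat_dark)[-n_top:]
    flat_img = image_bgr.reshape(-1, 3).astype(np.float64)
    return flat_img[idx].mean(axis=0)

img = cv2.imread("rhd_sample.png")          # hypothetical RHD image path
dc = dark_channel(img)
A = estimate_atmospheric_light(img, dc)
print("Estimated atmospheric light (BGR):", A)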
Fig. 5 Proposed architecture pipeline (RetinexNet + 3D Hand Pose)
In [17], the authors use a prior learned from the statistical distribution of haze-free images to enhance/dehaze new images by estimating the atmospheric light found in dark channels. Similar work was also done by Kim J. et al. [18], but with an additional cost-minimisation function to reduce information loss. Because the RHD images were prepared with randomised real-world backgrounds sourced from Flickr, it was theorized that they may suffer from such image-capture haze effects. Furthermore, dehazing synthetic character images layered artificially on random backgrounds may increase the contrast and colour constancy, which may aid the performance of computer vision. The algorithm described in this research was applied to the entire RHD dataset. However, visual checking indicates little difference between the output and the original images. A histogram intersection plot was used to compare the intensity distributions of the original image (red) and the image after dehazing (blue), which clearly demonstrates that there are indeed changes to the image after processing. The histogram correlation, computed with the comparison function provided by OpenCV, yielded a value of 0.902 (Fig. 6); a sketch of this comparison is given below.
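A minimal sketch of such a check with OpenCV is shown below; the greyscale histogram with 256 bins is an assumption, since the exact channels and bin counts used for Fig. 6 are not stated, and the file names are hypothetical.

import cv2

def histogram_correlation(path_a, path_b, bins=256):
    # Compare two images' intensity histograms with OpenCV's correlation metric.
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    hist_a = cv2.calcHist([img_a], [0], None, [bins], [0, 256])
    hist_b = cv2.calcHist([img_b], [0], None, [bins], [0, 256])
    cv2.normalize(hist_a, hist_a)
    cv2.normalize(hist_b, hist_b)
    return cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CORREL)

# Hypothetical file names for an RHD sample before and after dehazing.
score = histogram_correlation("rhd_original.png", "rhd_dehazed.png")
print(f"Histogram correlation: {score:.3f}")   # the paper reports 0.902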
Fig. 6 Original and Dehaze output histogram
Fig. 7 Proposed architecture pipeline (Dehazing + 3D Hand Pose)
After retraining the 3D hand pose model with the dehazed images, the evaluations showed a consistent improvement in every metric, for both EPEs and AUCs. A reduction in EPE of approximately 1–2 was observed across both the 2D and 3D metrics. AUC scores of 70.6% for (HandSegNet + PoseNet), 59.7% for (PosePriorNet) and 62.3% for (HandSegNet + PoseNet + PosePriorNet) were higher than the benchmark in every regard. Detailed results and comparisons are presented below. It is evident that dehazing has a favourable effect when pre-processing synthetic images for 2D and 3D hand pose estimation. While the changes may not be visible to the human eye, dehazing perhaps produces a denoising effect and increases the contrast of a synthetic pose set against a real-world background, thus refining the features so that a deep learning model can estimate better. Figure 7 illustrates the pre-processing steps taken in the pipeline before an image is used for retraining or for inference during evaluation.
Fig. 8 Qualitative sample visualization
Fig. 8 panels (sample image_num 1082):
Ground Truth: 2D EPE 3.53 pixels, AUC 0.88; 3D EPE 18.8 mm, AUC 0.62
Retinex1: 2D EPE 4.83 pixels, AUC 0.84; 3D EPE 24.41 mm, AUC 0.51
Retinex2: 2D EPE 3.88 pixels, AUC 0.87; 3D EPE 17.9 mm, AUC 0.65
Dehazing: 2D EPE 3.98 pixels, AUC 0.86; 3D EPE 25.97 mm, AUC 0.49
3 Experimental Results and Discussion

This paper adopts the same testing strategy as in [6]. There are a total of 43,986 images, of which 41,258 were used for training and 2,728 for evaluation. This split was by subject/character, where no single subject exists in both the training and evaluation sets. This prevents the model from overfitting to particular subjects, which could make it infer poorly on new real-world data. Furthermore, the raw data images were unlabelled by subject, so it was impossible to recalibrate the train/test split, and manually sorting the data by subject was impractical due to the size of the dataset. Cross-validation was also deemed inappropriate, since benchmarking was against the original authors' results and it would most likely yield results that are unfair and skewed. Keeping the experimentation parameters identical was highly important to prevent the introduction of unintended bias and to allow accurate comparisons. Lastly, the originally published results in [6] were based on the STB dataset. Unfortunately, due to time constraints and the considerable amount of time needed for data preprocessing and preparation, this is left out of this study and deferred to future research. The next section reports the results from the above experimental approaches.
Table 1 Evaluation results

Approach: Low Light Enhancement (RetinexNet)

Metric     | Model                  | Ev2DGT | Ev2D  | Ev3D  | Eval Full (0–50 mm) | Eval Full (20–50 mm)
Mean EPE   | Original               | 9.13   | 17.04 | 24.25 | 35.60 | –
Mean EPE   | RetinexNet (Original)  | 7.76   | 16.21 | 22.68 | 35.87 | –
Mean EPE   | RetinexNet (Retrain 1) | 8.61   | 16.81 | 21.96 | 36.59 | –
Mean EPE   | RetinexNet (Retrain 2) | 7.83   | 16.01 | 22.61 | 35.20 | –
Median EPE | Original               | 4.99   | 5.83  | 20.84 | 28.68 | –
Median EPE | RetinexNet (Original)  | 4.12   | 4.61  | 19.46 | 28.76 | –
Median EPE | RetinexNet (Retrain 1) | 4.76   | 5.21  | 18.84 | 28.82 | –
Median EPE | RetinexNet (Retrain 2) | 4.19   | 4.65  | 19.62 | 27.65 | –
AUC        | Original               | 0.724  | 0.663 | 0.549 | 0.424 | 0.603
AUC        | RetinexNet (Original)  | 0.766  | 0.707 | 0.577 | 0.425 | 0.606
AUC        | RetinexNet (Retrain 1) | 0.741  | 0.686 | 0.588 | 0.425 | 0.604
AUC        | RetinexNet (Retrain 2) | 0.764  | 0.708 | 0.577 | 0.437 | 0.621

Approach: Dark Channel Prior Dehazing

Metric     | Model    | Ev2DGT | Ev2D  | Ev3D  | Eval Full (0–50 mm) | Eval Full (20–50 mm)
Mean EPE   | Original | 9.13   | 17.04 | 24.25 | 35.60 | –
Mean EPE   | Dehazing | 7.59   | 16.43 | 21.50 | 34.84 | –
Median EPE | Original | 4.99   | 5.83  | 20.84 | 28.68 | –
Median EPE | Dehazing | 4.01   | 4.63  | 18.24 | 27.63 | –
AUC        | Original | 0.724  | 0.663 | 0.549 | 0.424 | 0.603
AUC        | Dehazing | 0.771  | 0.706 | 0.597 | 0.438 | 0.623
Table 2 Qualitative sample results

Model                   | 2D EPE | 2D AUC | 3D EPE  | 3D AUC
Ground Truth            | 3.528  | 0.833  | 18.797  | 0.618
RetinexNet (Retrain 1)  | 4.831  | 0.841  | 24.41   | 0.506
RetinexNet (Retrain 2)  | 3.877  | 0.873  | 17.899  | 0.649
Dehazing                | 3.98   | 0.863  | 25.97   | 0.486
The detailed scores under the different evaluation metrics for each approach are documented. Scores from the different evaluation types and scopes listed below are also compared. In Table 1, scores are grouped by metric (rows) and by each individual evaluation type (columns), and the best-performing result in each group was highlighted. In addition, some qualitative sample outputs for each approach are provided in Table 2 and Fig. 8. A single sample is for illustrative purposes and is not representative of the performance of the entire model or the overall scores.
• Ev2DGT - HandSegNet + PoseNet (ground truth)
• Ev2D - HandSegNet + PoseNet
• Ev3D - PosePriorNet
• EvFull - HandSegNet + PoseNet + PosePriorNet
Research works that deal with RHD, STB and 3D hand pose estimation are plentiful. Some notable ones include [19–24]. However, directly comparing performance with these works would be neither equitable nor constructive, as few studies involve enhancing synthetic images for training a 3D hand pose estimation model. On the other hand, this serves as a favourable indication that the approach and findings in this paper are novel and innovative.
4 Conclusions and Future Work

This research has demonstrated two approaches to enhancing synthetic images to improve 3D hand pose estimation. It was able to achieve accuracy improvements of 62.1% with low-light illuminance and 62.3% with dehazing, higher than the benchmark paper's original accuracy of 60.3% [6]. It is also important to note from the observations in the results that the approach consistently reduces EPEs and increases AUC scores across the different evaluation types. Another valuable observation is that different variations of the approaches may improve different aspects of the pipeline. For example, RetinexNet Original and Retrain 1 may improve 2D estimation but worsen 3D estimation, whereas RetinexNet Retrain 2 improves 3D estimation but does not bolster 2D estimation. The first direction for future work is to evaluate the existing approaches on the STB dataset, which will lead to comparative results with the published results in [6], as well as an evaluation of the model's transferability to different images. The survey in [25] provides a good summary of freely available datasets related to the hand pose domain. Secondly, it is possible to apply the RetinexNet image processing after hand segmentation. This would test the hypothesis that it performs better, which we expect to be rejected since Retinex theory hinges on the perception of colour based on surrounding constancy; after hand segmentation, much of this background information is removed and thus there is a significant loss of information.
The third possibility is to experiment with other illuminance algorithms or GAN models such as EnlightenGAN [26]. This would further support this research's findings on low-light illuminance enhancement of synthetic images for improving hand pose estimation. Lastly, synthetically generated images may not reflect real-world hand anatomy: hand and finger boundaries in real-world images are not artificially smooth and contiguous in nature. There has been work using GAN models to generatively refine artificial images based on real-world samples, and it is worthwhile to investigate whether such refinement improves hand pose estimation. Acknowledgment This project is funded by the Tote Board Enabling Lives Initiative Grant and supported by SG Enable.
References 1. Wen, R., et al.: In situ spatial AR surgical planning using Projector-Kinect system. In: Proceedings SoICT 2013, pp. 164–171. ACM (2013) 2. Wen, R., et al.: Hand gesture guided robot-assisted surgery based on a direct augmented reality interface. Comput. Method Programs Biomed. 116(2), 68–80 (2014) 3. Nguyen, B.P., et al.: Robust biometric recognition from palm depth images for gloved hands. IEEE Trans. Human-Mach. Syst. 45(6), 799–804 (2015) 4. Shotton, J., et al.: Real-time human pose recognition in parts from a single depth image. In: CVPR 2011 (2011) 5. Wang, R.Y., Popovic, J.: Real-time hand-tracking with a color glove. ACM Trans. Graph. (TOG), 28(3), 1–8 (2009) 6. Zimmermann, C., Brox, T.: Learning to estimate 3D hand pose from single RGB images. In: Proceedings ICCV 2017, pp. 4903–4911. IEEE (2017) 7. Zimmermann, C., Brox, T.: ColorHandPose3D network (2017). https://github.com/lmbfreiburg/hand3d 8. Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings CVPR 2017, pp. 105–114. IEEE (2017) 9. Wang, X., et al.: ESRGAN: enhanced super-resolution generative adversarial networks. In: Proceedings ECCV 2018, pp. 63–79. Springer (2019) 10. Song, Y.: Single image super-resolution. Morris Undergraduate J. 6(1) (2018), University of Minnesota 11. Wei, C., et al.: Deep Retinex decomposition for low-light enhancement. In: Proceedings BMVC (2018) 12. Land, E.H.: The Retinex theory of color vision. Sci. Am. 237(6), 108–128 (1977) 13. Yangming, S., et al.: Low-light image enhancement algorithm based on Retinex and Generative Adversarial Network. In: Proceedings BMVC (2018) 14. Chen, C., et al.: Learning to see in the dark. In: Proceedings CVPR 2018, pp. 3291–3300. IEEE (2018) 15. Yuen, P.L., Chee, S.C.: Getting to know low-light images with the exclusively dark dataset. Comput. Vis. Image Underst. 178, 30–42 (2018) 16. Gomez-Donoso, F., et al.: Large-scale multiview 3D hand pose dataset. Image Vis. Comput. 81, 25–33 (2017) 17. He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2341–2353 (2011) 18. Kim, J.-H., et al.: Optimized contrast enhancement for real-time image and video dehazing. J. Visual Commun. Image Represent. 24, 410–425 (2013)
19. Yuan, S., et al.: RGB-based 3D hand pose estimation via privileged learning with depth images. In: Proceedings of Observing and Understanding Hands in Action Workshop, ICCV (2018) 20. Yang, L., Yao, A.: Disentangling latent hands for image synthesis and pose estimation. In: Proceedings CVPR 2019, pp. 9877–9886. IEEE (2019) 21. Panteleris, P., et al.: Using a single RGB frame for real time 3D hand pose estimation in the wild. In: Proceedings Winter Conference on Applications of Computer Vision (WACV 2018), pp. 436–445. IEEE (2018) 22. Spurr, A., et al.: Cross-modal deep variational hand pose estimation. In: Proceedings CVPR 2018, pp. 89–98. IEEE (2018) 23. Cai, Y., et al.: Weakly-supervised 3D hand pose estimation from monocular RGB images. In: Proceedings ECCV 2018, pp. 678–694. Springer (2018) 24. Ge, L., et al.: 3D hand shape and pose estimation from a single RGB image. In: Proceedings CVPR 2019, pp. 10825–10834. IEEE (2019) 25. Doosti, B.: Hand pose estimation: a survey. arXiv:1903.01013 (2019) 26. Jiang, Y., et al.: EnlightenGAN: deep light enhancement without paired supervision. arXiv:1906.06972 (2019)
Imbalance in Learning Chest X-Ray Images for COVID-19 Detection Dang Xuan Tho and Dao Nam Anh
Abstract We present a method that enables learning and recognizing symptoms of COVID-19 from chest X-ray images with a class balancing algorithm. Images are trained and tested with deep learning methods that extract initial image features. A probabilistic representation is used for all aspects of the learning objects: features, samples, classes and the related data modeling. An imbalance-based sample detector is used to discover a minority class in the class distribution. In learning, the samples of the minority class are analyzed and the imbalance issue is fixed. This is done with and without the use of SMOTE and SPY for class balancing. In recognition, an SVM is applied to classify the images. The value of addressing the imbalanced nature of the data with a solution combining VGG-16, SPY and SVM is demonstrated by excellent results over other parametric learning options. Keywords Chest X-ray image · COVID-19 · VGG-16 · SPY · SVM · Imbalance
1 Introduction

In data mining, if the number of elements occurring in one class is much greater than that of another, the data set is said to be imbalanced [1]. The term majority class refers to the class with the higher number of elements, while the minority class contains fewer elements [2]. Usually, the majority class is represented as negative and the minority class represents the positive elements [3, 4]. Data that has only two classes is called binary-class data, while data containing more than two classes is known as multi-class data. Multi-class problems can be converted to binary-class problems by using the strategy of one class versus the remaining classes.
In the real world, there are many problems with data imbalance, such as identifying oil slicks in satellite radar imagery, text categorization, spam filtering, financial fraud detection, identifying fake phone calls and, most importantly, medical diagnosis [5–9]. In such cases, the learning algorithms are dominated by the majority class elements, so the minority class elements are heavily misclassified. To solve problems related to data imbalance, various techniques have been proposed. In particular, problems with data imbalance can be found in medical image analysis, for instance when checking chest images for the diagnosis of lung diseases [10]. Due to the requirement of high prediction accuracy, convolutional neural networks can be applied in connection with class imbalance handling to solve the problem. In this study, a deep learning-based method is proposed to detect coronavirus-infected patients from X-ray images, combined with a data re-balancing method, to achieve better predictive performance. A sketch of such a pipeline is given below.
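As a rough illustration of this kind of pipeline (deep features, re-balancing, then classification), the sketch below combines VGG-16 features, SMOTE over-sampling and an SVM using scikit-learn and imbalanced-learn; SPY has no standard library implementation, so SMOTE stands in for the balancing step here, and the toy arrays are hypothetical placeholders for real chest X-ray data.

import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

# images: chest X-rays resized to 224x224x3; labels: 0/1 (hypothetical toy data).
images = np.random.rand(40, 224, 224, 3).astype("float32")
labels = np.array([0] * 30 + [1] * 10)          # imbalanced toy labels

# 1) Extract deep features with an ImageNet-pretrained VGG-16 backbone.
backbone = VGG16(weights="imagenet", include_top=False, pooling="avg")
features = backbone.predict(preprocess_input(images * 255.0))

# 2) Re-balance the minority class in feature space with SMOTE.
features_bal, labels_bal = SMOTE(random_state=0).fit_resample(features, labels)

# 3) Train an SVM on the balanced features.
clf = SVC(kernel="rbf", probability=True).fit(features_bal, labels_bal)
print("Training accuracy:", clf.score(features_bal, labels_bal))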
2 Related Work

Methods for dealing with data imbalance can be classified into categories depending on how they address the class imbalance [2–4]. The data-level approach preprocesses the data to rebalance the imbalanced classes and is therefore able to reduce the effect of the skewed distribution on the classification process. The resampling techniques used to preprocess unbalanced data can be divided into three categories: methods that randomly remove elements of the majority class to rebalance the data, methods that generate random elements, and methods that weight the minority class data. Random removal can lead to the loss of data that might be meaningful. The synthetic minority over-sampling technique (SMOTE) is a fairly well-known method. Here, new elements are created in the minority class by interpolating minority class elements that lie close together. SMOTE randomly selects one of the k-nearest neighbors (k-NN) of a minority class element and generates an artificial sample at a random point between the two elements [11]. In this way, SMOTE avoids the condensation problem, but it can create noisy samples. For such problems encountered in SMOTE, several filter-based methods, such as SMOTE-TL and SMOTE-E, are used to avoid noise in unbalanced datasets. The modified synthetic minority oversampling technique (MSMOTE) [12] is an improved form of SMOTE. By calculating the distances between all elements, the minority class is divided into three groups: potential noise elements, safe elements and border elements. MSMOTE removes noise elements based on the k-NN classification method when it produces artificial samples. However, it neither handles cases of hidden noise elements, nor gives preference to important borderline elements. To solve this problem, the combination of SMOTE and the IPF filter is used for noise handling and border adjustment [13]. SMOTE enhancements go further with more robust over-sampling techniques, such as the B1-SMOTE and B2-SMOTE extensions [14]. Other works combined SMOTE with logistic regression
(LR), the C5 decision tree model (C5), 1-nearest-neighbor search and several other well-known classifiers [15]. Gao et al. proposed using the pre-trained Inception-ResNet-v2 deep learning network to extract characteristics together with the SMOTE technique for the prediction of non-acne facial pigmentation [16]. Pandey et al. also used an 11-layer deep neural network model incorporating SMOTE artificial sampling to handle the minority class in a cardiovascular disease prediction problem, helping to reduce the mortality rate of heart patients [17]. A new sampling technique based on clustering was proposed in [18]. First, the majority class elements are divided into n clusters, where n is the number of elements of the minority class. Then, the data is balanced when the n centroids are used instead of the majority class elements. In addition, another proposed method (SPY) is based on analyzing elements on the class boundary and their neighbors, which are misclassified more often than others. In SPY, these majority class elements are known as spy elements [19]. The method changes the label of such majority class elements to the label of the minority class patterns in the training data. As a result, the number of elements of the majority class decreases, whereas the number of elements of the minority class increases. In the second category, the algorithm-level approach modifies existing classification algorithms, for example by considering weights for minority class data. The SVM algorithm has the limitation of being inefficient on unbalanced data; hence, various heuristic methods have been incorporated into the SVM model, including over-sampling, under-sampling and cost-sensitive learning, by Tang et al. [20]. A variation for decision trees is proposed by Sanz et al. [21], who improved an evolutionary fuzzy classification method for modeling and prediction in financial applications. Park et al. [22] introduced two general decision tree-based approaches to the problem of unbalanced data classification using features of α-divergences. Wu et al. [23] proposed improving the Random Forest algorithm to classify unbalanced text. Zieba et al. [24] proposed a method combining the AdaBoost algorithm with SVM to classify unbalanced data, minimizing the exponential error function, used to predict the life expectancy of lung cancer patients after surgery. Shao et al. [25] proposed improving SVM by using different weights for the classes to classify unbalanced data. Krawczyk et al. [26] proposed a branch-weight-sensitive decision tree for unbalanced data classification. The third, cost-sensitive, approach combines the data-level and algorithm-level approaches and integrates them into the learning process. The cost-sensitive learning method preserves the main AdaBoost learning framework and also includes cost items in the weight update formula; hence, the main difference among these proposals lies in how the weight update formula is improved. In the cost-sensitive group, the most typical approaches are AdaC1, AdaC2, AdaC3, CSB1, CSB2 and AdaCost [27, 28]. A method that combines algorithm-oriented and preprocessing approaches to effectively handle unbalanced data analysis problems is supported by Sun et al. [29]. Although sampling techniques have been successful in classifying unbalanced data, Lu et al. [30] also proposed a hybrid approach combining sampling techniques with Bagging.
3 The Method

Many deep learning methods could serve as a general framework for recognizing disease symptoms in chest X-ray images. For the purposes of this article, we have chosen a joint solution built around a deep learning method [31], given its long tradition and state-of-the-art performance in medical image analysis [32]. To refine the features and increase prediction accuracy, a technique for checking class imbalance [10] is applied to the features delivered by the deep learning model.
3.1 Feature by Deep Learning

In what follows, the notation x refers to a set of features extracted from a sample s by a deep learning method, and y denotes the class of the sample. In particular, in our study the sample is a chest X-ray image and the features are numerical. By Bayes' rule,

p(y|x) = p(x|y) p(y) / p(x)    (1)
To make explicit the fact that the set of features x is derived from a sample s by a learning method, it is useful to rewrite (1) in terms of conditional probabilities given s:

p(y|x, s) = p(x|y, s) p(y|s) / p(x|s)    (2)

A single sample s_i and the number of samples S describe the samples in vector form, s = {s_i}, i = 1, 2, ..., S, and p(x|s) in (2) expands, under conditional independence assumptions on the s_i, as

p(x|s) = ∏_i p(x|s_i)    (3)
Similarly, the likelihood p(x|y, s) depends on the training samples via the sum of probabilities in (4), and the prior p(y|s) from (2), in relation to the set of samples s_i, can be calculated by (5):

p(x|y, s) = ∑_i p(x|y, s_i)    (4)

p(y|s) = ∑_i p(y|s_i)    (5)
It is important to note that, in our method, the set of features x corresponds to the states of a deep learning network. Their values at time t can be expressed by (6).
x(t) ≡ (x_1(t), ..., x_X(t))    (6)
We can view this deep learning network as performing structured output prediction, with the input space consisting of the samples, i.e., the chest X-ray images, and the output being the final state of the network. Thus, the set of samples s with features x_j, j = 1, ..., X, is the initial state of the network at t = 0, given by (7), and the conditional probability p(x|s) corresponds to the final state of the network at t = T, given by (8):

x_j^0 = x_j(t = 0) ≡ s    (7)

x_j^T = x_j(t = T) ≡ p(x|s)    (8)
3.2 Feature Refining by Checking Class Imbalance

Now suppose that we are given a training set comprising deep learning features x ≡ (x_1, ..., x_X) obtained by (8) for N observed samples s, which are chest X-ray images, together with the corresponding observations of the binary COVID-19 class y, denoted y = (y^1, y^2)^T. To improve performance, it is instructive to check the training set for class imbalance and to remove it using balancing techniques [2]. The class imbalance problem arises when the conditional probabilities of the classes given the observations in training are not equal to each other, as in (9). We mark the minority class by y*, and since we consider binary classification (for example, COVID-19 positive and negative), condition (9) can be expressed as (10):

p(y^1|x) ≠ p(y^2|x)    (9)

p(y*|x) < 0.5    (10)
Having reviewed the synthetic minority over-sampling technique (SMOTE) [11] in Sect. 2, we use it to generate new samples x_gen of the minority class y* by interpolating minority-class samples that lie close together, as in (11). As a result of this way of handling class imbalance, the training data set is enlarged by (12), and the learning performance can be enhanced:

p(y*|x_gen) = 1    (11)

x := x ∪ x_gen    (12)
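As an illustration of (11) and (12), the SMOTE refinement can be sketched on deep-learning feature vectors. The snippet below uses the imbalanced-learn library as a stand-in for the SMOTE step; the library choice, the placeholder feature array and the label counts are our assumptions for illustration, not part of the original method.

```python
# Minimal sketch of (11)-(12): oversample the minority class of deep features with SMOTE.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 128))            # placeholder deep-learning features, 30 samples
y = np.array([1] * 12 + [0] * 18)         # imbalanced labels: 12 positive, 18 negative

smote = SMOTE(k_neighbors=5, random_state=0)
x_aug, y_aug = smote.fit_resample(x, y)   # x := x ∪ x_gen, per (12)
print(x.shape, "->", x_aug.shape)         # minority class is brought up to 18 samples
```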
When the class imbalance problem is resolved with another resampling technique, such as SPY [19], samples of the minority class, denoted x_spy, are
first identified by (13). Then, majority-class samples x_knn that are neighbors of these x_spy samples are found by a k-NN-like procedure, as in (14). Finally, these samples are moved to the minority class y* by (15). Once the training data have been refined by SPY, supervised training on the refined data can proceed.

p(y*|x_spy) = 1    (13)

p(y*|x_knn) = 0    (14)

p(y*|x_knn) := 1    (15)
We have thus explained that checking class imbalance after obtaining deep learning features is essentially a data refinement step for improving prediction performance. In the case study on COVID-19 classification of chest X-ray images, we focus on two methods for handling class imbalance: SMOTE and SPY.
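A simplified sketch of the SPY-style relabeling in (13)-(15) is given below. It reflects one possible reading of the idea rather than the reference implementation of [19]; the neighborhood size k and the use of scikit-learn's NearestNeighbors are assumptions.

```python
# Relabel majority-class samples that are close neighbours of minority-class samples.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def spy_relabel(x, y, minority_label=1, k=5):
    x_spy = x[y == minority_label]                 # minority ("spy") samples, (13)
    nn = NearestNeighbors(n_neighbors=k).fit(x)
    _, idx = nn.kneighbors(x_spy)                  # indices of neighbours of each spy
    y_new = np.asarray(y).copy()
    for neighbours in idx:
        for j in neighbours:
            if y_new[j] != minority_label:         # majority-class neighbour, (14)
                y_new[j] = minority_label          # move it to the minority class, (15)
    return x, y_new
```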
3.3 Algorithm

The previous subsections outlined the conceptual basis of our method for COVID-19 prediction from chest X-ray images, focusing in particular on data refinement as a way to improve learning performance; the learning objects and their relations are analyzed by Bayesian reasoning. Putting these notions together, Algorithm 1 expresses the proposed learning method in pseudocode.

ALGORITHM 1. Imbalance in Learning Chest X-ray for COVID-19 Detection
Input: the training chest X-ray images s, classes y; the test images s_test, classes y_test
Output: the prediction y_predict; performance
1: [x, deepKernel] := Get_Deep_Learning_Training_Feature(s, y) by (2-8)
2: [x_test] := Get_Deep_Learning_Test_Feature(s_test, deepKernel)
3: for opt = 0 : 3 do
4:   if opt == 0 then    ▷ Prediction by Deep Learning
5:     [y_predict] := Deep_Learning_Test(x_test)
6:   else
7:     if opt == 1 then    ▷ Ignore checking class imbalance
8:     else
9:       if opt == 2 then    ▷ Check class imbalance by SMOTE
10:        [x, y] := SMOTE(x, y) by (9, 10, 11, 12)
11:      else    ▷ Check class imbalance by SPY
12:        [x, y] := SPY(x, y) by (13, 14, 15)
13:      end if
14:    end if
15:    [svmKernel] := SVM_Training(x, y) by (1)
16:    [y_predict] := SVM_Test(x_test, svmKernel)
17:  end if
18:  [performance] := Evaluate_Performance(y_test, y_predict) by (16)
19: end for
To evaluate learning performance, we use the Accuracy given in (16) [34], where TP stands for true positive, FP for false positive, TN for true negative and FN for false negative:

Acc = (TP + TN) / (TP + TN + FP + FN)    (16)
Accuracy measures how well a binary classification test correctly identifies or excludes a condition. It is suitable when the COVID-19 positive and negative classes deserve the same level of attention. However, when training on a data set with an unequal distribution of samples over the classes, we may also need the area under the ROC curve (AUC) [35]. In this way, the experimental results for the different options in Algorithm 1 can be evaluated and compared.
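Steps 15-18 of Algorithm 1 can be sketched as follows; scikit-learn is used purely for illustration, and the RBF kernel and the probability-based AUC scoring are assumptions rather than details fixed above.

```python
# Train an SVM on the (possibly rebalanced) deep features and evaluate it.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score

def train_and_evaluate(x_train, y_train, x_test, y_test):
    svm = SVC(kernel="rbf", probability=True).fit(x_train, y_train)
    y_predict = svm.predict(x_test)
    y_score = svm.predict_proba(x_test)[:, 1]           # score for the positive class
    return {
        "accuracy": accuracy_score(y_test, y_predict),   # (TP+TN)/(TP+TN+FP+FN), as in (16)
        "auc": roc_auc_score(y_test, y_score),
    }
```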
4 Experiments

To validate the described method and, in particular, to address the real challenge of COVID-19 diagnosis from chest X-ray images, we performed a case study with the COV database [36]. At the time of writing, it contains 244 frontal images of patients potentially positive for COVID-19. Some examples of chest X-ray images from the COV database are displayed in Fig. 2. The database provides labels for the images, namely COVID-19, COVID-19 ARDS, SARS, Streptococcus and others. For cross validation we created 5 folds, each with 30 images for training and 25 images for testing. Table 1 shows the number of images available for each label in the training set and the test set of one fold; the ratio of positive to negative in the training set is 12/18 = 2/3, which is the same as in the test set (10/15 = 2/3). Given the training and test images of one fold, Algorithm 1 can be performed in a structured way. Four different classification options are obtained from the experiment:
1. Classification by a deep learning method (DL).
2. Classification by SVM with features extracted by a deep learning method (marked DL.SVM).
3. Classification by SVM with features first extracted by a deep learning method and then refined by SMOTE (marked DL.SMOTE.SVM).
Fig. 1 Experiments of COVID-19 Classification for the COV database
Fig. 2 Examples of Chest X-ray images from the COV database
Table 1 Number of samples for training and test in one fold

             COVID-19   COVID-19 ARDS   Others   SARS   Streptococcus   Positive   Negative
Training     6          6               6        6      6               12         18
Test         5          5               5        5      5               10         15

a The ratio of positive to negative in the training set is 12/18 = 2/3, the same as in the test set (10/15 = 2/3)
Table 2 AUC for the COV database by ResNet-50 & VGG-16 with SMOTE and SPY

DL method   DL      DL.SVM   DL.SMOTE.SVM   DL.SPY.SVM
ResNet-50   77.03   80.96    83.61          85.45
VGG-16      78.60   87.25    86.78          90.56

a The best results are printed in bold
Table 3 Accuracy for the COV database by ResNet-50 & VGG-16 with SMOTE and SPY

DL method   DL      DL.SVM   DL.SMOTE.SVM   DL.SPY.SVM
ResNet-50   80.51   85.70    86.86          87.71
VGG-16      81.77   88.33    88.20          89.88

a The best results are printed in bold
4. Classification by SVM with features first extracted by a deep learning method and then refined by SPY (marked DL.SPY.SVM).

Note that COVID-19 detection from chest X-ray images using deep learning (DL) and convolutional neural networks was addressed in [39–41]. Automatic detection from X-ray images using convolutional neural network features fed into an SVM (DL.SVM) was presented in [42], while DL.SMOTE.SVM has been applied in other applications [16, 17]. The DL.SPY.SVM option is our new contribution for COVID-19 detection from chest X-rays. In the experiments in Fig. 1, the chest X-ray images are pre-processed in step 1 to extract the lung region and to split the database into five folds. To allow diversity of deep learning methods, we apply ResNet-50 [37] and VGG-16 [38] in step 2. Checking class imbalance is performed in step 3 by SMOTE and SPY. Classification is conducted in step 4, where AUC and Accuracy are the metrics for performance evaluation. Table 2 reports the AUC for the COV database. The classification by ResNet-50 (77.03%) was improved by SVM (80.96%); checking class imbalance raised it further, to 83.61% for SMOTE and 85.45% for SPY. The averaged AUC scores over the five splits are shown as the blue bars in Fig. 3. Furthermore, using VGG-16 instead of ResNet-50 gives an AUC of 78.60%, improved to 87.25% by SVM; finally, SMOTE and SPY raised the AUC to 86.78% and 90.56%, respectively. Table 2 combines the AUC results of the four classification options for the two deep learning models, and Fig. 3 shows these averaged values as red bars.
Fig. 3 AUC for COVID-19 database by ResNet-50 & VGG-16 with SMOTE and SPY
Fig. 4 Accuracy for the COV database by ResNet-50 & VGG-16 with SMOTE and SPY
Regarding the evaluation by Accuracy, the use of ResNet-50 gave an average of 80.51%, extended to 85.70% by SVM; the imbalance methods SMOTE and SPY enhanced it to 86.86% and 87.71%, respectively. These results are reported in Table 3, and Fig. 4 illustrates the scores by blue bars. Similar evaluations are performed for classification with VGG-16: the Accuracy of VGG-16 is 81.77%, improved to 88.33% by SVM, while SMOTE and SPY give 88.20% and 89.88%, respectively. Table 3 gathers the Accuracy results of the four classification options for both models, and Fig. 4 displays the VGG-16 scores by red bars. It should be pointed out that the fourth classification option, VGG-16 features enhanced by SPY, has the best performance in both AUC and Accuracy.

There has recently been significant research interest in the diagnosis of COVID-19 from chest X-ray images. To cope with the imbalanced distribution of infection-region sizes between COVID-19 and pneumonia, the authors of [39] use a dual-sampling strategy to train their network, a 3D convolutional ResNet34 combined with visual attention features. A total of 4,982 chest CT images were collected for their experiments; the attention ResNet34 with uniform sampling achieved an AUC of 94.80% and an Accuracy of 87.9%. Considering DarkNet as the classifier of a deep learning based model, the authors of [40] also use the you-only-look-once (YOLO) real-time object detection system to detect and classify COVID-19 cases from X-ray images. The COV database [36] and the ChestX-ray8 database [44] were the material for that article, and 87.02% was the reported Accuracy.
Table 4 Comparison of the DL.SPY.SVM to other published results

Method                     Database^a                      AUC (%)   Accuracy (%)
SPY with ResNet-50 (our)   COV [36]                        85.45     87.71
SPY with VGG-16 (our)      COV [36]                        90.56     89.88
ResNet34 [39]              4982 images                     94.80     87.90
DarkNet-19 [40]            COV [36], Chest X-ray8 [44]     n/a       87.02
DenseNet-169 [41]          COVID-CT [41] with masks        98.00     89.00

a The databases are not the same
By incorporating lung masks and lesion masks via multi-task learning and leveraging pretraining, the DenseNet-169 in [41] achieved an AUC of 98.0% and an accuracy of 89.0%. Note that this DenseNet-169 was trained on a combined dataset under different pre-training methods. The research results mentioned above are summarized in Table 4, with the caveat that different databases were used for the experiments.
5 Conclusions

We have presented an imbalance-aware method for learning chest X-ray images for COVID-19 detection. The method can deal with X-ray image classification problems that are typically hard to handle by a regular learner. It requires accurate feature preparation by a range of deep learning models as well as checking class imbalance in the training set. The SPY method has considerable potential for refining the training set in the case of class imbalance, and the kernel-based classification by SVM can enhance the learning performance in both AUC and Accuracy. The joint method covering deep learning, class imbalance handling and SVM proves to be an effective solution for classification, especially suitable for chest X-ray images. Its performance is similar to or outperforms state-of-the-art methods based on deep learning models. For feature extraction from X-ray images, VGG-16 may be preferable for our case study with the COV database; in other cases, selecting a different deep learning model for feature extraction can lead to superior performance. It is worth noting that, once the class imbalance is handled, the classification performance improves consistently. Currently, our research concentrates on applying manifold techniques to decrease the number of features, which is very large due to the nature of deep learning models.

Acknowledgements Authors would like to thank the Vietnam Ministry of Education and Training for supporting the B2021-SPH-01 project on "A research on graph representation learning methods and applying for mining biomedical data". Authors thank also the anonymous referees for their careful review and constructive comments.
References 1. Wang, S., Yao, X.: Multiclass imbalance problems: analysis and potential solutions. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics), 42(4), 1119–1130 (2012) 2. Haixiang, G., Yijing, L., Shang, J., Mingyun, G., et al.: Learning from class-imbalanced data: review of methods and applications. Expert Syst. App. 73, 220–239 (2017) 3. Shelke, M.S., Deshmukh, P.R., Shandilya, V.K.: A review on imbalanced data handling using undersampling and oversampling technique. Int. J. Recent Trends Eng. Res. 3, 444–449 (2017) 4. Shakeel, F., Sabhitha, A.S., Sharma, S.: Exploratory review on class imbalance problem: an overview. In: 2017 8th International Conference (ICCCNT), pp. 1–8. IEEE (2017) 5. Krestenitis, M., Orfanidis, G., Ioannidis, K., Avgerinakis, K., et al.: Oil spill identification from satellite images using deep neural networks. Remote Sens. 11(15), 1762 (2019) 6. Ratadiya, P., Moorthy, R.: Spam filtering on forums: a synthetic oversampling based approach for imbalanced data classification. arXiv preprint arXiv:1909.04826 (2019) 7. Bian, Y., Cheng, M., Yang, C., Yuan, Y., Li, Q., et al.: Financial fraud detection: a new ensemble learning approach for imbalanced data. In: PACIS, p. 315 (2016) 8. Chang, Q., Lin, S., Liu, X.: Stacked-SVM: a dynamic SVM framework for telephone fraud identification from imbalanced CDRs. In: Proceedings of the 2019 2nd International Conference on Algorithms, Computing and Artificial Intelligence, pp. 112–120 (2019) 9. Fotouhi, S., Asadi, S., Kattan, M.W.: A comprehensive data level analysis for cancer diagnosis on imbalanced data. J. Biomed. Inform. 90, 103089 (2019) 10. Anh D.N., Hoang N.T.: Learning validation for lung CT images by multivariable class imbalance. In: Frontiers in Intelligent Computing: Theory and Applications. Advances in Intelligent Systems and Computing, vol 1013 (2020) 11. Fernandez, A., Garcia, S., Herrera, F., Chawla, N.V.: SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. AI Res. 61, 863–905 (2018) 12. Hu, S., Liang, Y., Ma, L., He, Y.: MSMOTE: improving classification performance when training data is imbalanced. In: 2009 second international workshop on computer science and engineering, Vol. 2, pp. 13–17. IEEE (2009) 13. Saez, J.A., Luengo, J., Stefanowski, J., Herrera, F.: SMOTE-IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf. Sci. 291, 184–203 (2015) 14. Kaur, H., Pannu, H.S., Malhi, A.K.: A systematic review on imbalanced data challenges in machine learning: applications and solutions. ACM Comput. Surv. 52(4), 1–36 (2019) 15. Wang, K.J., Makond, B., Chen, K.H., Wang, K.M.: A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. Appl. Soft Comput. 20, 15–24 (2014) 16. Gao, R., Peng, J., Nguyen, L., Liang, Y., Thng, S., Lin, Z.: Classification of non-tumorous facial pigmentation disorders using deep learning and SMOTE. In: 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. IEEE (2019) 17. Pandey, S.K., Janghel, R.R.: Automatic detection of arrhythmia from imbalanced ECG database using CNN model with SMOTE. Australasian Phys. Eng. Sci. Med. 42(4), 1129–1139 (2019) 18. Dang, X.T., Bui, D.H., Nguyen, T.H., Nguyen, T.Q.V., Tran, D.H.: Prediction of autism-related genes using a new clustering-based under-sampling method. In: 2019 11th Inter. Conf. on Knowledge and Systems Engineering (KSE), pp. 1–6. IEEE (2019) 19. 
Dang, X.T., Tran, D.H., Hirose, O., Satou, K.: SPY: a novel resampling method for improving classification performance in imbalanced data. In: 2015 Seventh International Conference on Knowledge and Systems Engineering (KSE), pp. 280–28. IEEE (2015) 20. Tang, Y., Zhang, Y.Q., Chawla, NV., Krasser, S.: SVMs modeling for highly imbalanced classification. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics), 39(1), 281–288 (2008) 21. Sanz, J.A., Bernardo, D., Herrera, F., et al.: A compact evolutionary interval-valued fuzzy rulebased classification system for the modeling and prediction of real-world financial applications with imbalanced data. IEEE Trans. Fuzzy Syst. 23(4), 973–990 (2014)
22. Park, Y., Ghosh, J.: Ensembles of (α)-trees for imbalanced classification problems. IEEE Trans. Knowl. Data Eng. 26(1), 131–143 (2012) 23. Wu, Q., Ye, Y., Zhang, H., Ng, M.K., Ho, S.S.: ForesTexter: an efficient random forest algorithm for imbalanced text categorization. Knowl. Based Syst. 67, 105–116 (2014) 24. Zieba, M., Tomczak, J.M., Lubicz, M., Swiatek, J.: Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients. Appl. Soft Comput. 14, 99–108 (2014) 25. Shao, Y.H., Chen, W.J., Zhang, J.J., et al.: An efficient weighted Lagrangian twin support vector machine for imbalanced data classification. Pattern Recogn. 47(9), 3158–3167 (2014) 26. Krawczyk, B., Galar, M., Jelen, L., Herrera, F.: Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy. Appl. Soft Comput. 38, 714–726 (2016) 27. Haixiang, G., Yijing, L., Yanan, L., Xiao, L., Jinling, L.: BPSO-Adaboost-KNN ensemble learning algorithm for multi-class imbalanced data classification. Eng. Applic. Artif. Intell. 49, 176–193 (2016) 28. Lee, W., Jun, C.H., Lee, J.S.: Instance categorization by support vector machines to adjust weights in AdaBoost for imbalanced data classification. Inf. Sci. 381, 92–103 (2017) 29. Sun, Z., Song, Q., Zhu, X., Sun, H., Xu, B., Zhou, Y.: A novel ensemble method for classifying imbalanced data. Pattern Recogn. 48(5), 1623–1637 (2015) 30. Lu, Y., Cheung, Y.M., Tang, Y.Y.: Hybrid sampling with bagging for class imbalance learning. In: Pacific-Asia Conference on Knowledge Discovery Data Mining pp. 14–26, Springer (2016) 31. LeCun, Yann., Bengio, Yoshua, Hinton, Geoffrey: Deep learning. Nature. 521(7553), 436–444 (2015) 32. Hajnal, J., Hawkes, D., Hill, D.: Medical Image Registration. CRC Press, Baton Rouge, Florida (2001) 33. Barber, D.: Bayesian Reasoning Machine Learning. Cambridge University Press (2012) 34. Taylor, J.R.: An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. University Science Books. pp. 128–129 (1999) 35. Swets, John A.: Signal Detection Theory ROC Analysis in Psychology Diagnostics: Collected Papers. Lawrence Erlbaum Associates, Mahwah, NJ (1996) 36. Cohen, J.P., Morrison, P., Dao, L., Roth, K., Duong, T.Q., Ghassemi, M.: Covid-19 image data collection: prospective predictions are the future. arXiv:2006.11988 (2020) 37. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision pattern recognition, pp. 770–778 (2016) 38. Russakovsky, O., Deng, J., Su, H., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV). 115(3), 211–252 (2016) 39. Xi, O., Jiayu, H., Liming, X., et al.: Dual-sampling attention network for diagnosis of COVID19 from community acquired pneumonia. arXiv:2005.02690 (2020) 40. Ozturk, T., Talo, M., Yildirim, E.A., Baloglu, U.B., Yildirim, O., Acharya, U.R.: Automated detection of COVID-19 cases using deep neural networks with X-ray images. Comput. Biol. Med. 121 (2020) 41. Zhao, J., Zhang, Y., He, X., Xie, P.: COVID-CT-dataset: a CT scan dataset about covid-19. arXiv preprint arXiv:2003.13865 (2020) 42. Sethy, P.K., Behera, S.K., Ratha, P.K., Biswas, P.: Detection of coronavirus disease (COVID-19) based on deep features. Preprints, 2020030300 (2020) 43. Apostolopoulos, I.D., Bessiana, T.: Covid-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks. 
arXiv preprint arXiv:2003.11617 (2020) 44. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.: Chest X-ray8: hospital-scale chest X-ray database benchmarks on weakly-supervised classification localization of common thorax diseases. In: Proceedings of the IEEE Conference on CVPR, pp. 2097–2106 (2017)
Deep Learning Based COVID-19 Diagnosis by Joint Classification and Segmentation Tien-Thanh Tran, Thi-Thao Tran, and Van-Truong Pham
Abstract COVID-19 is currently one of the most life-threatening problems around the world. Fast and accurate detection of COVID-19 infection plays an important role in identifying patients, making better decisions and ensuring treatment. In this paper, we propose a fast and efficient method to identify COVID-19 patients based on a deep learning approach. The proposed approach includes segmentation and classification stages. The segmentation stage employs a U-Net neural network to accurately segment the lung region from chest CT images. The classification stage is achieved by a DenseNet169 model. We applied the proposed model to a dataset containing 349 CT scans that are positive for COVID-19 and 397 negative CT scans that are normal or contain other types of diseases. Experiments show that our model outperforms other methods in terms of the accuracy, sensitivity, F1 and AUC evaluation metrics.

Keywords COVID-19 · Deep Learning · Segmentation · Classification · Computed tomography
1 Introduction

The SARS-CoV-2 pandemic (COVID-19), spreading globally since the beginning of 2020, has not shown any signs of remission. According to the World Health Organization (WHO), to date there have been more than 28 million people infected and more than 900 thousand deaths [1]. It is causing the world to face crises in many aspects, from medical to economic. One of the main causes of death is the lack of timely diagnosis and treatment, so early detection of COVID-19 cases plays a critical role in treatment and in preventing the disease's spread in the community. Until now, the gold standard for detecting COVID-19 has been the Reverse Transcription Polymerase Chain Reaction (RT-PCR) test [2], which detects viral RNA from sputum or a nasopharyngeal swab. The main drawback of the RT-PCR method is the long waiting
time for results, which can take days, and the limited equipment available to perform this method in most hospitals [3]. Furthermore, its low sensitivity prevents new positive cases from being detected and isolated in time. Radiological examinations such as X-ray and CT scans have emerged as potential alternatives for COVID-19 diagnosis. Compared to X-ray images, lung CT images have better quality and are richer in information [3]. Thus, automatic diagnosis of lung lesions from CT scans to detect infected patients offers great potential in coping with this pandemic [4]. Computer diagnostics can save time, help minimize the overload on medical systems, and even provide accuracy equivalent to experts in some cases [3]. While using lung CT images for diagnosing COVID-19 is promising, it has some drawbacks. The first is the limited availability of annotated public COVID-19 CT datasets. Second, lung CT images have some unique characteristics, such as the variability of lesions, density differences between images, and the lack of a clear distinction between normal and lesion sites on the lung. Recent studies have shown the effectiveness of deep learning in many vision problems, such as segmentation [5–8], object detection [9, 10] and classification [11–14]. Deep learning-based methods have also achieved many successes in biomedical image processing. Tran [15] proposed a deep fully convolutional neural network (FCN) architecture to tackle automated left and right ventricle segmentation. For the task of classifying skin lesions from dermoscopic images, the authors in [11] proposed using transfer learning from a VGG16 model pre-trained on the ImageNet dataset. Another deep fully convolutional architecture, named U-Net, consisting of a contracting path to capture context and a symmetric expanding path that enables precise localization, was proposed in [7]. In this work, we address the two problems above, in a first and a second stage respectively, by (1) accurately segmenting the lung region from chest CT scans with a deep encoder-decoder model and (2) classifying the segmented lungs as COVID-19 positive or negative. Chest CT images collected from different machines have different properties, such as lung sizes, background pixel values, contrast and brightness. While most of these properties can be handled by deep learning, differences in lung size and background noise limit the performance of the model. Hence, accurate segmentation of the lung area, and thereby the removal of the noisy background, is essential and makes the system more robust to various types of CT scan. In the first stage, we apply the deep fully convolutional architecture U-Net [7] to produce a fast, accurate and robust model for detecting the lung region; we use this stage to filter the noisy background of the dataset used for training the models of the second stage. In the second stage, we train the DenseNet169 [12] architecture to detect COVID-19 cases. Experiments show that our model outperforms other methods on many evaluation metrics. The remainder of this paper is structured as follows. In Sect. 2, we describe our system in detail, which is mainly based on the segmentation and classification tasks. Section 3 presents the training protocol, Sect. 4 presents the experimental results, and the discussion and conclusion are given in Sect. 5.
Table 1 Dataset split captured from the dataset's paper [4]

Class                     Train   Val   Test
# patients   COVID        130     32    54
             Non-COVID    105     24    42
# images     COVID        191     60    98
             Non-COVID    234     58    105
2 Methodology

2.1 Dataset Description

In this study, we used the public COVID19-CT dataset proposed in [4]. The dataset contains 349 CT scans that are positive for COVID-19 and 397 negative CT scans that are normal or contain other types of diseases. The positively labeled scans come from 143 patient cases. The COVID-19-negative CTs, selected from public datasets and search engines, are scans of various diseases. The dataset was built from 760 preprints about COVID-19 from medRxiv and bioRxiv posted from Jan 19th to Mar 25th; the low-level structure information of the preprint PDF files was extracted by PyMuPDF, and the quality of the samples is well preserved. However, because the CT images in the dataset have different lung sizes and noisy backgrounds, we did not train on the dataset directly. Instead, we determined the lung area of every sample in the COVID19-CT dataset and eliminated the unnecessary background using the segmentation stage, obtaining a new filtered dataset. We trained our classification model on this filtered dataset. At inference time, we pre-process test samples with the segmentation stage before passing them through the classification stage to get the final prediction (Table 1).
2.2 Proposed Method

In this study, we propose a system to diagnose patients with COVID-19 using deep learning. The system consists of two stages: a segmentation stage and a classification stage. We first train the U-Net on a lung-segmentation dataset collected from several public datasets and save the pre-trained weights. The pre-trained segmentation model is then used to identify the lung region; in this way, it removes the unnecessary background from the images of the dataset proposed in [4] that is used for training the classification model. We then train the DenseNet169 [12] on the filtered dataset and save its weights. At inference time, test samples are passed through the trained models to get predictions, and the final output of the system represents the probability that a test sample comes from a positive patient.
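A minimal PyTorch sketch of this two-stage inference is given below; the names unet and densenet, the 224 x 224 resize and the 0.5 mask threshold are illustrative assumptions rather than details fixed by the paper.

```python
# Stage 1: predict the lung mask and crop to it; Stage 2: classify the cropped lung.
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_covid(image, unet, densenet, mask_threshold=0.5):
    # image: tensor of shape (1, C, H, W); unet outputs one logit map per pixel;
    # a non-empty lung mask is assumed here for simplicity
    mask = (torch.sigmoid(unet(image)) > mask_threshold).float()
    ys, xs = torch.nonzero(mask[0, 0], as_tuple=True)          # lung bounding box
    y0, y1 = ys.min().item(), ys.max().item()
    x0, x1 = xs.min().item(), xs.max().item()
    cropped = image[:, :, y0:y1 + 1, x0:x1 + 1]
    cropped = F.interpolate(cropped, size=(224, 224), mode="bilinear", align_corners=False)
    return torch.sigmoid(densenet(cropped)).item()              # probability of COVID-19
```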
2.3 Segmentation Stage

For the image segmentation task in the first stage, we apply U-Net [7] to segment the lung area from lung CT images. The U-Net neural network proposed by Ronneberger is an improved version of the fully convolutional network architecture; its structure, shown in Fig. 1, extends the FCN [6]. It combines a contracting part (the encoder) and an expanding part (the decoder). The contracting part is composed of convolutional and max-pooling layers, while the expanding part aggregates the intermediate encoder features with upsampling and convolutional layers. To recover fine-grained features that may be lost in the downsampling stage, skip connections are used by concatenating equally sized feature maps. U-Net has been shown to be applicable to many medical image segmentation problems. Some representative results of applying U-Net to segment lung CT images from the dataset are presented in Fig. 2, where the original images are shown in the first column, the segmentation results by U-Net in the second column, and the cropped results in the last column.
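To make the encoder-decoder structure with skip connections concrete, a small PyTorch sketch in the spirit of U-Net is shown below. The depth, channel widths and class name are illustrative assumptions and do not reproduce the exact configuration of [7] or of the network used in the experiments.

```python
# Minimal U-Net-style encoder-decoder with skip connections.
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # two 3x3 convolutions with ReLU: the basic U-Net building block
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    def __init__(self, in_ch=1, out_ch=1):
        super().__init__()
        self.enc1 = double_conv(in_ch, 32)
        self.enc2 = double_conv(32, 64)
        self.bottleneck = double_conv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = double_conv(128, 64)        # 64 upsampled + 64 skip channels
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = double_conv(64, 32)         # 32 upsampled + 32 skip channels
        self.head = nn.Conv2d(32, out_ch, 1)    # per-pixel lung/background logit

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)
```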
Fig. 1 Basic structure of the U-Net (figure adapted from [7])
Fig. 2 Some representative segmentation results by U-Net. (a) Input lung CT images, (b) Predict results from U-Net, (c) Segment results plotted on original images, and (d) Cropped results
2.4 Classification Stage

The Dense Convolutional Network (DenseNet) is an advanced neural network proposed by Huang et al. in [12]. The core idea behind DenseNet is that convolutional networks can be substantially deeper, more accurate and more efficient to train if they contain shorter connections between layers close to the input and those close to the output. In DenseNet, the layers within a block are connected directly with each other to ensure maximum information flow through the network: each layer obtains additional inputs from all preceding layers and passes its own feature maps to all subsequent layers. With this densely connected architecture, DenseNet has better parameter efficiency, alleviates the vanishing-gradient problem, strengthens feature propagation and encourages feature reuse. When this model came out, it outperformed the then state-of-the-art results on most benchmark tasks. In this study, DenseNet169 is used as the deep learning architecture for training on the classification dataset, due to the limited amount of data. The structure of DenseNet169 is given in Fig. 3.
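A minimal sketch of adapting DenseNet169 to the binary COVID / non-COVID decision is shown below; the use of torchvision's ImageNet-pretrained weights and of a single-logit head are assumptions made for illustration, not details fixed by this paper.

```python
# Replace the ImageNet classifier of DenseNet169 with a single-logit binary head.
import torch.nn as nn
from torchvision import models

model = models.densenet169(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, 1)  # COVID vs. non-COVID logit
```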
3 Training Protocol

For training our classification model, we use the same protocol in all experiments. All experiments were run on Google Colab with the PyTorch framework. Training is done for 120 epochs with a fixed learning rate of 1e-4, and the checkpoint with the best performance on the validation set is saved for evaluation on the test set. The loss function used here is binary cross-entropy, and we choose the Adam algorithm as the optimizer.
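The stated protocol can be sketched as follows; the data loaders, the use of validation accuracy to select the best checkpoint, and the logit-based form of binary cross-entropy are illustrative assumptions.

```python
# 120 epochs, Adam with lr 1e-4, binary cross-entropy, keep the best validation checkpoint.
import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, device, epochs=120, lr=1e-4):
    model = model.to(device)
    criterion = nn.BCEWithLogitsLoss()                      # binary cross-entropy on logits
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_acc, best_state = 0.0, copy.deepcopy(model.state_dict())
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.float().to(device)
            optimizer.zero_grad()
            loss = criterion(model(images).squeeze(1), labels)
            loss.backward()
            optimizer.step()
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                logits = model(images.to(device)).squeeze(1).cpu()
                preds = (torch.sigmoid(logits) > 0.5).float()
                correct += (preds == labels.float()).sum().item()
                total += labels.numel()
        if correct / total > best_acc:                      # best validation checkpoint
            best_acc, best_state = correct / total, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```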
Fig. 3 DenseNet169 architecture
For the evaluation criteria, we used several metrics, including accuracy, precision, recall, F1 score and AUC. Accuracy is the fraction of correct predictions out of all predictions. Recall (sensitivity) indicates, out of the total actual coronavirus cases, how many are identified as COVID-19 affected. Precision indicates, out of the total subjects predicted as coronavirus cases, how many are actually infected with COVID-19. The F1 score is the harmonic mean of precision and recall. AUC measures the entire two-dimensional area underneath the ROC curve from (0, 0) to (1, 1).
4 Experimental Results

In this section, we conduct experiments to evaluate the performance of the proposed model. The first experiment examines the role of the segmentation step in the proposed system; the second compares the performance of the proposed model with state-of-the-art methods. In the first experiment, we compared the system performance with and without the segmentation stage. To this end, we trained the DenseNet169 on the original COVID19-CT dataset and on the filtered dataset. The performance of the model when training with and without the segmentation stage is reported in Table 2. As we can observe, the model trained without the segmentation stage has an accuracy lower than 0.82, an F1 score lower than 0.83 and an AUC score of around 0.88. With the segmentation stage, we obtain better performance on all metrics: the accuracy and F1 score reach 0.87, and the AUC score reaches 0.91. These results confirm the usefulness of the segmentation stage for detecting the lung area and removing the noisy background of the CT scans. In the next experiment, we compare our model with four other methods, those of Liu et al. [16], Silva et al. [17], Jiang et al. [18] and He et al. [19], using the same dataset. The comparative results are shown in Table 3. As we can see from Table 3, the best result for image classification was obtained using our method with an
Table 2 Performance comparison between the system without and with the segmentation stage

Metrics       Dense169
Accuracy      without segmentation    0.82
              with segmentation       0.87
Sensitivity   without segmentation    0.76
              with segmentation       0.91
F1            without segmentation    0.83
              with segmentation       0.87
AUC           without segmentation    0.88
              with segmentation       0.91
Table 3 Performance comparison between our proposed model and other methods that also used the COVID19-CT dataset

Metrics       Method [16]   Method [17]   Method [18]   Method [19]   Our approach
Accuracy      0.85          0.88          0.83          0.86          0.87
Sensitivity   0.86          -             -             -             0.91
F1            0.85          0.86          0.84          0.85          0.87
AUC           0.91          0.90          0.89          0.94          0.91
accuracy of 0.87 versus 0.86 by He et al. [19], 0.83 by Jiang et al. [18], 0.88 by Silva et al. [17], and 0.85 by Liu et al. [16]. Our system also outperformed the other four methods in F1 score, and we reach an accuracy comparable to that of He et al. [19].
5 Conclusion

We have proposed a multi-stage system for detecting COVID-19 from chest CT scans of patients. The proposed approach includes two stages: segmentation of the lung area from CT images, and classification into COVID and non-COVID patients. The segmentation step is performed by a U-Net neural network, while the classification is conducted by a DenseNet169 network. Experimental results show that the proposed approach outperforms other state-of-the-art methods.

Acknowledgements This research is funded by the Hanoi University of Science and Technology (HUST) under project number T2020-PC-017.
References 1. https://covid19.who.int. Accessed 12 Sep 2020 2. Ai, T., Yang, Z., Hou, H., Zhan, C., Chen, C., Lv, W., Tao, Q., Sun, Z., Xia, L.: Correlation of chest CT and RT-PCR testing for coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases. Radiology 296(2), E32–E40 (2020). https://doi.org/10.1148/radiol.2020200642 3. Shi, F., Wang, J., Shi, J., Ziyan, Wu., Wang, Q., Tang, Z., He, K., Shi, Y., Shen, D.: Review of artificial intelligence techniques in imaging data acquisition, segmentation, and diagnosis for COVID-19. IEEE Rev. Biomed. Eng. 14, 4–15 (2021). https://doi.org/10.1109/RBME.2020. 2987975 4. Yang, X., He, X., Zhao, J., Zhang, Y., Zhang, S., Xie, P.: COVID-CT-dataset: a CT scan dataset about COVID-19 (2020). https://arxiv.org/abs/2003.13865 5. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495 (2017) 6. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440 (2015) 7. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Proceedings International Conference Medical Image Computer Comput.Assist. Intervent, pp. 234–241 (2015) 8. Pham, V.T., Tran, T.T., Wang, P.C., Lo, M.T.: Tympanic membrane segmentation in otoscopic images based on fully convolutional network with active contour loss, signal. Image Video Process. (2020). https://doi.org/10.1007/s11760-020-01772-7 9. Girshick, R.: Fast R-CNN. In: International Conference on Computer Vision (ICCV), pp. 1440– 1448 (2015) 10. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real time object detection. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788 (2016) 11. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings 36th International Conference Machine Learn, pp. 6105–6114 (2019) 12. Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016) 14. Tran, T.T., Fang, T.Y., Pham, V.T., Lin C., Wang, P.C., Lo, M.T.: Development of an automatic diagnostic algorithm for pediatric otitis media, Otol. Neurotol. 39, 1060–1065 (2018) 15. Tran, P.V.: A fully convolutional neural network for cardiac segmentation in short-axis MRI (2016). https://arxiv.org/abs/1604.00494 16. Liu, B., Gao, X., He, M., Liu, L., Yin, G.: A fast online covid-19 diagnostic system with chest CT scans. In: 26TH ACM Sigkdd Conference on Knowledge Discovery and Data Mining (Health Day) (2020) 17. Silva, P., Luz, E., Silva, G., Moreira, G., Silva, R., Lucio, D.M.D.: Efficient Deep Learning Model for COVID-19 Detection in large CT images datasets: a cross-dataset analysis. Res. Square (2020). https://doi.org/10.21203/rs.3.rs-41062/v1 18. Jiang, Y., Zeng, Z., Zhou, B.: Deep learning aided CT diagnosis on Convid-19. https://noiselab. ucsd.edu/ECE228/projects/Report/10Report.pdf. Accessed 2020 19. 
He, X., Yang, X., Zhang, S., Zhao, J., Zhang, Y., Xing, E., Xie, P.: Sample-efficient deep learning for COVID-19 diagnosis based on CT scans, MedRxiv 7 (2020)
General Computational Intelligence Techniques and Their Applications
Why It Is Sufficient to Have Real-Valued Amplitudes in Quantum Computing Isaac Bautista, Vladik Kreinovich, Olga Kosheleva, and Hoang Phuong Nguyen
Abstract In the last decades, a lot of attention has been placed on quantum algorithms—algorithms that will run on future quantum computers. In principle, quantum systems can use any complex-valued amplitudes. However, in practice, quantum algorithms only use real-valued amplitudes. In this paper, we provide a simple explanation for this empirical fact.
1 Formulation of the Problem

Need for Quantum Computing. For many practical problems, there is still a need for faster computations. For example, the current spectacular successes of deep learning (see, e.g., [2]) could be even more spectacular if we could process even more data. Computers' ability to process information is limited, among other things, by the fact that all speeds are bounded by the speed of light. Even at the speed of light, sending a signal from one side of a 30-cm laptop to the other takes 1 nanosecond—the time during which even the cheapest of current computers performs at least 4 operations. So, to make computations faster, it is necessary to make computer components much smaller. Already these components—such as memory cells—consist of a small number of molecules. If we make these cells much smaller, they will consist of only a few molecules. To describe the behavior of such small objects, it is necessary to take into account quantum physics—the physics of the microworld; see, e.g., [1, 4]. Thus, computers need to take into account quantum effects.
Successes of Quantum Computing. At first, computer engineers viewed quantum effects as a nuisance—since in quantum physics, everything is probabilistic, but we want computers to always produce the same correct result, with probability 1. Because of this probabilistic character of quantum physics, we cannot simply use the same algorithms on quantum-level computers; we need to come up with new algorithms, algorithms that provide reliable answers even in the probabilistic environment of quantum physics. Such algorithms have been invented; see, e.g., [3]. Interestingly, many of them require even fewer computational steps than the usual non-quantum algorithms. The two most well-known examples are:
• Grover's algorithm, which finds an element with the desired property in an unsorted n-element array in time proportional to √n, while non-quantum algorithms require at least n steps, and
• Shor's algorithm, which factors large n-digit integers in time bounded by a polynomial of n; this may sound like an academic problem until one realizes that most existing encodings protecting our privacy and security are based on the fact that, with non-quantum algorithms, the only known algorithms for such factorization require physically impossible exponential time.

Main Idea Behind Quantum Computing. How come quantum computing algorithms can be so much faster? The main explanation is that in quantum physics, with every two states s and s′, we can also have superpositions of these states, i.e., states of the type a · s + a′ · s′, where a and a′ are complex numbers (known as amplitudes) for which |a|² + |a′|² = 1. Philosophers and journalists are still arguing about the example, proposed by the Nobelist Schrodinger, one of the founding fathers of quantum physics, that we can have a superposition of a dead cat and an alive cat, but for particles, such superpositions have been experimentally observed since the early 20th century. In particular, in a quantum computer, in addition to the usual 0 and 1 states of every bit—which in quantum computing are denoted |0⟩ and |1⟩—we can also have superpositions of these states, i.e., states of the type c_0|0⟩ + c_1|1⟩, where c_0 and c_1 are complex numbers for which |c_0|² + |c_1|² = 1. A quantum system corresponding to a bit is known as a quantum bit, or qubit, for short.
If we measure the state of the qubit, we will get 0 with probability |c_0|² and 1 with probability |c_1|². The fact that these probabilities should add up to 1 explains the above restriction on the coefficients. Similarly, for 2-bit combinations, in addition to the traditional (non-quantum) states 00, 01, 10, and 11, we can have superpositions

c_00|00⟩ + c_01|01⟩ + c_10|10⟩ + c_11|11⟩,

where c_00, c_01, c_10, and c_11 are complex numbers for which |c_00|² + |c_01|² + |c_10|² + |c_11|² = 1. In general, for an n-qubit system, we can have states

c_0...00|0...00⟩ + c_0...01|0...01⟩ + ... + c_1...11|1...11⟩

characterized by a complex-valued vector c = (c_0...00, c_0...01, ..., c_1...11). How can this help with computations? For example, in the non-quantum search-in-an-array algorithm, the only thing we can do is select an integer i and check whether the i-th element of the array satisfies the desired property. This way, if we make fewer than n checks, we check fewer than n elements and we may miss the desired element—this explains why we need at least n computational steps in the non-quantum case. In quantum physics, instead of asking for an element number i, we can submit a request which is a superposition of some integers, c_i|i⟩ + c_i′|i′⟩ + ... This way, in effect, we can check several elements in one step. This is just an idea, not an explanation of Grover's algorithm—on the one hand, we can check several elements, but on the other hand, if we do it naively, the results will be probabilistic—and we want guaranteed bounds. So, to compensate for the probabilistic character of quantum measurements, we need to use quite some ingenuity. Sometimes it works—as in Grover's and Shor's cases—and sometimes it does not.

Unitary Transformations and Beyond. When we describe a bit in non-quantum physics, what is important is that we have a system with two states; which of the two states is associated with 0 and which with 1 does not matter that much. From this viewpoint, all the properties of the bit system are invariant with respect to the swap 0 ↔ 1. In the quantum case, in addition to a swap, we can also have an arbitrary unitary transformation c → T c, where T is a unitary matrix, i.e., a matrix for which T T† = T† T = I, where I is the unit matrix, (T†)_ij = (T_ji)*, and z* denotes the complex conjugate:
(x + y · i)* = x − y · i  and  i = √−1.
In particular, for each qubit, we can have Walsh-Hadamard transformations—actively used in quantum computing—in which

T|0⟩ = (1/√2)|0⟩ + (1/√2)|1⟩  and  T|1⟩ = (1/√2)|0⟩ − (1/√2)|1⟩.
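As a small numerical illustration (a sketch of ours, using numpy, not part of the formal argument), the Walsh-Hadamard matrix can be written down explicitly and checked to be unitary and length-preserving:

```python
import numpy as np

T = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)                  # Walsh-Hadamard transformation

print(np.allclose(T @ T.conj().T, np.eye(2)))         # T T† = I, so T is unitary
state = np.array([0.6, 0.8])                          # some qubit state c_0|0> + c_1|1>
print(np.isclose(np.linalg.norm(T @ state), np.linalg.norm(state)))  # length preserved
```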
Invariance means, in particular, that for any quantum algorithm that uses states s_1, s_2, etc., and for every unitary transformation T, we can perform the same computations by using instead the states T s_1, T s_2, etc. A unitary transformation maps each vector from the original space into a vector from the same space, and preserves the vector's length
‖(c_1, c_2, ...)‖² = |c_1|² + |c_2|² + ...

One can also consider generalized unitary transformations, when each vector is mapped into a vector from a possibly higher-dimensional space—as long as this transformation preserves the lengths of all vectors. Similarly, for any quantum algorithm that uses states s_1, s_2, etc., and for every generalized unitary transformation T, we can perform the same computations by using instead the states T s_1, T s_2, etc.

Interesting Phenomenon. Many researchers have come up with many creative quantum algorithms for solving important practical problems. And there is a general—and somewhat unexpected—feature of all these algorithms:
• while, in general, we can have states with arbitrary complex values of the coefficients c_i,
• in all proposed algorithms, the coefficients are real-valued!

It should be mentioned that this does not mean that we cannot use non-real complex values in these algorithms. For example, one can see that all probabilities remain the same if, instead of the original coefficients c_i, we use coefficients c_i′ = exp(α · i) · c_i, where α is a real-valued constant. In particular, if we take α = π/2,
we can replace all real values ci with purely imaginary values i · ci . This possibility also follows from the fact that this transformation can be described as c → T c, where the diagonal matrix T = diag(exp(α · i), exp(α · i), . . . , exp(α · i))
is, as one can easily check, unitary. What the above empirical fact means is that it is sufficient to use only real-valued amplitudes—in the sense that whatever we can do with complex-valued amplitudes, we can do with real-valued amplitudes as well.

Important Challenge. A natural question is: why? Why are real-valued amplitudes sufficient for quantum computing?

What We Do in This Paper. In this paper, we provide a simple and natural explanation for this empirical fact, thus showing that it holds not only for all known quantum algorithms: for any future quantum algorithm, it is also sufficient to use real-valued amplitudes.
2 Our Explanation

Main Idea: 1-Qubit States. Suppose that at some point, a quantum algorithm uses a state c_0|0⟩ + c_1|1⟩ in which c_0 and c_1 are non-real complex numbers, c_0 = a_0 + b_0 · i and c_1 = a_1 + b_1 · i, i.e., the state has the form

(a_0 + b_0 · i)|0⟩ + (a_1 + b_1 · i)|1⟩.

Then, we can form a related state of a 2-qubit system, with an additional qubit:
• whose 0 state corresponds to the real parts of the amplitudes and
• whose 1 state corresponds to the imaginary parts:

a_0|00⟩ + b_0|01⟩ + a_1|10⟩ + b_1|11⟩.

One can see that this transformation from the original 1-qubit state with complex coefficients to a real-valued 2-qubit state preserves the length of each vector and is, thus, generalized unitary.

2-Qubit States. Similarly, we can transform a general 2-qubit state

(a_00 + b_00 · i)|00⟩ + (a_01 + b_01 · i)|01⟩ + (a_10 + b_10 · i)|10⟩ + (a_11 + b_11 · i)|11⟩,

where a_ij and b_ij are real numbers, into the following real-valued state of a 3-qubit system with an additional auxiliary qubit:

a_00|000⟩ + b_00|001⟩ + a_01|010⟩ + b_01|011⟩ + a_10|100⟩ + b_10|101⟩ + a_11|110⟩ + b_11|111⟩.

A small numerical sketch of this real-valued encoding is given below.
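The following numpy sketch (an illustration of ours, not part of the formal argument) encodes a complex state by interleaving the real and imaginary parts of its amplitudes, which corresponds to adding the auxiliary qubit described above, and checks that the vector length is preserved:

```python
# Map a complex n-qubit state (2^n amplitudes) to a real (n+1)-qubit state (2^(n+1) amplitudes).
import numpy as np

def to_real_state(c):
    # interleave real and imaginary parts: auxiliary qubit 0 = real part, 1 = imaginary part
    out = np.empty(2 * len(c))
    out[0::2] = c.real
    out[1::2] = c.imag
    return out

rng = np.random.default_rng(0)
c = rng.normal(size=4) + 1j * rng.normal(size=4)   # a 2-qubit state
c /= np.linalg.norm(c)                             # normalize: sum of |c_k|^2 is 1

r = to_real_state(c)
print(np.isclose(np.linalg.norm(c), np.linalg.norm(r)))   # True: length is preserved
```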
This transformation from the original 2-qubit state with complex coefficients to a real-valued 3-qubit state preserves the length of each vector and is, thus, generalized unitary.

General Case. In general, we can transform an arbitrary n-qubit state

(a_0...00 + b_0...00 · i)|0...00⟩ + (a_0...01 + b_0...01 · i)|0...01⟩ + ... + (a_1...11 + b_1...11 · i)|1...11⟩

into the following real-valued state of an (n + 1)-qubit system with an additional auxiliary qubit:

a_0...00|0...000⟩ + b_0...00|0...001⟩ + a_0...01|0...010⟩ + b_0...01|0...011⟩ + ... + a_1...11|1...110⟩ + b_1...11|1...111⟩.

This transformation from the original n-qubit state with complex coefficients to a real-valued (n + 1)-qubit state preserves the length of each vector and is, thus, generalized unitary. Since it is generalized unitary, we can implement any quantum algorithm with the correspondingly transformed states T s_1, T s_2, etc., i.e., we can indeed implement the original algorithm using states with real-valued amplitudes only.

Acknowledgments This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References 1. Feynman, R., Leighton, R., Sands, M.: The Feynman Lectures on Physics. Addison Wesley, Boston, Massachusetts (2005) 2. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge, Massachusetts (2016) 3. Nielsen, M.A., Chuang, I.L.: Quantum Computation and Quantum Information. Cambridge University Press, Cambridge, U.K. (2000) 4. Thorne, K.S., Blandford, R.D.: Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics. Princeton University Press, Princeton, New Jersey (2017)
On an Application of Lattice-Valued Integral Transform to Multicriteria Decision Making Michal Holˇcapek and Viec Bui Quoc
Abstract The paper is devoted to the application of the integral transform for lattice-valued functions, which is based on a Sugeno-like fuzzy integral, to multicriteria decision making. We present an integral transform defined within the space of functions whose values belong to a complete residuated lattice. We use this integral transform as an extended qualitative aggregation operator in multicriteria decision making to obtain the evaluation of alternatives for a decision-maker. The proposed approach is illustrated and compared with a common approach on a car selection problem.
1 Introduction

Multicriteria decision making (MCDM) is used in screening, prioritising, ranking, or selecting a set of alternatives under usually independent, incommensurate or conflicting criteria. An MCDM problem is usually characterized by the ratings of each alternative with respect to the criteria and by weights determining their significance [1–3, 8]. As an example, consider the problem of selecting a car, the aim of which is to buy a new car from a set of cars of different brands; this set is called the set of alternatives. To select the best car, it is necessary to determine suitable criteria (e.g., price, brand, design, safety, performance) together with their degrees of importance, according to which the decision will be made. The evaluation of the alternatives (i.e., cars) is provided by an aggregation of values expressing the degrees to which the criteria are satisfied, taking into account the weights of their importance in the decision making. More formally, let A = {a_1, ..., a_n} denote a set of alternatives, let C = {c_1, ..., c_m} denote a set of criteria, and let L be a linearly ordered scale. The satisfaction of criteria by alternatives can be described as a function r : A → L^C, where r(a_i)(c_j) expresses the degree to which the j-th criterion c_j is satisfied by the i-th
137
138
M. Holˇcapek and V. B. Quoc
alternative ai .1 The importance of criteria can be described as a function w : C → L, where the higher value of w(c j ) means the higher importance of criterion c j . The evaluation of alternatives is then a function u : A → L given as u(ai ) = h w (r (ai )(c1 ), . . . , r (ai )(cm )),
(1)
where h_w : L^m → L is an aggregation function respecting the importance of criteria expressed by the function w. The most popular aggregation function in practice is the weighted average (generally, OWA operators can be applied [17]), where we assume that ∑_{i=1}^{m} w(c_i) = 1. However, the linearly ordered scale L may not satisfy all the requirements for the application of the weighted average. This can occur when the standard arithmetic operations cannot, in principle, be used for the values of the scale L (e.g., the values of the scale are only linearly ordered labels such as low, medium, high or bad, good, excellent), or when they can but the weighted average provides wrong results. For example, let r_1 = r(a_1)(price) = 0.2, r_2 = r(a_2)(price) = 0.5 and r_3 = r(a_3)(price) = 0.8 denote the satisfactions of the criterion "price" by three alternatives (cars). Although r_2 − r_1 = r_3 − r_2, the real prices of the alternatives need not differ by the same amount, because different considerations apply when real prices are lower and when they are higher. This type of heterogeneity is quite common, especially when quantifying something that is not easily measurable (e.g., car design), and in this case the weighted average can lead to an incorrect evaluation of alternatives. In such cases, it is reasonable to use aggregation operators on bounded linearly ordered sets (or even bounded partially ordered sets or lattices), such as the weighted minimum or maximum proposed in [4], the weighted median in [18], or the linguistic OWA operator in [10]. The theory of such aggregation operators, often referred to as qualitative, with further examples, can be found in [9].

In this contribution, we are interested in MCDM where the alternatives are evaluated in a linearly ordered set endowed with additional operations (precisely, in a residuated lattice) and, moreover, the evaluation is not a single value for each alternative but a vector whose values determine the satisfaction of alternatives with respect to global criteria describing suitable features. This seems to be advantageous in a situation when it is difficult to specify the importance of criteria with respect to one global criterion formally expressing "to be the best alternative", whose satisfaction by alternatives corresponds to the evaluation of alternatives, and when it is easier to select global criteria related only to criteria from certain subgroups of all criteria, which allows us to simply determine the importance of criteria with respect to the related global criteria. (One can see that the evaluation of alternatives described as a function u : A → L can be identically expressed by a function u' : A → L^{{g}}, where g represents a global criterion "to be the best alternative", and the evaluation of alternatives can be equivalently described by the degrees to which the alternatives satisfy the global criterion g, i.e., u(a_i) = u'(a_i)(g) for i = 1, ..., n. The importance of criteria can be equivalently expressed as a function w' : C × {g} → L, where w'(c_j, g) determines the degree to which the criterion c_j is important in the evaluation of the global criterion g.) Subgroups of criteria may overlap, which means that one criterion can influence the evaluation of more than one global criterion, as demonstrated in Fig. 1. The evaluated alternatives can be used directly for a decision (by a comparison of vectors), or can serve as input values for another MCDM (e.g., when a hierarchical model is considered).
Fig. 1 Relationship between criteria from C = {c_j | j = 1, ..., 9} and global criteria from G = {g_k | k = 1, 2, 3}, where the displayed arrows indicate an existing importance that can be described by degrees of importance, and missing arrows indicate non-importance
Our aim is to introduce the evaluation of alternatives with respect to global criteria by a novel approach, which is based on the multiplication-based integral transform (integral transform, for short) of lattice-valued functions presented in [11], which uses a Sugeno-like fuzzy integral in its definition. More formally, if A denotes the set of alternatives, C the set of criteria and, in addition, G denotes the set of global criteria, then the evaluation of alternatives with respect to global criteria can be described as a function u : A → L^G defined through the following commutative diagram:

A →(r) L^C →(h) L^G,   u = h ∘ r,    (2)
where the function r expresses the satisfaction of criteria from C by alternatives from A, and h is an "extended" aggregation function, which will be introduced as the integral transform from the space L^C to the space L^G with a kernel function w : C × G → L, where w(c_j, g_k) determines the degree to which the criterion c_j is important in the evaluation of alternatives by the criterion g_k. If a criterion c_j is not important at all for a global criterion g_k, then w(c_j, g_k) is equal to the bottom element of L. Note that, for G = {g}, the extended aggregation function h defined by the integral transform involves the weighted maximum mentioned above. (Similarly, the weighted minimum can be obtained as a special case of the residuum-based integral transform proposed in [12].) Of course, the extended aggregation function can also be obtained in other ways, but the integral transform provides a consistent way of evaluating alternatives with the possibility of changing parameters.

The paper is structured as follows. In the next section, we briefly recall the definition of complete residuated lattices, the basic concepts of fuzzy set theory and the theory of fuzzy measure spaces, and the definitions of the multiplication-based fuzzy integral and the integral transform for residuated lattice-valued functions. The MCDM
based on the integral transform is introduced in the third section, and an illustrative example on a car selection problem is presented in the fourth section. The last section is a conclusion.
2 Preliminaries

2.1 Algebra of Truth Values

In this paper, we assume that the scale is a complete linearly ordered residuated lattice, i.e., an algebra L = ⟨L, ∧, ∨, ⊗, →, 0, 1⟩ with four binary operations and two constants such that ⟨L, ∧, ∨, 0, 1⟩ is a complete linearly ordered lattice, where 0 denotes the least element and 1 denotes the greatest element of L, ⟨L, ⊗, 1⟩ is a commutative monoid (i.e., ⊗ is associative, commutative and the identity a ⊗ 1 = a holds for any a ∈ L), and the adjointness property is satisfied, i.e.,

a ⊗ b ≤ c  iff  a ≤ b → c    (3)

holds for each a, b, c ∈ L, where ≤ denotes the corresponding lattice ordering. The operations ⊗ and → are called the multiplication and the residuum. For the sake of simplicity, we will omit "complete linearly ordered" in "complete linearly ordered residuated lattice" and write only a residuated lattice.

Example 1 It is well known that the algebra L = ⟨[0, 1], min, max, T, →_T, 0, 1⟩, where T is a left-continuous t-norm [13] and a →_T b = ⋁{c ∈ [0, 1] | T(a, c) ≤ b} defines the residuum, is a complete linearly ordered residuated lattice. Recall the fundamental continuous t-norms, namely the minimum, product and Łukasiewicz t-norms:

T_M(a, b) = a ∧ b,   T_P(a, b) = a · b,   T_Ł(a, b) = max(0, a + b − 1).
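The residuated lattices on [0, 1] used throughout the chapter are determined by a left-continuous t-norm and its residuum. A minimal Python sketch (not part of the chapter) of the three fundamental t-norms and their residua, together with a check of the adjointness property (3):

```python
# The residua below are the standard companions of the three fundamental t-norms.
def t_min(a, b):        # minimum (Goedel) t-norm
    return min(a, b)

def t_prod(a, b):       # product t-norm
    return a * b

def t_luk(a, b):        # Lukasiewicz t-norm
    return max(0.0, a + b - 1.0)

def res_min(a, b):      # residuum of the minimum t-norm
    return 1.0 if a <= b else b

def res_prod(a, b):     # residuum of the product t-norm
    return 1.0 if a <= b else b / a

def res_luk(a, b):      # residuum of the Lukasiewicz t-norm
    return min(1.0, 1.0 - a + b)

# Adjointness check: T(a, c) <= b  iff  c <= (a ->_T b).
a, b, c = 0.7, 0.4, 0.5
assert (t_luk(a, c) <= b) == (c <= res_luk(a, b))
```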
2.2 Fuzzy Sets

Let L be a residuated lattice, and let X be a non-empty universe of discourse. A function A : X → L is called a fuzzy set (L-fuzzy set) on X. A value A(x) is called the membership degree of x in the fuzzy set A. The set of all fuzzy sets on X is denoted by L^X. Obviously, fuzzy sets are nothing else than lattice-valued functions. A fuzzy set
A on X is called crisp if A(x) ∈ {0, 1} for any x ∈ X. Obviously, a crisp fuzzy set can be uniquely identified with a subset of X. The symbol ∅ denotes the empty fuzzy set on X, i.e., ∅(x) = 0 for any x ∈ X. The set of all crisp fuzzy sets on X (i.e., the power set of X) is denoted by 2^X. A constant fuzzy set A on X (denoted as a_X) satisfies A(x) = a for any x ∈ X, where a ∈ L. The sets Supp(A) = {x | x ∈ X & A(x) > 0} and Core(A) = {x | x ∈ X & A(x) = 1} are called the support and the core of a fuzzy set A, respectively. A fuzzy set A is called normal if Core(A) ≠ ∅.

Let A, B be fuzzy sets on X. The extension of the operations ∧, ∨, ⊗ and → on L to operations on L^X is given by

(A ⋆ B)(x) = A(x) ⋆ B(x)    (4)

for any x ∈ X, where ⋆ ∈ {∧, ∨, ⊗, →}. Let X, Y be non-empty universes. A fuzzy set w ∈ L^{X×Y} is called a (binary) fuzzy relation. We say that a fuzzy relation w is normal in the second coordinate (and similarly in the first coordinate) whenever Core(w_y) ≠ ∅ for any y ∈ Y, where w_y : X → L is defined as w_y(x) = w(x, y) for any x ∈ X. A relaxation of normality in the second coordinate is a fuzzy relation semi-normal in the second coordinate, defined by Supp(w_y) ≠ ∅ for any y ∈ Y.
2.3 Fuzzy Measure Spaces

In this paper, we consider algebras of sets defined as follows; a generalization can be found in [7].

Definition 1 Let X be a non-empty set. A subset F of 2^X is an algebra of sets on X provided that
(A1) ∅, X ∈ F,
(A2) if A ∈ F, then X \ A ∈ F,
(A3) if A, B ∈ F, then A ∪ B ∈ F.

A pair (X, F) is called a measurable space (on X) if F is an algebra of sets on X. Let (X, F) be a measurable space and A ⊆ X. We say that A is F-measurable if A ∈ F. We now present some examples of algebras of sets on a non-empty set X. The sets {∅, X} and 2^X are trivial algebras of sets on X. Since the intersection of algebras is again an algebra, the smallest algebra on X containing a subset G of 2^X always exists and is unique. In this way, one can introduce various algebras of sets, such as the algebra (σ-algebra) of Borel sets [15].

Definition 2 Let (X, F) be a measurable space. A function μ : F → L is called a fuzzy measure on (X, F) if
(i) μ(∅) = 0 and μ(X) = 1,
(ii) if A, B ∈ F such that A ⊆ B, then μ(A) ≤ μ(B).
A triplet (X, F, μ) is called a fuzzy measure space if (X, F) is a measurable space and μ is a fuzzy measure on (X, F). Let us consider here only one simple example of a fuzzy measure space; further examples can be found in [7].

Example 2 Let L be a residuated lattice on [0, 1] from Example 1, let X be a finite non-empty set, and let F be an algebra of sets on X. Then the triplet (X, F, μ_r), where

μ_r(A) = |A| / |X|,   A ∈ F,

and |A| and |X| denote the cardinality of A and X, respectively, is a fuzzy measure space. The fuzzy measure μ_r is called the relative cardinality. Let λ : L → L be a monotonically non-decreasing function such that λ(0) = 0 and λ(1) = 1. Then the function μ_{r,λ} : F → L given by μ_{r,λ}(A) = λ(μ_r(A)) for any A ∈ F is again a fuzzy measure on (X, F).
2.4 Multiplication-Based Fuzzy Integral

The Sugeno (fuzzy) integral (see [16]) belongs among the most important qualitative aggregation functions. For functions whose values belong to a residuated lattice, the following fuzzy integral, which is based on the multiplication ⊗, was introduced in [5–7].

Definition 3 Let (X, F, μ) be a fuzzy measure space, and let f : X → L. The ⊗-fuzzy integral of f on X is given by

∫^⊗ f dμ = ⋁_{A ∈ F} ( μ(A) ⊗ ⋀_{x ∈ A} f(x) ).    (5)

If (X, 2^X, μ) is a fuzzy measure space with a finite set X and the fuzzy measure is symmetric, i.e., μ(A) = μ(B) if and only if |A| = |B| for any A, B ∈ 2^X, then one can use the following simple formula for the computation of the ⊗-fuzzy integral.

Theorem 1 Let (X, 2^X, μ) be a fuzzy measure space such that X = {x_1, ..., x_n} and μ is symmetric. Then

∫^⊗ f dμ = ⋁_{i=1}^{n} ( f(x_{π(i)}) ⊗ μ({x_1, ..., x_i}) ),    (6)

where π is a permutation on X such that f(x_{π(1)}) ≥ f(x_{π(2)}) ≥ · · · ≥ f(x_{π(n)}).
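For a finite X and a symmetric fuzzy measure, formula (6) reduces the ⊗-fuzzy integral to a single pass over the sorted values of f. A minimal Python sketch (not the authors' code; the function names and the toy data are illustrative):

```python
def otimes_fuzzy_integral(values, mu_of_size, otimes):
    """Formula (6): values are the f(x) for x in X; mu_of_size(i) is the measure
    of any i-element subset (well defined because the measure is symmetric)."""
    ordered = sorted(values, reverse=True)   # f(x_pi(1)) >= f(x_pi(2)) >= ...
    return max(otimes(v, mu_of_size(i)) for i, v in enumerate(ordered, start=1))

# Example with the minimum t-norm and the relative cardinality measure mu(A) = |A|/|X|.
f = [0.9, 0.2, 0.6, 0.4]
n = len(f)
result = otimes_fuzzy_integral(f, lambda i: i / n, min)
print(result)  # 0.5 = max(min(0.9, 0.25), min(0.6, 0.5), min(0.4, 0.75), min(0.2, 1.0))
```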
2.5 Integral Transform

We consider the multiplication-based integral transform introduced in [11], which can be seen as a generalization of the upper fuzzy transform introduced in [14].

Definition 4 Let (X, F, μ) be a fuzzy measure space, and let w : X × Y → L be a fuzzy relation that is semi-normal in the second component. The function F^⊗_{(w,μ)} : L^X → L^Y defined by

F^⊗_{(w,μ)}(f)(y) = ∫^⊗ w(x, y) ⊗ f(x) dμ    (7)

is called a (w, μ, ⊗)-integral transform. The fuzzy relation w is called the integral kernel.

The following theorem presents some basic properties of the integral transform (see [11]).

Theorem 2 For any f, g ∈ 2^X and a ∈ L, we have
(i) F^⊗_{(w,μ)}(f) ≤ F^⊗_{(w,μ)}(g) if f ≤ g;
(ii) F^⊗_{(w,μ)}(f ∧ g) ≤ F^⊗_{(w,μ)}(f) ∧ F^⊗_{(w,μ)}(g);
(iii) F^⊗_{(w,μ)}(f) ∨ F^⊗_{(w,μ)}(g) ≤ F^⊗_{(w,μ)}(f ∨ g);
(iv) a ⊗ F^⊗_{(w,μ)}(f) ≤ F^⊗_{(w,μ)}(a ⊗ f);
(v) F^⊗_{(w,μ)}(a → f) ≤ a → F^⊗_{(w,μ)}(f).
Note that if μ(Core(w_y)) = 1 for any y ∈ Y, where w_y(x) = w(x, y) for x ∈ X and y ∈ Y, then the (w, μ, ⊗)-integral transform preserves constant functions.
3 MCDM Based on the Integral Transform

Let L be a complete linearly ordered residuated lattice serving as the scale for the evaluation of alternatives. Let A = {a_1, ..., a_n} denote a set of alternatives, let C = {c_1, ..., c_m} denote a set of criteria characterizing a decision situation, and let G = {g_1, ..., g_ℓ} denote a set of global criteria. Finally, let (C, F, μ) be a fuzzy measure space over the set of criteria C. According to (2), the evaluation of alternatives u : A → L^G is determined by a (w, μ, ⊗)-integral transform as follows:

u(a_i)(g_k) := F^⊗_{(w,μ)}(r(a_i))(g_k) = ∫^⊗ w(c_j, g_k) ⊗ r(a_i)(c_j) dμ,   g_k ∈ G,    (8)

where the kernel function w : C × G → L determines the importance of the criteria from C in the evaluation of alternatives with respect to the global criteria from G, assuming that w is semi-normal in the second component, i.e., for any g_k ∈ G, there exists at least one c_j ∈ C such that w(c_j, g_k) > 0.

We should note that setting the kernel function w is hard work even for an experienced expert, because its values significantly influence the decision. Following
the assumptions on the weighted maximum proposed by Dubois and Prade in [4], one could even assume that w is normal in the second component, which means that each function w_{g_k} is a possibility distribution (i.e., max_{c_j ∈ C} w_{g_k}(c_j) = 1), and, in the case of the lower fuzzy transform proposed by Perfilieva in [14], one could go even further and assume that the sets Core(w_{g_1}), ..., Core(w_{g_ℓ}) form a partition of C. But w is not the only parameter of our approach. Other parameters are the fuzzy measure space and the choice of the residuated lattice, especially the multiplication operation. For example, if the measurable space is F = 2^C and the fuzzy measure is defined as μ(X) = 1 for any X ∈ 2^C \ {∅} and μ(∅) = 0, the evaluation of alternatives can be expressed as

u^WM_⊗(a_i)(g_k) := ⋁_{c_j ∈ C} w(c_j, g_k) ⊗ r(a_i)(c_j),   a_i ∈ A, g_k ∈ G,    (9)

which can be seen as a ⊗-weighted maximum generalizing the weighted maximum obtained with ⊗ = ∧, i.e., u^WM_∧. It is easy to see that, for any fuzzy measure μ on a measurable space (C, F), the evaluation of alternatives u given by (8) cannot be higher than the evaluation u^WM_⊗ given by the ⊗-weighted maximum, i.e., u(a_i)(g_k) ≤ u^WM_⊗(a_i)(g_k) for any a_i ∈ A and g_k ∈ G.
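Under the assumption of a symmetric fuzzy measure (so that Theorem 1 applies), the evaluation (8) and the ⊗-weighted maximum (9) can be sketched in a few lines of Python. This is not the authors' code; the function names and the toy data are illustrative.

```python
def evaluate(r_ai, kernel_gk, mu_of_size, otimes):
    """u(a_i)(g_k): integral transform of the satisfaction function r(a_i) with the
    kernel column w(., g_k), assuming a symmetric fuzzy measure (Theorem 1)."""
    weighted = sorted((otimes(kernel_gk[c], r_ai[c]) for c in r_ai), reverse=True)
    return max(otimes(v, mu_of_size(i)) for i, v in enumerate(weighted, start=1))

def weighted_maximum(r_ai, kernel_gk, otimes):
    """u^WM_otimes(a_i)(g_k): the special case with mu(X) = 1 for all non-empty X."""
    return max(otimes(kernel_gk[c], r_ai[c]) for c in r_ai)

# Toy data: three criteria, one global criterion, minimum t-norm as multiplication.
r_a1 = {"price": 0.2, "style": 0.7, "safety": 0.9}
w_g1 = {"price": 1.0, "style": 0.6, "safety": 0.0}
mu = lambda i: min(1.0, i / 2)            # symmetric measure mu(A) = min(1, |A|/2)
print(evaluate(r_a1, w_g1, mu, min))       # 0.5
print(weighted_maximum(r_a1, w_g1, min))   # 0.6
```

The printed values also illustrate the inequality u(a_i)(g_k) ≤ u^WM_⊗(a_i)(g_k) noted above.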
4 Illustrative Example

We consider a car selection problem, that is, we would like to buy a new car from some famous car brands. The MCDM problem is to select an appropriate car from the following four alternatives: Toyota Wigo, Hyundai Grand i10, Honda City and Nissan Terra, i.e., we consider A = {Wigo, Grand i10, City, Terra} as the set of alternatives. Our global criteria for the evaluation of alternatives, which form the set G, are Looks, Safety and Performance. To determine the evaluation of alternatives with respect to the global criteria, we consider ten criteria, namely Price, Logo, Year of Manufacture, Top Speed, Fuel Consumption, Style, Insurance Quote, Boot Space, Warranty and Equipment, which form the set C. For a comparison of the evaluation of alternatives based on the integral transform with other evaluations based on quantitative and qualitative aggregations, we consider the residuated lattices L defined by left-continuous t-norms on [0, 1] (see Example 1).

The satisfaction of criteria by alternatives (i.e., r(a_i, c_j)) is displayed in Table 1. One can see that, for example, r(Wigo, Price) = 0.2 < 0.5 = r(Grand i10, Price), which corresponds to the higher price of the Toyota Wigo compared with the Hyundai Grand i10; a lower price naturally increases the satisfaction of the criterion Price. For the purpose of the evaluation of alternatives, we consider the integral kernel w : C × G → [0, 1] whose values are displayed in Table 2.
Table 1 Satisfactions of the criteria by the alternatives

Criteria          Wigo    Grand i10    City    Terra
Price             0.2     0.5          0.7     0.4
Logo              0.9     0.7          0.6     0.8
Year              0.6     0.2          0.8     0.4
Top.sp/mph        0.6     0.8          0.2     0.4
Fuel.co/mpg       0.9     0.5          0.4     0.7
Style             0.7     0.9          0.6     0.8
Insurance.qu      0.1     0.7          0.6     0.9
Boot.sp/litres    0.2     0.5          0.7     0.3
Warranty          0.3     0.5          0.2     0.8
Equipment         0.8     0.7          0.9     0.6
Table 2 Integral kernel determining the importance of criteria for the evaluation of alternatives with respect to global criteria

Criteria          Looks   Safety   Performance
Price             1       0        0
Logo              0.2     0.3      0.5
Year              1       0        0
Top.sp/mph        0       0        1
Fuel.co/mpg       0       0        1
Style             0.6     0.4      0
Insurance.qu      0.5     0.5      0
Boot.sp/litres    0       0        1
Warranty          0.6     0        0.4
Equipment         0.7     1        0
It can be seen that the set Supp(w_Looks) consists of seven criteria from C, namely Price, Logo, Year, Style, Insurance quote, Warranty and Equipment, which are important to a certain non-zero degree for the evaluation of alternatives with respect to the global criterion (car feature) Looks. Similarly, the sets Supp(w_Safety) and Supp(w_Performance) consist of four and five criteria from C, respectively. One can also see that the functions w_{g_k}, g_k ∈ G, have non-empty cores; hence, these functions are possibility distributions (cf. [4]), which seems to be a reasonable requirement reflecting the fact that there is at least one fully important criterion for each global criterion.

To ensure that the evaluation of alternatives is "fair" and respects only the important criteria, we define a fuzzy measure μ on (C, 2^C) as follows:

μ(X) = 1 if |X| ≥ 4, and μ(X) = |X|/4 otherwise,

for any X ∈ 2^C. The "fair" evaluation is reflected in the fact that the fuzzy measure μ is symmetric (symmetric fuzzy measures were introduced in Subsect. 2.4). Moreover, we set μ(X) = 1 for |X| ≥ 4, which is motivated by the numbers of criteria in the supports of the functions w_{g_k}, g_k ∈ G. More specifically, we use the minimum number 4 to allow the maximum evaluation of all alternatives to be equal to 1, ideally when the satisfaction of criteria is equal to 1 for all alternatives and the non-zero degrees of importance in Table 2 are changed to 1. This setting of the fuzzy measure does not influence the evaluation of alternatives in the above-mentioned ideal case, regardless of which left-continuous t-norm is taken as the multiplication on the residuated lattice, which seems to be a reasonable requirement. A stronger requirement defined analogously could be introduced using the number of elements in the cores of the functions w_{g_k}, g_k ∈ G, which would guarantee the preservation of constant satisfactions of criteria by alternatives (see the comment on the integral transform at the end of Subsect. 2.5).

In Table 3, we present the evaluations of alternatives u_T with respect to the global criteria for the fundamental continuous t-norms, namely the minimum, product and Łukasiewicz t-norms. To compare the proposed approach based on the integral transform with representatives of the standard (quantitative and qualitative) approaches, we show in the same table the evaluation of alternatives using the weighted average, given as

u^WA(a_i)(g_k) := ( ∑_{j=1}^{10} w(c_j, g_k) · r(a_i)(c_j) ) / ( ∑_{j=1}^{10} w(c_j, g_k) ),   a_i ∈ A, g_k ∈ G,    (10)

representing the quantitative approach, and the weighted maximum u^WM_∧ representing the qualitative approach, although the latter can be obtained as a special case of the integral transform.

To select the best car, we aggregate the values of the vectors evaluating alternatives with respect to the global criteria in Table 3 into one value using the weighted average with respect to the weights w(Looks) = 0.35, w(Safety) = 0.4 and w(Performance) = 0.25 (with a total sum equal to 1), expressing their importance for our selection of the best car. The results are displayed in Table 4. To compare the resulting evaluations of cars, we determine the orders of cars that correspond to the orders of their evaluations presented in Table 3; e.g., we get Honda City, Nissan Terra, Hyundai Grand i10, Toyota Wigo for the evaluation u_TM, where Honda City has the highest evaluation 0.57, while Toyota Wigo has the lowest evaluation 0.495. Surprisingly, no two evaluations result in the same order of cars, but three of the five evaluations indicate Toyota Wigo as the car with the worst evaluation. Clearly, the candidates for the best car are Honda City and Nissan Terra, with the two highest evaluations. It is probably impossible to say which evaluation of alternatives is right or even the best in this illustrative example, since each uses a different type of aggregation.
Table 3 Evaluation of alternatives with respect to global criteria determined by the integral transforms, the weighted average and the weighted maximum

Evaluation   Global criteria   Wigo     Grand i10   City     Terra
u_TM         Looks             0.6      0.5         0.7      0.6
             Safety            0.4      0.5         0.5      0.5
             Performance       0.5      0.5         0.5      0.5
u_TP         Looks             0.315    0.3675      0.4725   0.42
             Safety            0.2025   0.2625      0.225    0.24
             Performance       0.3375   0.375       0.225    0.32
u_TŁ         Looks             0.2      0.2         0.35     0.4
             Safety            0.05     0           0.15     0.1
             Performance       0.2      0.25        0.1      0.3
u_WA         Looks             0.476    0.547       0.658    0.606
             Safety            0.636    0.736       0.736    0.731
             Performance       0.582    0.602       0.43     0.543
u_WM∧        Looks             0.7      0.7         0.8      0.6
             Safety            0.8      0.7         0.9      0.6
             Performance       0.9      0.8         0.7      0.7
Table 4 Aggregation of the various evaluations of alternatives to order the alternatives

Evaluation   Wigo    Grand i10   City    Terra
u_TM         0.495   0.5         0.57    0.535
u_TP         0.275   0.327       0.311   0.323
u_TŁ         0.14    0.132       0.207   0.255
u_WA         0.566   0.636       0.632   0.64
u_WM∧        0.79    0.725       0.815   0.625
However, summing the ranking numbers of the cars over all evaluations to get an overall ranking number (a car gets the ranking number n if it stands in the n-th position in an order of cars; the ranking number 1 (4) indicates the best (worst) car with respect to the considered evaluation), we can conclude that Honda City and Nissan Terra occupy the first and second place with the overall ranking number 10 (e.g., 10 = 1 + 3 + 2 + 3 + 1 for Honda City). Hyundai Grand i10 takes the third place with the overall ranking number 13, and Toyota Wigo takes the last place with the overall ranking number 17. If we remove the weighted maximum evaluation from the summation of ranking numbers, since it depends on only one maximum value, the best car is Nissan Terra with the overall ranking number 6 = 2 + 2 + 1 + 1.
5 Conclusion

In this paper, we presented the multiplication-based integral transform of residuated lattice-valued functions and showed how this type of integral transform can be used for the evaluation of alternatives in multicriteria decision making when the considered scale has only a lattice structure enriched with further operations. We demonstrated our approach on the decision-making problem of car selection, where we compared the results obtained by the integral transform with the results of the evaluation of alternatives based on the weighted average and the weighted maximum. Since the integral transform can be seen as an "extended" qualitative aggregation function, our approach provides a tool for qualitative evaluations, as opposed to quantitative evaluations based on the weighted average or, more generally, OWA operators. We believe that our approach based on the integral transform, offering parameters such as the integral kernel, the fuzzy measure space, or the multiplication operation on a residuated lattice, whose combinations lead to many types of qualitative evaluations, could attract decision-makers as an effective alternative to quantitative approaches.
References

1. Arfi, B.: Fuzzy decision making in politics: a linguistic fuzzy set approach (LFSA). Polit. Anal. 3, 23–56 (2005)
2. Bellman, E., Zadeh, A.: Decision-making in a fuzzy environment. Manage. Sci. 17, 141–164 (1970)
3. Belton, V., Steward, J.: Multiple Criteria Decision Analysis: An Integrated Approach. Kluwer Academic Publishers, Boston/Dordrecht/London (2002)
4. Dubois, D., Prade, H.: Weighted minimum and maximum operations in fuzzy set theory. Inf. Sci. 39(2), 205–210 (1986)
5. Dubois, D., Prade, H., Rico, A.: Residuated variants of Sugeno integrals: towards new weighting schemes for qualitative aggregation methods. Inf. Sci. 329, 765–781 (2016)
6. Dvořák, A., Holčapek, M.: L-fuzzy quantifiers of type ⟨1⟩ determined by fuzzy measures. Fuzzy Sets Syst. 160(23), 3425–3452 (2009)
7. Dvořák, A., Holčapek, M.: Fuzzy measures and integrals defined on algebras of fuzzy subsets over complete residuated lattices. Inf. Sci. 185(1), 205–229 (2012)
8. Fenton, N., Wang, W.: Risk and confidence analysis for fuzzy multicriteria decision making. Knowl.-Based Syst. 19, 430–437 (2006)
9. Gagolewski, M.: Data Fusion: Theory, Methods, and Applications. Institute of Computer Science, Polish Academy of Sciences, Warsaw (2015)
10. Herrera, F., Herrera-Viedma, E., Verdegay, J.: Direct approach processes in group decision making using linguistic OWA operators. Fuzzy Sets Syst. 79, 175–190 (1996)
11. Holčapek, M., Bui, V.: Integral transforms on spaces of complete residuated lattice valued functions. In: Proceedings of the IEEE World Congress on Computational Intelligence (WCCI) 2020, pp. 1–8. IEEE (2020)
12. Holčapek, M., Bui, V.: On integral transforms for residuated lattice-valued functions. In: Proceedings of Information Processing and Management of Uncertainty (IPMU) 2020, pp. 1–14. IPMU (2020)
13. Klement, E., Mesiar, R., Pap, E.: Triangular Norms. Trends in Logic, vol. 8. Kluwer Academic Publishers, Dordrecht (2000)
14. Perfilieva, I.: Fuzzy transforms: theory and applications. Fuzzy Sets Syst. 157(8), 993–1023 (2006)
15. Srivastava, S.: A Course on Borel Sets. Springer (1998)
16. Sugeno, M.: Theory of Fuzzy Integrals and its Applications. Ph.D. thesis, Tokyo Institute of Technology (1974)
17. Yager, R.: On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Trans. Syst. Man Cybern. 18(1), 183–190 (1988)
18. Yager, R.: Fusion of ordinal information using weighted median aggregation. Int. J. Approximate Reasoning 18, 35–52 (1998)
Fine-Grained Network Traffic Classification Using Machine Learning: Evaluation and Comparison Tuan Linh Dang and Van Chuong Do
Abstract The network traffic classification problem is divided into coarse-grained traffic classification and fine-grained traffic classification. Previous researchers have successfully applied machine learning techniques to solve the coarse-grained traffic classification problem with very high accuracy. However, there are few studies on the fine-grained traffic classification problem because of a lack of appropriately labeled data for the application flows. This paper proposes a data collection method and investigates various unsupervised and supervised learning techniques on our collected data to solve the fine-grained traffic classification problem. Experimental results showed that the decision tree and random forest achieved the highest accuracy, at 96%. The decision tree also had the lowest prediction time, which makes it well-suited for implementation in real-time fine-grained traffic classification applications.
1 Introduction

Traffic classification has become a crucial problem in computer networks. Real-time traffic classification may help network service providers to overcome severe network issues. Traffic classification is a core component of the network administration domain, intrusion detection systems, traffic scheduling, quality of service, and lawful interception. Traditionally, the traffic classification problem is divided into two issues that correspond to two traffic classification levels:

• Coarse-grained traffic classification: constructing a classifier capable of classifying application-layer protocols, or a classifier capable of identifying traffic at a high level of generality by a feature such as email, game, chat, or web [1].
• Fine-grained traffic classification: constructing a classifier capable of identifying which network traffic belongs to specific applications such as Facebook, YouTube, or Dropbox [1].

For many years, researchers have focused on solving the coarse-grained traffic classification problem [2–8]. Machine learning techniques are the most suitable approaches to solving coarse-grained traffic classification problems. However, there are few studies on the fine-grained traffic classification problem because of a lack of appropriately labeled data for the application flows [9].

The main contribution of this paper is to propose an approach to build a dataset for the fine-grained traffic classification problem. In addition, this dataset is further investigated with different machine learning algorithms to find the most suitable algorithm for solving fine-grained traffic classification problems. The investigated algorithms are autoclass, decision tree, random forest, fully connected neural network, and 1D convolutional neural network.

This paper is structured as follows. Section 2 describes related work. Section 3 presents the methods for data collection and for solving the fine-grained traffic classification issue. Section 4 presents the results. Section 5 concludes our paper.
2 Related Work

2.1 Fine-Grained Network Traffic Classification

There are different approaches to solving the fine-grained traffic classification problem, such as those based on statistics or on behavioral analysis [10]. Previous authors proposed a behavioral classification engine to deal with the fine-grained traffic classification problem; the data used are NetFlow records (for example, counts of received packets and bytes), and SVM techniques were used to develop the classifier [11]. A previous study proposed a framework called Atlas, used in Software-Defined Networking (SDN), to identify mobile applications. A prototype of Atlas using a decision tree was shown to be capable of classifying the top 40 applications on Google Play. In this article, the authors also argue that the lack of labeled data hinders solving the fine-grained traffic classification problem [9]. Another study deployed an OpenFlow-based SDN in an enterprise network to collect data. These data were used to train machine learning models: random forests, stochastic gradient boosting, and extreme gradient boosting. The resulting models can classify seven applications (YouTube, LinkedIn, Skype, Vimeo, BitTorrent, Dropbox, Facebook) and the HTTP protocol [12].

Previous studies have mainly worked on NetFlow data or SDN data. Those approaches are relatively complex and depend on the network monitoring tool.
2.2 Machine Learning Algorithms

The machine learning techniques considered for solving the fine-grained traffic classification problem are as follows.

• Unsupervised learning (clustering): autoclass [13].
  – Autoclass can solve the coarse-grained traffic classification problem with high accuracy [14]. In a comparison of three machine learning techniques (autoclass, K-means, DBSCAN), autoclass obtained the best classification results [15]. Thus, autoclass is the first algorithm we investigate for solving the fine-grained traffic classification problem.
  – Autoclass is an unsupervised machine learning technique, which is also a soft clustering and probability-based learning technique [13]. Autoclass uses the EM algorithm to find the best clusters in the training data set [16]. Autoclass repeats the EM algorithm from random points in the parameter space to find the best model. If the number of clusters in the data set is known, autoclass can be pre-configured accordingly; if it is unknown, autoclass can estimate the number of clusters in the data set by itself.
• Supervised learning: decision tree, random forest, fully connected neural network, 1D convolutional neural network [17–19].
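Autoclass itself is a stand-alone tool (accessed in our experiments through the autoclasswrapper library mentioned in Sect. 3). Purely as an illustration of EM-based soft clustering, and not as a substitute for autoclass, the following sketch uses scikit-learn's GaussianMixture, which also fits a mixture model with the EM algorithm; the toy data are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Toy "flow features": two well-separated groups of 3-dimensional points.
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(5, 1, (100, 3))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
hard_labels = gmm.predict(X)        # one cluster index per flow
soft_labels = gmm.predict_proba(X)  # membership probabilities (soft clustering)
```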
2.3 Evaluation Metrics

2.3.1 Evaluation Metrics for the Clustering Algorithm

Equation (1) shows the homogeneity measure H used to investigate the performance of autoclass [14]:

H(c) = max{count(a, c) | a ∈ A} / ∑_{a ∈ A} count(a, c),    (1)

where A is the set of applications and CU is the set of clusters found during learning, a is an application of the application set A, and c is a cluster of the cluster set CU; mathematically, a ∈ A and c ∈ CU. The function count(a, c) counts the number of flows of application a that belong to cluster c [14]. The homogeneity H(c) of a cluster c is defined as the largest fraction of flows of one application in the cluster. The overall homogeneity H is the mean of the cluster homogeneities, as shown in Eq. (2):

H = ∑_{c ∈ CU} H(c) / |CU|,   0 ≤ H ≤ 1,    (2)

where |CU| is the total number of clusters.
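A minimal sketch (not the authors' code) of how the homogeneity measures (1) and (2) can be computed once each flow in a cluster is labeled with its application:

```python
from collections import Counter

def cluster_homogeneity(app_labels_in_cluster):
    """H(c): the largest fraction of flows of one application in the cluster."""
    counts = Counter(app_labels_in_cluster)
    return max(counts.values()) / sum(counts.values())

def overall_homogeneity(clusters):
    """H: mean of the cluster homogeneities; clusters is a list of label lists."""
    return sum(cluster_homogeneity(c) for c in clusters) / len(clusters)

# Example: two clusters of flows labeled by application.
clusters = [["youtube", "youtube", "skype"], ["gmail", "gmail", "gmail", "dropbox"]]
print(overall_homogeneity(clusters))  # (2/3 + 3/4) / 2 = 0.708...
```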
Fig. 1 Fine-grained network traffic classification problem using machine learning techniques
2.3.2 Evaluation Metrics for the Supervised Learning Algorithms

Equations (3) to (6) define balanced accuracy, micro-precision, micro-recall, and micro-F1-score, where TP is true positive, FN is false negative, FP is false positive, and TN is true negative. Balanced accuracy avoids inflated performance estimates on imbalanced datasets:

Balanced-accuracy = (1/2) · ( TP / (TP + FN) + TN / (TN + FP) ).    (3)

The overall precision, recall, and F1-score are micro-averaged metrics:

Micro-Precision = ∑_{i=1}^{|CA|} TP_i / ∑_{i=1}^{|CA|} (TP_i + FP_i),    (4)

Micro-Recall = ∑_{i=1}^{|CA|} TP_i / ∑_{i=1}^{|CA|} (TP_i + FN_i),    (5)

Micro-F1-Score = 2 × (Micro-Precision × Micro-Recall) / (Micro-Precision + Micro-Recall),    (6)

where CA is the set of classes and |CA| is the total number of classes.
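These metrics are available in scikit-learn; a minimal sketch of the assumed usage (the toy labels are illustrative, not from our datasets):

```python
from sklearn.metrics import (balanced_accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = ["youtube", "skype", "gmail", "youtube", "dropbox", "gmail"]
y_pred = ["youtube", "gmail", "gmail", "youtube", "dropbox", "skype"]

print(balanced_accuracy_score(y_true, y_pred))            # balanced accuracy
print(precision_score(y_true, y_pred, average="micro"))   # micro-precision
print(recall_score(y_true, y_pred, average="micro"))      # micro-recall
print(f1_score(y_true, y_pred, average="micro"))          # micro-F1-score
```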
3 Method

The model for solving the fine-grained network traffic classification problem using machine learning techniques is shown in Fig. 1. There are four main blocks, called "data collection", "data preprocessing", "training and evaluation", and "classifier".
Data Collection: There are two ways to collect the data:

• use a publicly recognized dataset, such as the Kaggle dataset;
• build a self-built dataset. In this case, data collection is divided into two key steps:
  – Collection of network traffic: we use network capture tools (e.g., Wireshark, Tcpdump) to collect network traffic [20]. The outputs of this step are files that contain the traffic of the applications.
  – Flow statistics and flow-characteristic statistics: from the collected files of network traffic, the CICFlowMeter tool is used to compute the bidirectional flows and the characteristics of the bidirectional flows [21]. CICFlowMeter can extract over 80 bi-directional flow features, such as the number of flow packets, the number of flow bytes, and the packet length. Details of bi-directional flows can be found in a previous study [2].

The data collection step in the "self-built dataset" approach can be described as follows.

1. Choose an application for which data need to be collected.
2. Turn on only the one program for which we want to collect data.
3. Use Wireshark to capture the application packets. Turn off promiscuous mode while capturing.
4. Use CICFlowMeter to compute the bidirectional flows with their features. The flow timeout is 600 ms.
5. Assign a label to the resulting data set.
6. Go back to step 1 and work with another application.

The method for the data collection and the obtained data set are our contributions.

Data Preprocessing: this block performs data cleaning, feature selection, data dimension reduction, and data normalization. The Pandas, NumPy, and Scikit-learn tools were used to preprocess the data [22–24]. The details are as follows.

• Eliminate records that are missing attribute values or that contain non-numeric, infinitely negative, or infinitely positive values. Discard records whose transport-layer protocol attribute is neither TCP nor UDP.
• Eliminate redundant attributes, including flow ID, source IP address, destination IP address, source port, destination port, and timestamp.
• Transform the protocol field using one-hot encoding.
• Transform the dataset so that the resulting distribution of each feature has a mean of 0 and a standard deviation of 1.

Training and Evaluation: we train the machine learning models with the preprocessed data to find the algorithm best suited to the problem. We use the Scikit-learn and autoclasswrapper libraries in Python to conduct the experiments [25]. During model training, we tweak the parameters of the model to get the best results. The configuration of the machine learning algorithms used in the experiments is as follows (a minimal code sketch is given at the end of this section).
• The decision tree algorithm used is CART with the Gini index [26].
• The random forest algorithm has 100 decision trees with the Gini index.
• For the Kaggle data set:
  – The fully connected neural network has four layers:
    • the number of nodes in each layer is 79-40-20-9;
    • the first three layers use the ReLU activation function;
    • the output layer has a Softmax activation function;
    • batch-normalization layers are alternated between the layers.
  – The 1D-CNN has six layers:
    • the first two layers are 1D convolutional layers with 3 × 3 filters, stride = 1 and padding = "same";
    • after the two convolution layers, there are MaxPooling and Flatten layers;
    • the next fully connected layer has 32 neurons and uses the ReLU activation function;
    • the output fully connected layer has nine neurons and uses the Softmax activation;
    • batch-normalization layers are alternated between the layers.
• For the self-built dataset: the fully connected neural network and 1D convolutional neural network architectures are similar to those for the Kaggle dataset; the only difference is that the number of input-layer neurons is 78 and the output layer has 12 neurons.

In this module, different machine learning algorithms are studied so that the most suitable algorithm can be recommended for the fine-grained traffic classification problem.

Classifier: the classifier obtained in the previous step is employed in administration or security software.
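A minimal Python sketch of the preprocessing and of the tree-based configuration described above, using stratified 5-fold cross-validation. The file name and column names are assumptions for illustration, not the exact ones used in our dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("flows.csv")                      # hypothetical CICFlowMeter export
df = df.replace([np.inf, -np.inf], np.nan).dropna()
df = df[df["Protocol"].isin([6, 17])]              # keep TCP/UDP flows only (assumed codes)
df = df.drop(columns=["Flow ID", "Src IP", "Dst IP",
                      "Src Port", "Dst Port", "Timestamp"])
df = pd.get_dummies(df, columns=["Protocol"])      # one-hot encode the protocol field

y = df.pop("Label")                                # application label
X = StandardScaler().fit_transform(df)             # zero mean, unit variance

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for model in (DecisionTreeClassifier(criterion="gini"),
              RandomForestClassifier(n_estimators=100, criterion="gini")):
    scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
    print(type(model).__name__, scores.mean())
```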
4 Experiments

4.1 Datasets

4.1.1 Kaggle Dataset
In the first experiment, we looked at the dataset "IP Network Traffic Flows, Labeled with 75 Apps" from Kaggle [27]. The dataset, which contains 87 flow features, is a labeled dataset of 75 applications and application-layer protocols. It contains 3577296 records and is stored in a CSV file. In this dataset, there are 456706 flows of 9 applications, namely Amazon, Facebook, Dropbox, Gmail, Skype, YouTube, Yahoo, Twitter, and Windows Update.

Table 1 Class distribution of applications in the Kaggle dataset

Application      Number of flows
Amazon           86875
Dropbox          25102
Facebook         29033
Gmail            40260
Skype            30657
Twitter          18259
Window update    34471
Yahoo            21268
Youtube          170781
4.1.2 Self-built Dataset

The "IP Network Traffic Flows, Labeled with 75 Apps" dataset may not be reliable, since it is labeled with the ntopng tool [28]. Moreover, most of the labels in the Kaggle data set are application-layer protocols. Hence, we need a data set that is appropriate for the fine-grained traffic classification problem. The diversity of the applications in the self-built data set also needs to be addressed, covering, for example, streaming applications, web applications, and bulk data upload applications. After two days, we obtained 943505 flows of 12 applications, namely Dropbox, Facebook, Outlook, Shopee, Skype, Spotify, Steam, Teams, Thunderbird, Gmail, YouTube, and Zalo. Other researchers can download our self-built dataset, including the pcap and CSV files, from [29].
4.2 Data Preprocessing

Tables 1 and 2 show the class distribution of applications after the data preprocessing described in Sect. 3.
Table 2 Class distribution of applications in the self-built dataset

Application    Number of flows
Dropbox        24907
Facebook       159557
Gmail          48759
Outlook        27622
Shopee         11362
Skype          21086
Spotify        58587
Steam          336433
Teams          59720
Thunderbird    55149
Youtube        95420
4.3 Training and Evaluation of Machine Learning Models

4.3.1 Hardware Configurations

We used Google Colab [30] to conduct the related experiments, equipped with:
• GPU: 1× Tesla K80, 2496 CUDA cores, 12 GB GDDR5 VRAM
• CPU: 1 single-core hyper-threaded Xeon processor @ 2.3 GHz
• RAM: 13 GB
4.3.2 Clustering Solution

Because unlabeled data are abundant while labeled data are limited, our first solution is to use autoclass, an unsupervised algorithm, to solve fine-grained traffic classification. Solving fine-grained traffic classification with autoclass consists of two phases. In phase 1, autoclass is run to divide all the data into different clusters. In phase 2, the clusters are identified: each cluster is mapped to the application that has the most flows in that cluster. Homogeneity H plays a crucial role in this algorithm: the higher H, the better the identification of the applications. However, if more than one application contributes a significant number of flows to one cluster, the efficiency of the identification is reduced [14]. Additionally, an application may not be mapped by any cluster, while multiple clusters can be mapped to the same application. The results obtained when running autoclass on the Kaggle dataset and the self-built dataset are presented in Table 3.
Table 3 Results of running the autoclass algorithm

Dataset       Number of clusters    H
Kaggle        105                   0.543
Self-built    105                   0.645
Fig. 2 Results of running supervised learning algorithms with the Kaggle dataset

Table 4 Results of supervised learning algorithms with the Kaggle dataset

Model                 Balanced-accuracy   Micro-precision   Micro-recall   Micro-F1-score   Predict time (s)
Decision Tree         0.777               0.823             0.823          0.823            0.274
Random Forest         0.776               0.851             0.851          0.851            4.030
Fully Connected NN    0.636               0.755             0.755          0.755            2.490
1D-CNN                0.617               0.741             0.741          0.741            3.758
We can see that autoclass solves the fine-grained traffic classification problem inefficiently: the H values were very low for both datasets, so mapping the clusters to the applications is highly ambiguous. A higher H value is required to unambiguously map the clusters to applications (an H value greater than 0.85 would be expected). It is therefore necessary to have another approach to overcome this problem.
4.3.3 Supervised Machine Learning Solution
5-fold cross-validation and stratified sampling were employed in our experiments to address the issue of imbalanced data. Besides, we also used balanced accuracy, micro-precision, micro-recall, and micro-F1-score. The results of the experiments are presented in Figs. 2 and 3 and Tables 4 and 5. The experimental results demonstrate that random forest and decision tree are effective machine learning techniques for both data sets. The decision tree and random forest accuracies differed by a tiny, almost negligible amount.
Fig. 3 Results of running supervised learning algorithms with the self-built dataset

Table 5 Results of supervised learning algorithms with the self-built dataset

Model                 Balanced-accuracy   Micro-precision   Micro-recall   Micro-F1-score   Predict time (s)
Decision Tree         0.962               0.978             0.978          0.978            0.518
Random Forest         0.957               0.975             0.975          0.975            4.941
Fully Connected NN    0.815               0.893             0.893          0.893            4.942
1D-CNN                0.812               0.889             0.889          0.889            7.341
Compared with the random forest, the decision tree was significantly faster with respect to prediction time. From the results on the two data sets above, the decision tree could be a suitable algorithm for real-time fine-grained traffic classification problems.
5 Conclusions

This paper presented a solution for the fine-grained traffic classification problem, including the method used to collect the data and a study of the behavior of machine learning algorithms on the collected data. We investigated autoclass, an unsupervised machine learning algorithm that effectively solved the coarse-grained traffic classification problem in previous studies, for solving the fine-grained traffic classification problem. We also investigated the efficiency of supervised machine learning algorithms in solving the fine-grained traffic classification problem; the investigated supervised algorithms were decision tree, random forest, fully connected neural network, and 1D convolutional neural network.

Experimental results showed that autoclass did not obtain high accuracy on the fine-grained traffic classification problem. In addition, the decision tree and
random forest gave a balanced accuracy of approximately 96% on our self-built data set. The decision tree algorithm had the lowest prediction time, so this algorithm is well-suited to be deployed in real-time fine-grained traffic classification applications.

A possible avenue for future research is to collect new data and investigate the effectiveness of our trained models on these new data. We will also collect data for more applications to build a model capable of classifying many applications. Besides, we will deploy the trained model in a real application.

Acknowledgments This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.02-2019.314.
References

1. Alizadeh, H., Zúquete, A.: Traffic classification for managing applications' networking profiles. Secur. Commun. Netw. 9(14), 2557–2575 (2016)
2. Nguyen, T.T.T., Armitage, G.: A survey of techniques for internet traffic classification using machine learning. IEEE Commun. Surv. Tutorials 10(4), 56–76 (2008)
3. He, Y., Li, W.: Image-based encrypted traffic classification with convolution neural networks. In: IEEE Fifth International Conference on Data Science in Cyberspace (DSC) 2020, pp. 271–278 (2020)
4. Cherif, I.L., Kortebi, A.: On using eXtreme gradient boosting (XGBoost) machine learning algorithm for home network traffic classification. Wirel. Days 2019, 1–6 (2019)
5. Shafiq, M., Yu, X., Bashir, A.K., et al.: A machine learning approach for feature selection traffic classification using security analysis. J. Supercomput. 74, 4867–4892 (2018)
6. Sun, G., Liang, L., Chen, T., Xiao, F., Lang, F.: Network traffic classification based on transfer learning. Comput. Electr. Eng. 69, 920–927 (2018)
7. Dias, K.L., Pongelupe, M.A., Caminhas, W.M., de Errico, L.: An innovative approach for real-time network traffic classification. Comput. Netw. 158, 143–157 (2019)
8. Menuka, P.K.J., Kandaraj, P., Salima, H.: Network traffic classification using machine learning for software defined networks. In: IFIP International Conference on Machine Learning for Networking, pp. 28–39 (2020)
9. Qazi, Z.A., Lee, J., Jin, T., Bellala, G., Arndt, M., Noubir, G.: Application-awareness in SDN. In: Proceedings ACM SIGCOMM 2013, Hong Kong, China, pp. 487–488 (2013)
10. Valenti, S., Rossi, D., Dainotti, A., Pescapé, A., Finamore, A., Mellia, M.: Reviewing traffic classification. Lecture Notes in Computer Science, pp. 123–147 (2013)
11. Rossi, D., Valenti, S.: Fine-grained traffic classification with Netflow data. In: Proceedings ACM IWCMC 2010, Caen, France, pp. 479–483 (2010)
12. Amaral, P., Dinis, J., Pinto, P., Bernardo, L., Tavares, J., Mamede, H.S.: Machine learning in software defined networks: data collection and traffic classification. In: Proceedings IEEE ICNP 2016, Singapore, pp. 1–5 (2016)
13. Cheeseman, P.C., Stutz, J.C.: Bayesian classification (AutoClass): theory and results. In: Advances in Knowledge Discovery and Data Mining (1996)
14. Zander, S., Nguyen, T., Armitage, G.: Automated traffic classification and application identification using machine learning. In: The IEEE Conference on Local Computer Networks 30th Anniversary (2005)
15. Erman, J., Arlitt, M., Mahanti, A.: Traffic classification using clustering algorithms. In: Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data - MineNet (2006)
16. McLachlan, G.J., Krishnan, T., Ng, S.K.: The EM Algorithm. Humboldt-Universität zu Berlin, Center for Applied Statistics and Economics (CASE), Berlin (2004)
17. Rokach, L., Maimon, O.: Decision trees. In: Data Mining and Knowledge Discovery Handbook (2005)
18. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
19. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
20. https://www.wireshark.org/docs/wsug_html_chunked. Accessed 31 Jul 2020
21. https://www.unb.ca/cic/research/applications.html#CICFlowMeter. Accessed 31 Jul 2020
22. https://pandas.pydata.org/docs/user_guide/index.html#user-guide. Accessed 31 Jul 2020
23. https://numpy.org/doc/stable/user/index.html. Accessed 31 Jul 2020
24. https://scikit-learn.org/stable/user_guide.html. Accessed 31 Jul 2020
25. https://autoclasswrapper.readthedocs.io/en/latest/index.html. Accessed 31 Jul 2020
26. Loh, W.Y.: Classification and regression trees. Wiley Interdisc. Rev. Data Min. Knowl. Discov. 1(1), 14–23 (2011)
27. https://www.kaggle.com/jsrojas/ip-network-traffic-flows-labeled-with-87-apps. Accessed 31 Jul 2020
28. https://www.ntop.org/products/traffic-analysis/ntop/. Accessed 31 Jul 2020
29. https://bitly.com.vn/hpD7z. Accessed 31 Jul 2020
30. https://colab.research.google.com/notebooks/intro.ipynb. Accessed 31 Jul 2020
Soil Moisture Monitoring System Based on LoRa Network to Support Agricultural Cultivation in Drought Season

Tien Cao-Hoang, Kim Anh Su, Trong Tinh Pham Van, Viet Truyen Pham, Duy Can Nguyen, and Masaru Mizoguchi

Abstract This paper presents an Internet of Things system based on a LoRa wireless sensor network for monitoring soil moisture to support agricultural cultivation in the drought season. Our proposed system was developed using the Ai-Thinker Ra-02 LoRa module integrated with an Arduino Pro Mini board, which is responsible for gathering soil moisture data and transmitting them to a gateway that forwards them to a data server on the internet, so that farmers and researchers can monitor the soil condition remotely. Two experiments were conducted: a sensor calibration test, to calculate the volumetric water content from the sensor's analog signal, and a network coverage test, to determine the LoRa module's capability. The system has been deployed for testing in a real situation. It is expected to support farmers in monitoring the soil condition of plants affected by drought and salinity.
1 Introduction

In the Mekong Delta, the negative effects of drought and salinization on agriculture have been increasing over the past several years and will become more severe in the future. Many hectares of agricultural area have been lost because of drought and salinity. This leads to challenges for agricultural cultivation, especially the limited irrigation water supply and the degradation of water quality. One temporary solution is to bring water from upstream areas to irrigate the plants, but the transportation cost is high. Another solution is to use a salt-water filtration system, but the cost of such a system is also high. The agricultural problem that needs to be solved in the drought season is providing enough water for irrigation.
In order to achieve agricultural systems that are socially, environmentally and economically sustainable, water resources and other agricultural inputs must be used efficiently [1]. Several Internet of Things (IoT) systems have been studied and applied in the field of agriculture to solve agricultural problems. Most of these systems are based on wireless sensor networks (WSNs) that consist of multiple nodes sending data to a coordinator, where each node is responsible for measuring environmental and soil information. In [2], a wireless soil monitoring system based on the Zigbee protocol and GPRS was presented for automatic sprinkler control. The overall system provided real-time monitoring and control of the water content, temperature and pH of the soil with acceptable water oversupply. Srbinovska et al. developed a low-power radio frequency (RF) wireless sensor network application for precision agriculture in a pepper greenhouse, in which environmental parameters such as temperature, humidity and illumination were continually monitored and controlled in order to provide optimal crop conditions [3]. Achmad et al. proposed a system that monitors soil moisture and soil pH in a starfruit plantation using long-range (LoRa) technology to transmit data from nodes to a master node, with a maximum range of 700 m [4]. In [5], the authors present a low-consumption IoT solution based on WSN and LoRaWAN technology: the system monitors the soil and weather conditions by deploying sensor nodes in a remote area that transmit data (with a range of up to 600 m) to a gateway, where the data are then forwarded to the network so that a farmer can monitor the crop condition remotely.

Based on the related works, it can be seen that a WSN based on LoRa technology seems to be the best fit for agriculture in our situation because of its long-range transmission and its independence from third-party networks such as GPRS or 4G infrastructure, which require an extra service cost. Also, LoRa technology is currently one of the most promising specifications of Low-Power Wide-Area Networks for battery-powered wireless nodes [6].

In this paper, we propose an IoT system, namely a soil moisture monitoring system based on a LoRa network, to support agricultural practice in the drought season. The purpose of the system is to save irrigation water and to help plants overcome water stress during the drought season. The proposed system can measure the rise and fall of the amount (or percentage) of water in the soil in order to support irrigation decisions. Soil is made up of water, air, minerals and organic matter, and, as a component, water makes up a percentage of the total. Irrigating crops based on the needs of the plants and at the right moment can avoid wasting water.

The rest of the paper comprises three sections. The system architecture is presented in Sect. 2. Section 3 presents the coverage test and the EC-5 soil sensor test. Finally, the discussion and conclusion are given in Sect. 4.
Fig. 1 Proposed soil moisture monitoring system architecture
2 Proposed System Architecture The architecture of the proposed system is described in Fig. 1. The system is based on the star-topology LoRa network, which consists of sensor nodes, gateway and enduser. The sensor node comprises a processor and LoRa transceiver module, which measures soil moisture information and transmits them wirelessly to the gateway via LoRa network. The soil data are then forwarded to the data server on the internet via WiFi, Ethernet where end-user can monitor the soil data remotely.
2.1 Sensor Node The sensor node is responsible for measuring and transmitting soil moisture data to the gateway. Fig. 2 illustrates the sensor node structure, which has 4 components: processor, wireless LoRa module, sensor and power bank. The processor controls the power supply to the sensor (turning it ON/OFF) via a switch driven by one of the microcontroller's digital pins. It also reads the sensor data via an analog pin and then transmits the data to the gateway through the LoRa wireless module. The node is powered by 3×AA batteries, which provide 4.5 V. Fig. 3 shows the sensor node's hardware, which is put inside a box for outdoor deployment.
Fig. 2 Sensor node structure
Fig. 3 Sensor node
2.1.1
Processor
The Arduino Pro Mini 3.3 V version was used as the heart of the sensor node; it integrates an Atmel ATmega328p microcontroller. The sensor node's firmware was developed using the Arduino IDE, and the board is powered by 3×AA batteries. The sensor node software flowchart is illustrated in Fig. 5. Normally, the node is kept in sleep mode in order to reduce power consumption. At every sampling interval, it takes a sensor measurement via an analog input pin and then transmits the data
Table 1 Specifications of the AI-Thinker RA-02 module

Characteristic           Description
Communication distance   Up to 15 km
Sensitivity              Down to −148 dBm
Programmable bit rates   Up to 300 kbps
RSSI dynamic range       127 dB
Wireless frequency       433 MHz
Working voltage          1.8–3.7 V
Working temperature      −40 to +80 °C
to the gateway. After that, the node enters the sleep mode in order to lower its power consumption.
2.1.2
Wireless LoRa Module
In this system, we used the Ai-Thinker Ra-02 LoRa module, a low-cost wireless transmission module based on SEMTECH's SX1278 transceiver. The module costs about 5 USD on the market. The SX1278 RF chip is mainly used for long-range spread-spectrum communication and offers high immunity to interference while minimizing current consumption. The detailed specification is shown in Table 1. The module operates at 433 MHz and 3.3 V and communicates with the Arduino via the SPI protocol.
2.1.3
Sensor
In this research, the METER EC-5 soil sensor is used to measure soil moisture, expressed as volumetric water content (VWC): of all the soil constituents in a known volume of soil (which together total 100%), VWC is the volume of water divided by the total soil volume. The sensor determines VWC by measuring the dielectric constant of the medium using capacitance/frequency-domain technology and delivers research-grade accuracy [7]. It can operate at 2.5–3.6 V excitation, and its output is an analog signal that can be read by the Arduino board via an analog pin. The sensor produces an output voltage that depends on the dielectric constant of the medium surrounding the sensor and ranges between 10% and 50% of the excitation voltage. Because the sensor output is a voltage value, it needs to be converted to VWC in order to know how much water is currently stored in the soil. Usually, the signal is converted into VWC by an additional device such as a data logger or a hand-held soil meter. Instead, we propose to use a simple algorithm, explained in Sect. 3: the method is to find the correlation between the value read by the sensor node and the one read by a commercial data logger.
Fig. 4 Gateway hardware connection
The EC-5 sensor data are read by the sensor node as follows. First, the node powers the sensor with 3.3 V excitation by using a digital pin to turn ON the 3.3 V power switch, and then reads the sensor output on the corresponding analog pin. After taking the reading, the 3.3 V power switch is turned OFF by setting the digital pin low. The EC-5 signal is digitized by the 10-bit internal ADC (analog-to-digital converter), giving raw values in the range 0–1023. A laboratory experiment has been conducted in order to calculate VWC from the raw sensor signal read by the Arduino; the experiment is presented in detail in Sect. 3.
2.2 Gateway The gateway is the center of the wireless system: it collects the data sent from the sensor nodes and forwards them to the internet. In this research, we designed a gateway using an Arduino board with an integrated Ra-02 LoRa module, connected to a Raspberry Pi via a serial port as shown in Fig. 4. The Arduino board operates as a LoRa receiver that collects the data transmitted from the sensor nodes. Each sensor node sends data including node ID, battery level and soil moisture value to the receiver via LoRa; the receiver then transfers the data to the Raspberry Pi via the serial port for further processing. The receiver software flowchart is shown in Fig. 6. After receiving data, the Raspberry Pi forwards them to a server on the internet through the MQTT protocol by using a Node-RED program. The data are also inserted into a cloud-based MySQL database that we established to store the soil data for further analysis. We also developed a web page to visualize the soil data, where the farmer can access the soil condition remotely. The website is built with PHP and integrates MQTT via the WebSocket protocol, so data are shown immediately after they are received from the gateway.
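The authors implement this serial-to-MQTT forwarding with Node-RED on the Raspberry Pi. As a rough sketch of the same data path in Python (not the authors' flow), the snippet below assumes a hypothetical `nodeID,battery,moisture` line format from the Arduino receiver, a hypothetical broker address and topic, and the pyserial and paho-mqtt 1.x libraries:

```python
# Hypothetical Raspberry Pi forwarder: serial -> MQTT (sketch, not the authors' Node-RED flow)
import json
import serial                      # pyserial
import paho.mqtt.client as mqtt    # paho-mqtt 1.x style client

SERIAL_PORT = "/dev/ttyUSB0"       # assumed port of the Arduino LoRa receiver
BROKER, TOPIC = "broker.example.com", "farm/soil"   # hypothetical broker and topic

def parse_line(line: str) -> dict:
    """Assume the receiver prints 'nodeID,battery,moisture' for each packet."""
    node_id, battery, moisture = line.strip().split(",")
    return {"node": node_id, "battery": float(battery), "moisture": float(moisture)}

def main() -> None:
    client = mqtt.Client()
    client.connect(BROKER, 1883)
    client.loop_start()
    with serial.Serial(SERIAL_PORT, 9600, timeout=10) as port:
        while True:
            raw = port.readline().decode(errors="ignore")
            if not raw.strip():
                continue                      # serial timeout, nothing received
            try:
                client.publish(TOPIC, json.dumps(parse_line(raw)))
            except ValueError:
                pass                          # ignore malformed packets

if __name__ == "__main__":
    main()
```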
Fig. 5 Sensor node software flowchart
3 Experiment Results In this section, we explain some experimental results that we conducted in this research including laboratory EC-5 soil sensor calibration test to calculate VWC of soil from the analog signal, network coverage test to find the LoRa module coverage capability and field test to check the system operation in the real situation.
3.1 EC-5 Soil Sensor Laboratory Calibration Experiment The goal of this experiment is to convert the output of the EC-5 sensor to VWC (m³/m³). The method is to measure the sensor signal using both the Arduino (sensor node) and the METER EM50 data logger. The EM50 is a commercial data logger that is capable of sampling and recording data from the EC-5 soil sensor; it can be connected to a computer running the ECH2O software for configuration and for reading the sensor data. To read the sensor data from the node, we connected the EC-5 sensor to an analog pin of the node and collected data using the Arduino IDE's serial console. The calibration equation is derived by correlating the output of the sensor measured by the Arduino (sensor node) against the readings measured with the EM50. The relationship between the readings taken by the sensor node and by the Meter EM50 data logger, in soil samples with different water contents, is shown in Fig. 7. The trend line has been fitted with R² > 0.99. The following equation has been used to convert the sensor node's ADC output to the Meter EM50's output.
Fig. 6 Receiver software flowchart
Calibrated AdcOut = 4.6903 × AdcOut − 10.526
(1)
where AdcOut is the value taken by the sensor node. From CalibratedAdcOut we can calculate the soil VWC (m³/m³) based on the soil type [7]. For example, the VWC (θ) of mineral soil can be calculated using the following equation: θ = (8.5 × 10⁻⁴) · (CalibratedAdcOut) − 0.48
(2)
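As a minimal illustration of this calibration step, the sketch below first fits a linear trend line from paired node/EM50 readings (the sample pairs here are made up; the paper's fitted coefficients are those of Eq. (1)), and then applies Eqs. (1) and (2) to convert a raw 10-bit ADC reading into VWC for mineral soil:

```python
# Sketch of the calibration workflow; the example reading pairs are hypothetical.
import numpy as np

# Paired readings: sensor-node ADC output vs. EM50 output (illustrative values only)
node_adc = np.array([120, 150, 180, 210, 240])
em50_out = np.array([552, 693, 834, 975, 1116])

slope, intercept = np.polyfit(node_adc, em50_out, 1)   # linear trend line as in Fig. 7
print(f"fitted line: {slope:.4f} * AdcOut + {intercept:.3f}")

def calibrated_adc(adc_out: float) -> float:
    """Eq. (1): convert the node's 10-bit ADC value to the EM50 scale."""
    return 4.6903 * adc_out - 10.526

def vwc_mineral_soil(adc_out: float) -> float:
    """Eq. (2): volumetric water content (m^3/m^3) for mineral soil."""
    return 8.5e-4 * calibrated_adc(adc_out) - 0.48

print(f"VWC at AdcOut=200: {vwc_mineral_soil(200):.3f} m^3/m^3")
```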
3.2 Network Coverage Test The goal of this test is to find the maximum distance over which the Ai-Thinker Ra-02 LoRa module can transmit data. The gateway was placed at a fixed station and the node was moved to different distances from 10 to 500 m; the average RSSI (Received Signal Strength Indication) and PDR (Packet Delivery Rate) were then calculated for evaluation. The result of the test is shown in Table 2. At ranges under 330 m, the RSSI stayed above −103 dBm and the PDR was 100%; at ranges above 400 m, no packets were received. The network performance was affected by obstacles such as trees and buildings.
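A small sketch of how the two metrics can be computed per test distance from a log of received packets (the field layout and packet counts below are assumptions for illustration):

```python
# Average RSSI and Packet Delivery Rate per test distance (illustrative log)
from statistics import mean

# Each entry: (distance_m, list_of_received_rssi_dbm, packets_sent)
test_log = [
    (10,  [-60, -61, -59.5], 3),
    (330, [-103, -102.8],    3),
    (400, [],                3),   # nothing received beyond 400 m
]

for distance, rssi_values, sent in test_log:
    pdr = 100.0 * len(rssi_values) / sent
    avg_rssi = mean(rssi_values) if rssi_values else float("nan")
    print(f"{distance:>4} m  RSSI {avg_rssi:7.1f} dBm  PDR {pdr:5.1f} %")
```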
Fig. 7 Relationship between sensor node reading and EM50 reading

Table 2 Network coverage test result

Distance (m)   RSSI (dBm)   PDR (%)
10             −60.17       100
30             −63.9        100
50             −70.5        100
80             −74.7        100
100            −85.8        100
200            −94.2        100
260            −99.8        100
330            −103         100
3.3 Field Test In this experiment, we deployed the sensor nodes on the experimental field and the gateway was placed in a campus building. The sampling time for reading the sensor data was set to 10 min. The setup of the sensor nodes on the experimental field is shown in Fig. 8. The EC-5 soil sensor was installed 15 cm under the soil surface.
Fig. 8 Installation of the sensor node and gateway
Users, including researchers and farmers, can access the data through a website that we deployed on the internet (as shown in Fig. 9), so the user can monitor the soil condition from anywhere. The soil data, which are converted into VWC, can be visualized and exported as graphs and tables, and researchers can download the raw data for further analysis. The latest soil and battery readings are also shown on the web page, which helps the farmer to replace a sensor node's battery and to take irrigation action if necessary in order to avoid water stress on the plant.
Fig. 9 Data visualization
4 Discussion and Conclusion In this study, a soil monitoring system based on the LoRa network was developed and tested to monitor soil moisture using the Meter EC-5 sensor. The system uses a wireless star topology which consists of a gateway and a number of sensor nodes that are capable of transmitting and receiving data using LoRa technology. The gateway connects to the internet via WiFi and forwards data to a data server, where the user can easily access them using a smartphone or computer. We conducted experiments to evaluate the LoRa network performance. The results show that the maximum transmission distance of the Ra-02 module is about 330 m, with an RSSI of −103 dBm; methods of increasing the transmission range will be considered in our future work. A laboratory test was conducted to find the equation for calculating VWC from the microcontroller ADC signal; the EC-5 soil sensor is powered with 3.3 V excitation and the Arduino ADC is 10 bit. The system can measure the soil moisture throughout the whole day (for example, every hour), so we can understand the soil condition more clearly and see how the moisture changes. Irrigating crops based on the needs of a plant and at the right moment can avoid wasting water. To assess the effectiveness of irrigation, we can place soil moisture sensor probes in and below the root zone: if the moisture drops below the stress point for a period of time, the plant needs to be watered. The number of sensor nodes placed in the field depends on the irrigation method. Farmers use several different irrigation methods: drip
irrigation, sprinkler irrigation, and manual watering with a water pipe. Drip irrigation is one of the most efficient methods because it directs water only where it is needed (at the base of the plant, close to the roots); in this case, one sensor node could be enough. Sprinkler systems spread water over the whole field, even where it is not necessary, with a consequent waste of water, and in the manual method the farmer uses a water pipe to irrigate each plant in the crop. With these two methods, we need more sensors placed in the field in order to ensure that the plants receive enough water in the root zone. However, if the irrigation is uneven, the moisture sensor readings will not be reliable. Moreover, if there are several soil types in different locations on the farm, we need to place more sensors, since the soil moisture changes differently in each soil type. We understand that no system adapts perfectly to all situations, because each situation requires special attention to identify the optimal irrigation setup; we hope that our system can help farmers to better understand their soil and crops. The system has been tested in a real experimental situation. It showed that the wireless communication system is reliable and capable of monitoring the changes in soil water content due to rain and drying. However, the sensor node with a 10-bit ADC resolution may not be accurate enough to detect small changes in the soil water content. The EC-5 sensor is also expensive for farmers, and it is affected by saline water. We are therefore looking into developing a cheaper soil sensor that can detect the timing of irrigation.
References
1. Payero, J.O., Nafchi, A.M., Davis, R., Khalilian, A.: An Arduino-based wireless sensor network for soil moisture monitoring using Decagon EC-5 sensors. Open J. Soil Sci. 7, 288–300 (2017)
2. Nagarajan, G., Minu, R.: Wireless soil monitoring sensor for sprinkler irrigation automation system. Wirel. Pers. Commun. 98, 1835–1851 (2018)
3. Srbinovska, M., Gavrovski, C., Dimcev, V., Krkoleva, A., Borozan, V.: Environmental parameters monitoring in precision agriculture using wireless sensor networks. J. Clean. Prod. 88, 297–307 (2015)
4. Rachmani, A.F., Zulkifli, F.Y.: Design of IoT monitoring system based on LoRa technology for starfruit plantation. In: TENCON 2018-2018 IEEE Region 10 Conference, pp. 1241–1245. IEEE (2018)
5. Borrero, J.D., Zabalo, A.: An autonomous wireless device for real-time monitoring of water needs. Sensors 20, 2078 (2020)
6. Valente, A., Silva, S., Duarte, D., Cabral Pinto, F., Soares, S.: Low-cost LoRaWAN node for agro-intelligence IoT. Electronics 9, 987 (2020)
7. METER: Meter group (2020). http://publications.metergroup.com/Manuals/20431_EC-5_Manual_Web.pdf. Accessed 15 July 2020
Optimization Under Fuzzy Constraints: Need to Go Beyond Bellman-Zadeh Approach and How It Is Related to Skewed Distributions Olga Kosheleva, Vladik Kreinovich, and Hoang Phuong Nguyen
Abstract In many practical situations, we need to optimize the objective function under fuzzy constraints. Formulas for such optimization are known since the 1970s paper by Richard Bellman and Lotfi Zadeh, but these formulas have a limitation: small changes in the corresponding degrees can lead to a drastic change in the resulting selection. In this paper, we propose a natural modification of this formula, a modification that no longer has this limitation. Interestingly, this formula turns out to be related to formulas to skewed (asymmetric) generalizations of the normal distribution.
1 Formulation of the Problem Need for Optimization Under Constraints. Whenever we have a choice, we want to select the alternative which is the best for us. The quality of each alternative a is usually described by a numerical value f (a). In these terms, we want to select the alternative aopt for which this numerical value is the largest possible: f (aopt ) = max f (a). a
(1)
Often, not all theoretically possible alternatives are actually possible, there are some constraints. For example, suppose we want to drive from point A to point B in the shortest possible time, so we plan the shortest path—but it may turn out that some of the roads are closed, e.g., due to an accident, or to extreme weather conditions, or O. Kosheleva · V. Kreinovich (B) University of Texas at El Paso, El Paso, TX 79968, USA e-mail: [email protected] O. Kosheleva e-mail: [email protected] H. P. Nguyen Division Informatics, Math-Informatics Faculty, Thang Long University, Nghiem Xuan Yem Road, Hoang Mai District, Hanoi, Vietnam © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_15
to some public event. In such situations, we can only select an alternative that satisfies these constraints. Let us describe this situation in precise terms. Let A denote the set of all actually available alternatives—i.e., all alternatives that satisfy all the given constraints. In this case, instead of the original unconstrained optimization problem (1), we have a modified problem: we need to select an alternative aopt ∈ A for which the value of the objective function f (a) attains its largest possible value on the set A: f (aopt ) = max f (a). a∈A
(2)
Need for Optimization Under Fuzzy Constraints. The above formulation assumes that we know exactly which alternatives are possible and which are not, i.e., that the set A of possible alternatives is crisp. In practice, this knowledge may come in terms of words from natural language. For example, you may know that it is highly probable that a certain alternative a will be possible. A natural way to describe such knowledge in precise terms is to use fuzzy logic—technique specifically designed by Lotfi Zadeh to translate imprecise (“fuzzy”) knowledge from natural language to numbers; see, e.g., [3, 5, 9–11, 13]. In this technique, to each alternative a, we assign the degree μ(a) ∈ [0, 1] to which this alternative is possible: • degree μ(a) = 1 means that we are absolutely sure that this alternative is possible, • degree μ(a) = 0 means that we are absolutely sure that this alternative is not possible, • and degrees between 0 and 1 indicate that we have some—but not full—confidence that this alternative is possible. How can we optimize the objective function f (a) under such fuzzy constraints? Bellman-Zadeh Approach: A Brief Reminder. The most widely used approach to solving this problem was proposed in a joint paper [2] that Zadeh wrote in collaboration with Richard Bellman, one of the world’s leading authorities in optimization. Their main idea was to explicitly say that what we want is an alternative which is possible and optimal. We know the degree μ(a) to which each alternative is possible. To describe to the degree μopt (a) to which an alternative a is optimal, Bellman and Zadeh proposed to use the following formula: μopt =
( f (a) − m) / (M − m),   (3)
where m is the absolute minimum of the function f (a) and M is its absolute maximum. For example, we can define m and M by considering all alternatives for which there is at least some degree of possibility, i.e., for which μ(a) > 0:
m = min_{a: μ(a)>0} f (a),   M = max_{a: μ(a)>0} f (a).   (4)
Once we know the degrees μ(a) and μopt (a) that the alternative a is possible and that this alternative is optimal, to find the degree d(a) to which the alternative a is possible and optimal, we can use the usual idea of fuzzy logic—namely, apply an appropriate “and”-operation (t-norm) f & (a, b) to these degrees, resulting in d(a) = f & (μ(a), μopt (a))
(5)
In principle, we can use any “and”-operation: e.g., the operations min(a, b) and a · b proposed in the very first Zadeh’s paper on fuzzy logic, or any more complex operation. Once we have selected an “and”-operation and computed, for each alternative a, the degree d(a) to which a is desired, a natural idea is to select the alternative for which this degree is the largest possible: d(aopt ) = max d(a). a
(6)
Comment. In formulating the formula (6), we do not need to explicitly restrict ourselves to alternatives a for which μ(a) > 0: indeed, if μ(a) = 0, then, by the properties of an “and”-operation, we have d(a) equal to 0—i.e., to the smallest possible value. Limitations of the Bellman-Zadeh Approach. Degrees μ(a) describing the person’s degree characterize subjective feelings and are, thus, approximate; these values have some accuracy ε. This means that the same subjective feeling can be described by two different values μ and μ , as long as these values differ by no more than ε: |μ − μ | ≤ ε. In particular, the same small degree of possibility can be characterized by 0 and by a small positive number ε. It seems reasonable to expect that small—practically indistinguishable—changes in the value of the degrees would lead to small, practically indistinguishable, changes in the solution to the corresponding optimization problem. But, unfortunately, with the Zadeh-Bellman approach, this is not the case. To show this, let us consider a very simple example when: • each alternative is characterized by a single number, • the objective function is simply f (a) = a, • the membership function μ(a)—e.g., corresponding to “small positive”—is a triangular membership function μ(a) which is equal to 1 − a for a ∈ [0, 1] and to 0 for all other values a, and • the “and”-operation is f & (a, b) = a · b. In this case, the set {a : μ(u) > 0} is equal to [0, 1), so m = 0, M = 1, and
μopt (a) = (a − 0)/(1 − 0) = a.

So, d(a) = f & (μ(a), μopt (a)) = (1 − a) · a = a − a². Differentiating this expression with respect to a and equating the derivative to 0, we conclude that the maximum of this function is attained when 1 − 2a = 0, i.e., for aopt = 0.5. On the other hand, if we replace 0 values of the degree μ(a) for a ∈ [−1, 0] with a small value μ(a) = ε > 0, then we get {a : μ(a) > 0} = [−1, 1), so m = −1, thus

μopt (a) = (a − (−1))/(1 − (−1)) = (a + 1)/2.

For a ≤ 0, the product d(a) is increasing, so its maximum has to be attained for a ≥ 0. For values a ≥ 0, we have

d(a) = f & (μ(a), μopt (a)) = (1 − a) · (a + 1)/2 = (1 − a²)/2.
This is a decreasing function, so its maximum is attained when aopt = 0. So, indeed, an arbitrarily small change in μ(a) can lead to a drastic change in the selected “optimal” alternative. What is Known About This Problem. What we showed is that a change in m can lead to a drastic change in the selected alternative. Interestingly, a change in M is not that critical: for the product “and”-operation f & (a, b) = a · b, we select an alternative that maximizes the expression d(a) = μ(a) ·
( f (a) − m)/(M − m).
If we multiply all the values of the maximized function by the same positive constant M − m, its maximum remains attained for the same value a. Thus, it is sufficient to find the alternative that maximized the product (M − m) · d(a) = μ(a) · ( f (a) − m). Good news is that this expression does not depend on M at all. It turns out (see, e.g., [7]) that f & (a, b) is the only “and”-operation for which there is no such dependence. Thus, in the following text, we will use this “and”-operation. On the other hand, in [7], it was also shown that no matter what “and”-operation we select, the result will always depend on m—and thus, will always have the same problem as we described above. Remaining Problem. So, to make sure that the selection does not change much if we make a small change to the membership function μ(a), we cannot just change the “and”-operation, we need to change the formulas (3) and (4).
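The instability discussed above is easy to check numerically. The sketch below discretizes the 1-D example from Sect. 1 (f(a) = a, triangular membership, product "and"-operation) and shows that the Bellman-Zadeh selection jumps from a ≈ 0.5 to a ≈ 0 once the zero degrees on [−1, 0] are replaced by a small ε; this is only an illustration of the argument, not code from the paper:

```python
# Numerical check of the sensitivity of the Bellman-Zadeh selection (illustration)
import numpy as np

def bellman_zadeh_choice(eps: float) -> float:
    a = np.linspace(-1.0, 1.0, 20001)
    f = a                                              # objective f(a) = a
    mu = np.where((a >= 0) & (a <= 1), 1.0 - a, eps)   # triangular membership, eps outside [0, 1]
    support = mu > 0
    m, M = f[support].min(), f[support].max()          # formula (4)
    mu_opt = (f - m) / (M - m)                         # formula (3)
    d = mu * mu_opt                                    # product "and"-operation, formula (5)
    return a[np.argmax(d)]

print(bellman_zadeh_choice(0.0))    # about 0.5
print(bellman_zadeh_choice(1e-6))   # about 0: the selection jumps
```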
What We Do in This Paper. In this paper, we propose an alternative to the formulas (3) and (4), under which small changes in the degree μ(a) lead to small changes in the resulting selection.
2 Main Idea and the Resulting Definition

Main Idea. We want to use the fact—mentioned several times by Zadeh himself—that the same uncertainty can be described both in terms of the probability density function ρ(x) and in terms of the membership function μ(x). In both cases, we start with the observed number of cases N(x) corresponding to different values x, but then the procedure differs:
• to get a probability density function, we need to appropriately normalize the values N(x), i.e., take ρ(x) = c · N(x), where the constant c must be determined from the condition that the overall probability is 1: ∫ ρ(x) dx = 1;   (7)
• to get a membership function, we also need to appropriately normalize the values N(x), i.e., take μ(x) = c · N(x), where the constant c must be determined from the condition that the largest value of the membership function is 1: max_x μ(x) = 1.
Because of this possibility, if we start with a membership function, we can normalize it into a probability density function ρ(x) = c · μ(x) by multiplying all the degrees μ(x) by an appropriate constant c. One can easily find this constant by substituting ρ(x) = c · μ(x) into the formula (7). As a result, we get

ρ(x) = μ(x) / ∫ μ(y) dy.
How to Use This Idea: Analysis. Based on the known membership function μ(a), we can use the usual Zadeh extension principle (see, e.g., [3, 5, 9–11]) to find the membership function ν(x) corresponding to the value x = f (a):

ν(x) = sup_{a: f (a)=x} μ(a).   (8)
Based on this membership function, we can find the corresponding probability density function ρ_X(x) on the set of all the values of the objective function:

ρ_X (x) = ν(x) / ∫ ν(y) dy.   (9)
In these terms, a reasonable way to gauge the optimality of an alternative a with the value X = f (a) is by the probability F(X) that a randomly selected value x will be smaller than or equal to X. If this probability is equal to 1, this means that almost all values f (a′) are smaller than or equal to f (a)—i.e., that we are practically certain that this alternative a is optimal. The smaller this probability, the less sure we are that this alternative is optimal. In probability and statistics, the probability F(X) is known as the cumulative distribution function (see, e.g., [12]); it is determined by the formula

F(X) = ∫_{−∞}^{X} ρ_X (x) dx.   (10)
Substituting the expression (9) into this formula, we can express F(X) in terms of the membership function ν(x):

F(X) = ( ∫_{−∞}^{X} ν(x) dx ) / ( ∫ ν(x) dx ).   (11)

The probability ρ(a) that a is possible is also proportional to μ(a): ρ(a) = c · μ(a) for an appropriate coefficient c. The probability that an alternative a is possible and optimal can be estimated as the product ρ(a) · F( f (a)) of the corresponding probabilities. It is therefore reasonable to select an alternative for which this probability is the largest possible. Since c is a positive constant, maximizing the product ρ(a) · F( f (a)) = c · μ(a) · F( f (a)) is equivalent to maximizing the simpler expression μ(a) · F( f (a)). Thus, we arrive at the following idea.

Resulting Idea. To select an alternative under fuzzy constraints, we suggest finding the alternative that maximizes the product μ(a) · F( f (a)), where the function F(X) is determined by the formula

F(X) = ( ∫_{−∞}^{X} ν(x) dx ) / ( ∫ ν(x) dx ),   (11)

and the corresponding function ν(x) is determined by the formula

ν(x) = sup_{a: f (a)=x} μ(a).   (8)
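A minimal discretized sketch of the proposed selection rule on the same 1-D example (f(a) = a, triangular μ): ν is obtained from μ through (8), F through (11) by cumulative summation, and the alternative maximizing μ(a) · F(f(a)) is returned. This is an illustration of the formulas, not the authors' code; the answer matches the worked example given later in the paper (about 0.42):

```python
# Sketch: optimization under fuzzy constraints via mu(a) * F(f(a)), formulas (8) and (11)
import numpy as np

a = np.linspace(0.0, 1.0, 10001)
f = a                                   # objective f(a) = a
mu = 1.0 - a                            # triangular membership ("small positive")

# Formula (8): nu(x) = sup over {a : f(a) = x} of mu(a); here f is one-to-one, so nu = mu on f's range
order = np.argsort(f)
x, nu = f[order], mu[order]

# Formula (11): cumulative trapezoid sum of nu, normalized by the total integral
dx = np.diff(x)
increments = 0.5 * (nu[1:] + nu[:-1]) * dx
cum = np.concatenate(([0.0], np.cumsum(increments)))
F = cum / cum[-1]

# Select the alternative maximizing mu(a) * F(f(a))
F_of_f = np.interp(f, x, F)
best = a[np.argmax(mu * F_of_f)]
print(f"selected alternative: {best:.3f}")   # about 0.42
```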
Discussion. One can see that if we make minor changes to the degrees μ(a), we will get only minor changes to the resulting selection. Simplest 1-D Case. In the 1-D case, when f (a) = a, we have ν(x) = μ(x) and thus, maximizing the product μ(a) · F( f (a))—or, equivalently, the product ρ(a) · F( f (a)) is equivalent to maximizing the product ρ(a) · F(a). Interestingly, the standard formula for the probability density function of the skewed generalization of normal distribution—skew-normal distribution—has
exactly this form ρ(a) · F(a), where ρ(a) is the probability density function of the normal distribution and F(a) is the corresponding cumulative distribution function; see, e.g., [1, 8]. It is also worth mentioning that, vice versa, fuzzy ideas can be used to explain the formulas for the skew-normal distribution; see, e.g., [4].

Example. In the above example,

F(X) = ∫_0^X (1 − x) dx = X − X²/2,

so we need to find the value aopt for which the product (1 − a) · (a − a²/2) attains the largest possible value. Differentiating this expression with respect to a and equating the derivative to 0, we get

−(a − a²/2) + (1 − a) · (1 − a) = 0,

so

−a + a²/2 + 1 − 2a + a² = 0,

and

(3/2) · a² − 3a + 1 = 0.

Thus,

aopt = (3 ± √(9 − 6))/3,

i.e., taking into account that a ≤ 1, we take

aopt = (3 − √3)/3 = 1 − √3/3 ≈ 0.42.
One can see that for small ε > 0 we get very close values. Comment. The original Bellman-Zadeh formula can be described in the same way, but with the cumulative distribution function F(X ) corresponding to the uniform distribution on the interval [m, M]; see, e.g., [6]. From this viewpoint, our proposal can be viewed as a natural generalization of the original formula, a generalization that takes into account that not all the values from the interval [m, M] are equally possible. Acknowledgments This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References 1. Azzalini, A., Capitanio, A.: The Skew-Normal and Related Families. Cambridge University Press, Cambridge (2013) 2. Bellman, R.E., Zadeh, L.A.: Decision making in a fuzzy environment. Manag. Sci. 17(4), B 141–B 164 (1970) 3. Belohlavek, R., Dauben, J.W., Klir, G.J.: Fuzzy Logic and Mathematics: A Historical Perspective. Oxford University Press, New York (2017) 4. Flores Muñiz, J.G., Kalashnikov, V.V., Kalashnykova, N., Kosheleva, O., Kreinovich, V.: Why skew normal: a simple pedagogical explanation. Int. J. Intell. Technol. Appl. Stat. 11(2), 113– 120 (2018) 5. Klir, G., Yuan, B.: Fuzzy Sets and Fuzzy Logic. Prentice Hall, Upper Saddle River (1995) 6. Kosheleva, O., Kreinovich, V.: Why Bellman-Zadeh approach to fuzzy optimization. Appl. Math. Sci. 12(11), 517–522 (2018) 7. Kreinovich, V., Kosheleva, O., Shahbazova, S.: Which t-norm is most appropriate for BellmanZadeh optimization. In: Shahbazova, S.N., Kacprzyk, J., Balas, V.E., Kreinovich, V. (eds.) Proceedings of the World Conference on Soft Computing, Baku, Azerbaijan, 29–31 May 2018 (2018) 8. Li, B., Shi, D., Wang, T.: Some applications of one-sided skew distributions. Int. J. Intell. Technol. Appl. Stat. 2(1), 13–27 (2009) 9. Mendel, J.M.: Uncertain Rule-Based Fuzzy Systems: Introduction and New Directions. Springer, Cham (2017) 10. Nguyen, H.T., Walker, C.L., Walker, E.A.: A First Course in Fuzzy Logic. Chapman and Hall/CRC, Boca Raton (2019) 11. Novák, V., Perfilieva, I., Moˇckoˇr, J.: Mathematical Principles of Fuzzy Logic. Kluwer, Boston/Dordrecht (1999) 12. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. Chapman and Hall/CRC, Boca Raton (2011) 13. Zadeh, L.A.: Fuzzy sets. Inf. Control 8, 338–353 (1965)
Towards Parallel NSGA-II: An Island-Based Approach Using Fitness Redistribution Strategy Le Huy Hoang, Nguyen Viet Long, Nguyen Ngoc Thu Phuong, Ho Minh Hoang, and Quan Thanh Tho
Abstract Non-dominated sorting genetic algorithm II (NSGA-II) is introduced as a powerful variant of genetic algorithm because it alleviates computational complexity and removes sharing parameter in comparing to other multiobjective evolutionary algorithms (MOEAs). Master-slave, island model and diffusion model are three approaches to parallel MOEAs. However, in those approaches, to ensure that the crossover operator is performing efficiently across sub-populations on multiple threads remains a challenging issue. In this paper, we propose an approach based on island model with a new strategy that properly divides the population into islands, each of which runs in an individual thread, but still exchanges their chromosomes with good fitness to each other reasonably and effectively. We regard our strategy as fitness redistribution, which maximizes the chance of good fitness produced once paralleled. We show that the approach maintains optimized results, improves speed in comparing to the original NSGA-II and overcomes the disadvantages of previous island model-based algorithms.
1 Introduction Optimization is a well-known issue in computer science, which find their usage in many practical applications, such as scheduling or shortest path finding. In general, L. H. Hoang (B) · N. V. Long · N. N. T. Phuong · H. M. Hoang · Q. T. Tho Ho Chi Minh University of Technology, Ho Chi Minh City, Vietnam e-mail: [email protected] N. V. Long e-mail: [email protected] N. N. T. Phuong e-mail: [email protected] H. M. Hoang e-mail: [email protected] Q. T. Tho e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_16
this problem falls into N P-hard class which makes best solution a challenging problem to be addressed properly. Genetic algorithms (GAs) [1–3] have been emerging as powerful approaches for this problem. Basically, GA algorithm, inspired by Charles Darwin’s theory of natural evolution, can be referred to as search and optimization tools, which work differently compared to classical search and optimization methods. Because of their broad applicability, ease of use, and global perspective, GAs have been increasingly applied to various search and optimization problems in the recent past. Thereafter, GAs to handle constrained optimization problems are described. Because of their population approach, they have also been extended to solve other search and optimization problems efficiently, including multimodal [3], multiobjective [2–7] and scheduling problems [3], as well as fuzzy-GA [2, 3] and neuro-GA [2, 3] implementations. Once one needs to deal with multiobjective for an optimization problem, it is regarded as a multiobjective one. However, multiobjective evolutionary algorithms that use non-dominated sorting and sharing have been criticized for their high computational complexity, nonelitism approach and the need for specifying a sharing parameter. Notably, NSGA-II [8] is a well-known approach emerging among other multiobjective evolutionary algorithms due to its advantages. Firstly, NSGA-II reduces the computational complexity from O(M N 3 ) to O(M N 2 ) (where M is the number of objectives and N the population size). Secondly, NSGA-II is an elitist multiobjective object algorithm which enhances the performance of GA significantly. Lastly, NSGA-II is a parameter-less diversity-preservation mechanism. However, NSGA-II has its own pros and cons. NSGA-II implemented the crowding distance computation in ranking and evaluating non-dominated individuals. The crowding distance value of a solution provides an estimation of the density of solutions surrounding that solution or in short, which is the average distance of its two neighboring solutions. This approach can somewhat restrict convergence because when more than N members belong to the first non-dominated set in the combined parent-offspring population, some optimal solutions may give their places to other non-dominated yet non-optimal solutions. Besides, the main difference between single- and multi-objective evolutionary algorithms seems to be that in the multiobjective case, a set of solutions is sought rather than a single one. Therefore, it is natural to assign different parts of the population to different processors. However, the problem here is to ensure the optimal solution is still discovered among the divided population, where the chances of the crossover are now significantly less. Additionally, the running time of original NSGA-II is remarkably slow. Therefore, parallelizing this algorithm without significantly decreasing the accuracy compared with the original version is necessary. There are a number of prior papers that worked on parallelizing NSGA-II [10–13], and [1, 9]. However, several common drawbacks they shared are struggling on converging global optimal, high communication cost and especially not combining good individuals from separated sub-populations. They may turn down the chance of creating optimal offspring. In this paper, we propose a novel parallel version of NSGA-II. Still based on the island based approach, however we introduce a strategy, so-called fitness
Fig. 1 Flowchart of genetic algorithm (initial population → selection → mating → crossover → mutation → terminate?)
redistribution, to allow the “islands” to exchange their good fitness reasonably. This strategy significantly increases the chance that good individuals can mate with each other, hence strengthening the probability of generating the optimal solution. In the following parts, we recall some preliminaries of NSGA-II in Sect. 2. In Sect. 3, we review some related work on parallelizing GA. Section 4 gives the underlying idea of the fitness redistribution strategy, followed by a detailed discussion in Sect. 5. Next, we show the results of the parallel version in solving classical and real-life problems and compare them with the results of NSGA-II in Sect. 6. Finally, in Sect. 7, we outline the conclusion of this paper.
2 Preliminaries 2.1 Genetic Algorithm (GA) GA is an adaptive method that can be used to find exact or approximate solutions to optimization and search problems. Figure 1 describes how it works. Before GA runs, the problem is represented as a set of parameters (called genes). Each solution is a combination of genes and is often referred to as an individual or a chromosome. To determine how good each solution is, we devise a fitness function for the problem, which returns a “fitness score” for each individual so that individuals can be compared with each other.
Fig. 2 Example's illustration (objective space with f1 maximized and f2 minimized; solutions x1, x2, x3)
In the figure above, selection, crossover, and mutation are the genetic operators applied by GA to generate the new offspring. The most basic forms of these operators are as follows: • Selection. Select pairs from the population as parents for mating, based on their fitness: the higher the fitness value, the higher the chance to be selected. • Crossover. Each pair of parents exchange their genes to create a new offspring. • Mutation. Some genes of the new offspring are mutated to maintain the diversity of the population. Generally, GA terminates when either the maximum number of generations has been produced, or a satisfying fitness level has been reached for the population, or the given time budget has run out. Since the solutions (individuals) become better every generation, it is expected that they will converge to the optimal solution if the number of generations is large enough. However, it is not easy to apply GA to multiobjective problems, since the optimal solution of one objective may not be the optimal solution of the others. This issue is called objective conflict, which means we cannot optimize all objectives simultaneously. In this case, there may not exist a single best solution with respect to all objectives, but rather a set of solutions that dominate the rest when all objectives are considered; they are known as non-dominated solutions. Given two solutions x1 and x2 of a multiple-objective problem, if, considering all objectives one by one, solution x1 is no worse than x2 for all objectives, and x1 is strictly better than x2 in at least one objective, we say that solution x1 dominates x2, or that x2 is inferior to x1. Assume an optimization problem P with two objective functions f1 and f2, where our objectives are to maximize f1 but minimize f2. Figure 2 is the visualization of P in the 2D Cartesian coordinate system, with x1, x2, and x3 being three solutions of our problem. As we can see, solution x1 dominates solution x2, and there is no domination between x1 and x3. Non-dominated solutions are also referred to as Pareto solutions. The set of all Pareto solutions forms a Pareto front, which represents the problem trade-offs: the score
Fig. 3 Pareto front example (objective space with f1 maximized and f2 minimized)
can not be improved on one objective without being worsened in another. Figure 3 is an example for a Pareto front: Many strategies have been introduced to deal with objective conflict. One of them is based on non-dominated sorting procedure, which is known as non-dominated sorting genetic algorithm, or NSGA.
2.2 Non-dominated Sorting Genetic Algorithm (NSGA) Non-dominated Sorting Genetic Algorithm (NSGA) [12] is a well known variant of GA to specifically handle the multiobjective problem. Compared to GA, NSGA is only different in the way the selection operator works. The crossover and mutation operators remain as usual [12]. The flowchart of NSGA is in Fig. 4. In NSGA, before the selection phase, the population P will be classified into several non-dominated fronts by using non-dominated sorting. Then, all individuals in the first front will be assigned to a large dummy value, called shared value. Each front has its own shared value calculated from the shared value of the first front by the shared function and assigned for all individuals belong to it. The higher rank of the front, the lower shared value it got. Since this way will reduce objective functions to one function for calculating the shared value of each front, NSGA can deal with the objective conflict problem mentioned above. However, NSGA still suffers from some issues with computational complexity, population diversity, and elitism. To reduce these disadvantages, an improved version of NSGA, NSGA-II, has been introduced by [9].
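As a small illustration of the dominance relation and the front-by-front classification that NSGA and NSGA-II rely on, here is a plain sketch (all objectives are minimized for simplicity; this is not the papers' implementation and makes no claim about their complexity):

```python
# Dominance check and simple non-dominated sorting (all objectives minimized; illustrative sketch)
from typing import List

def dominates(p: List[float], q: List[float]) -> bool:
    """p dominates q if p is no worse in every objective and strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def non_dominated_fronts(objs: List[List[float]]) -> List[List[int]]:
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(objs[j], objs[i]) for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts

# Example: three solutions evaluated on (f1, f2); lower is better for both
print(non_dominated_fronts([[1.0, 4.0], [2.0, 5.0], [3.0, 1.0]]))
# -> [[0, 2], [1]]  (solutions 0 and 2 are mutually non-dominated; 1 is dominated by 0)
```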
2.3 NSGA-II NSGA-II is a fast and elitist multiobjective genetic algorithm [3]. In this improved version, the authors introduced three new innovations: fast non-dominated sorting,
Fig. 4 Flowchart of non-dominated sorting genetic algorithm (initialize population; classify into non-dominated fronts; assign dummy fitness with sharing; reproduction, crossover, mutation; repeat until gen ≥ maxgen)
Fig. 5 Crowding distance example [16]
fast crowded distance estimation procedure, and crowded comparison operator. With these new innovations, NSGA-II reduces the disadvantages of NSGA. • Fast non-dominated sorting: this quickly returns the population classified by non-domination rank, with lower time complexity compared to the original non-dominated sorting. The lower the rank a solution (individual) gets, the better it is. • Crowding distance: this estimates the distance between an individual and its two neighbors on the same Pareto front. While the non-domination rank is used to compare individuals on different fronts, the crowding distance is used to compare individuals on the same front; individuals with a large crowding distance are preferred, which preserves the population diversity. This technique not only keeps the population diverse but also ensures elitism (Fig. 5). • Crowded comparison operator: every individual in the population has two attributes: 1. non-domination rank (irank); 2. crowding distance (idistance). A partial order ≺n is defined as i ≺n j (i is better than j) if: 1. irank < jrank, or 2. irank = jrank and idistance > jdistance. Unlike supervised learning algorithms, whose possibility of overfitting increases as the training process gets longer, genetic algorithms and especially NSGA-II are not affected. This implies that, with NSGA-II and our parallelized version introduced in Sect. 5, once the number of generations is large enough, increasing the number of generations further will not downgrade the model's stability.
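A compact sketch of the two ingredients just described: the crowding-distance assignment within one front and the crowded comparison operator. Boundary solutions receive an infinite distance, as in the standard formulation; this is an illustrative re-implementation, not the original NSGA-II code:

```python
# Crowding distance within one front and the crowded comparison operator (sketch)
import math
from typing import List, Tuple

def crowding_distance(front_objs: List[List[float]]) -> List[float]:
    n, m = len(front_objs), len(front_objs[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front_objs[i][k])
        lo, hi = front_objs[order[0]][k], front_objs[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = math.inf       # boundary solutions are always kept
        if hi == lo:
            continue
        for pos in range(1, n - 1):
            i = order[pos]
            dist[i] += (front_objs[order[pos + 1]][k] - front_objs[order[pos - 1]][k]) / (hi - lo)
    return dist

def crowded_less(i: Tuple[int, float], j: Tuple[int, float]) -> bool:
    """i = (rank, distance); i is preferred over j if it has a lower rank,
    or the same rank and a larger crowding distance."""
    return i[0] < j[0] or (i[0] == j[0] and i[1] > j[1])

print(crowding_distance([[1.0, 4.0], [2.0, 3.0], [3.0, 1.0]]))   # [inf, 2.0, inf]
print(crowded_less((1, 0.7), (1, 0.2)))                          # True: same rank, larger distance
```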
3 Related Work Because crossover, mutation, and fitness evaluation can be performed independently on different individuals, GA is very suitable for parallelizing. The main problem is only in the selection phase as it requires the whole information about the performance of each individual in the population to determine which ones will be selected. There are some strategies to parallelize the GA algorithm which have been introduced. Two main approaches include the followings. 1. Master-slave: This is a straight-forward strategy that uses one processor for controlling over selection (master), and other processors only for crossover, mutation and evaluating the fitness of individuals (slaves). This approach is not efficient when the number of processors is very large due to the complexity of communication between processors. 2. Island model: In this model, every processor runs algorithms independently from start to the end. These processors (islands) exchange their good individuals (migrations) during the running time. Multiple objectives EAs included NSGA-II, need more effective parallelization since they have to search the whole set of Pareto optimal solutions, not only search for a single optimum. Some people using the master-slave strategy for parallelizing NSGA-II [12, 13]. However, due to its low communication overhead, the island model seems to be the most appropriate parallelization scheme for today’s predominant parallel computer architecture, which is simply a cluster of PCs [1]. We also found some related works of [11] and [10], which use the idea of island model. In Hiroyasu et al. [11], sub-populations are divided from the initial population by sorting based on one of the objectives (which is chosen in turn), then distributed to different processors (islands). In other words, each sub-population is a group of “similar” individuals with respect to the chosen objective. After some generations, if the terminal condition is not satisfied, these sub-populations will be gathered and re-distributed again by the above sorting approach. However, the separation process is somewhat arbitrary, and the gathering process also requires a lot of time. In Branke et al. [10] approach, the search space is divided into several regions and assigned to different processors to make each island focus on their specific region. Whenever an individual has a very good fitness but violating its island constraints, it will be migrated to the other one where there is no constraint violation. This approach produced excellent results in two objectives problems, but struggled on converging when increase the number of objectives.
Fig. 6 Idea of fitness redistribution strategy. The lighter ellipses represent for the islands or the sub-populations. The darker one represents for mainland or the main thread consisting no subpopulation but instead controlling the operations of the sub-populations. The (1) arrows imply that after each evolution, sub-populations send the best individuals. Mainland does some selecting before migrating to sub-populations the optimal individual which is represented by (2)
4 Underlying Ideas Inspired by island model, sub-populations, after each evolution, send their best individuals for the mainland. In turn, mainland chooses the optimal individuals and then migrates to all sub-populations. Note that mainland contains no sub-population but only controlling the operations of islands. Additionally, sub-populations are independent with each other. All mentioned above are illustrated in Fig. 6. Figure 7 describes the fundamental steps of our fitness redistribution strategy. The algorithm starts by initiating sub-populations, in other words, splitting the population into smaller ones. The number of small populations equals to the given number of threads or processes. We use the number of evolving generations as the stopping condition. Unlike supervised learning algorithms whose the overfitting degree is deeply effected by the number of epochs, in the case of genetic algorithms, the more generations the algorithm experiences, the higher possibility it finds out the optimal solution. In each sub-population, the evolution process occurs as similarly with the population in the traditional NSGA-II. After each finished evolution, sub-populations interchanges the best individuals in a specified method. We call those individuals as the elites. The elites, in turn, go through a process to find out the optimal individuals which in the final step are migrated to qualified sub-populations.
Fig. 7 Flowchart of the model-based algorithm of NSGA-II
To exemplify for the above procedure, assume that it is expected to have a population with 1000 individuals and the expected number of small populations, or the number of threads, is 4. Therefore, each small population contains 250 individuals. After each evolution, two elites are picked from each sub-population before participating in the process of choosing the optimal individual from chosen elites. The advantages of our strategy are as follows. • Ensuring that the possibility of finding the best individuals is similar to traditional NSGA-II. • Preventing bad individuals from migrating across other sub-populations. In comparison with tradition NSGA-II, there is only one population, which leads to the genetic operations such as crossover and tournament selection are operated entirely. Consequently, it is likely that bad genes in randomly chosen individuals can overwhelm good ones. • Exchanging elites after each evolution, it is ensured that all sub-populations can keep up with each other in terms of finding optimal individuals and intuitively increase the rate to find out the optimal solution. • Opening the search space to every populations rather than limiting as [10], which also allows sub-populations to reach the optimal solution faster. However, there are problems as follows. • Since an individual in a sub-population by no mean can do crossover with individuals in other sub-population, the general diversity of 4 sub-populations is not as good as one big population in traditional NSGA-II. In Sect. 5.2, we propose a technique to tackle this issue.
Fig. 8 Flowchart of an Evolution process. The inputs are individuals in each sub-population. They experience steps exactly the same with traditional NSGA-II. The output is new sub-population containing more qualified individuals
• One common problem of parallel algorithms is the race condition occurring when at least 2 sub-populations are dealing with common variables during evolving. This later will be mentioned again in Sect. 5.3.
5 The Model-Based Approach to Parallel NSGA-II with Fitness Redistribution 5.1 Sub-populations’ Organization and Evolution Process In comparison with traditional NSGA-II’s population, each sub-population is organized similarly while the quantity of individuals in each sub-population is less. Additionally, all sub-populations will solve the same objectives. Note that although the term sub-population is used, there is in fact no difference between that and the population in traditional NSGA-II. Figure 8 depicts 5 steps, which are explained in Table 1, a sub-population must experience in Evolution process. An Evolution process plays the role as creating new more qualified individuals. Again, it is noteworthy that although the term subpopulation is used, there is in fact no difference between that and the population in traditional NSGA-II.
5.2 Fitness Redistribution Strategy After each Evolution process, all sub-populations experience the most important step in this algorithm namely fitness redistribution strategy which is illustrated in Fig. 9. Before diving into explaining, some terminologies used should be examined. • Elites are individuals standing at the top of each sub-individual after a loop.
194
L. H. Hoang et al.
Table 1 The role of steps in an evolution process Step Role Fast-dominated sort
Calculate crowding distance Generate new individuals
Calculate crowding distance again
Eliminate unqualified individuals
Group individuals who dominate or are dominated by other individuals into so-called Pareto fronts. In this step, the rank of individuals is also calculated Calculate crowding distance for each individual in sub-population Take advantage of genetic operations such as tournament selection, crossover and mutation to create new individuals from ones in sub-population. Additionally, crowding distance and rank calculated previously are used in tournament selection Calculate crowding distance for both old and new individuals at the Pareto front whose the accumulated number of individuals from the first Pareto front to that exceeds the limit of individual quantity of each sub-population This only takes place at the Pareto front in the above step. After this step, new individuals are obtained
Fig. 9 Flowchart of the step fitness redistribution strategy. Elites are chosen from each subpopulation and pushed into several small steps to select optimal individuals. Optimal ones (the grey ones) are migrated into selected sub-populations
• Optimal individuals are elites who dominate the other during selecting process. • Qualified (Selected) sub-populations are ones to which optimal individuals do not belong. For example, elite A is taken from first sub-population. After series of steps, A becomes one of optimal individuals. Thereafter on choosing subpopulation for migration, A cannot be migrated into first one but rather second, third or fourth. In the first step Selecting elites, 2 first individuals in each sub-population are chosen to be elites for that sub-population. The reasons for the number 2 are:
• After several first Evolution processes, individuals in a sub-population have unique genes which only exist in the scope of that sub-population. If its elites become the optimal individuals after all, it is believed that its unique genes are good too and should be migrated over other sub-population. However, if only one elite of that sub-population is populated to other, it is likely that those good unique genes are increasingly disappeared during next Evolution process. As a result, more than single individual are selected to become elites of a sub-population. • If we choose more than 2 individuals as “elites”, say 3, it is acceptable if they are belonging the same Pareto fronts, in other words no elite individual dominate the other. However, if some of 3 individuals, say 1 or 2 belong to another front and they win in the race of choosing optimal individuals, they will be migrated to other sub-population. It is not good because migrated individuals are bringing bad genes. The number 2 is appropriate because of 2 elites, if one of them belongs to another Pareto front, which is containing bad genes, those bad genes will be increasingly disregarded in next Evolution processes. The elites chosen from sub-populations participate in the race to find optimal individuals. The race contains 2 steps, one of which is already appeared in the Evolution process and it is Doing fast-dominated sort. Instead of taking all 2 elites of each sub-population, the first-ranked one is picked to participate in the race. This idea derives from the notion that we are finding individuals containing the best genes to migrate. Naturally, it is believed that if one individual brings the good gene, this will populate to the entire population and the most elite individual is one who expresses the good gene the most remarkably. When we make a comparison between 2 elites of 2 different populations, if elite of population 1 dominates one of population 2, it is implied that population 1 contains more superior gene(s). With that in mind, in this algorithm, we want to find the sub-population containing the best gene(s) and populate to other ones. To do so, we will look for the best among the best elites chosen from each sub-population. If one elite dominates all the other, we will take not only that but the elite of the same sub-population to migrate. That is why taking one elite from each sub-population to feed into Fast-dominated sort is sufficient. In the end, optimal individuals are migrated into qualified sub-population.
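One reading of this selection-and-migration step is summarized by the sketch below: two elites per island, a dominance race among the islands' first-ranked representatives, and migration of the winning islands' elite pairs into every other island. Individuals are represented simply as objective vectors (minimization), and the per-island ordering stands in for the rank/crowding sort of Sect. 5.1:

```python
# Sketch of the fitness redistribution step (illustrative reading of Sect. 5.2, not the authors' code)
from typing import List

Individual = List[float]

def dominates(p: Individual, q: Individual) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def redistribute(islands: List[List[Individual]]) -> None:
    # 1. two elites per island (islands are assumed already sorted best-first)
    elites = [island[:2] for island in islands]
    # 2. race among the first-ranked elite of each island
    representatives = [e[0] for e in elites]
    winners = [k for k, rep in enumerate(representatives)
               if not any(dominates(other, rep)
                          for j, other in enumerate(representatives) if j != k)]
    # 3. migrate both elites of each winning island into every other island
    for k in winners:
        for j, island in enumerate(islands):
            if j != k:
                island.extend(elites[k])

islands = [[[1.0, 1.0], [2.0, 2.0]], [[3.0, 0.5], [4.0, 4.0]], [[5.0, 5.0], [6.0, 6.0]]]
redistribute(islands)
print(islands[2])   # the third island now also contains the elites migrated from islands 1 and 2
```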
5.3 Race Condition Problem in Parallel Algorithm It is noteworthy to mention the common problem parallel algorithms have to deal with, the Race condition. That occurs when two or more threads are interacting with the same variables concurrently. In this algorithm, Race condition takes place when sub-populations during Evolution step interact with the global variable(s) used by fitness function(s). Consequently, the Evolution step results in the wrong and
disordered result in each sub-population. Note that this problem arises only when fitness functions use global variables that are outside the scope of functions. To address this issue, we propose a simple but effective solution: Duplicating fitness functions. Let’s explain the aforementioned idea by the following example. Assume there are 2 sub-populations and they are solving 3 fitness functions denoted as f 1 , f 2 and f 3 . Let’s also assume that f 1 is using global variable x1 , x2 and f 3 is using global variable x3 . To avoid the Race condition, we duplicate the following: • x1 to x11 and x12 . x11 and x12 are separated but holding exactly the same value with x1 . If x1 holds the class instance, x11 and x12 must be 2 different instances with the same value. Similar duplications are taken place with x2 and x3 . • f 1 to f 11 and f 12 . Technically, f 11 and f 12 are similar with f 1 . The only differences are while f 1 uses global variable x1 and x2 , f 11 uses global variables x11 , x21 and f 12 uses global variables x12 , x22 . Similar duplication is taken place with f3. With the above procedure, we ensure that all sub-populations solve the similar fitness functions which are using similar global variables. However, its drawbacks are: • The used memory is enlarged. • The preparation time for forming global variables and fitness functions before starting the algorithm takes longer.
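A minimal sketch of the duplication idea: each island receives its own deep copy of the shared state and its own fitness closures bound to that copy, so concurrent evolutions never touch the same mutable objects. The names f1, f2, f3 and x1, x2, x3 follow the example above, but the concrete formulas are made up for illustration:

```python
# Per-island duplication of global state and fitness functions (sketch)
import copy
from typing import Callable, Dict, List

shared_state = {"x1": [1.0, 2.0, 3.0], "x2": 10.0, "x3": {"penalty": 0.5}}

def make_island_fitness(state: Dict) -> List[Callable[[List[float]], float]]:
    """Build f1, f2, f3 bound to this island's private copy of the globals."""
    def f1(ind):  # uses x1 and x2
        return sum(w * g for w, g in zip(state["x1"], ind)) - state["x2"]
    def f2(ind):  # uses no shared state
        return sum(g * g for g in ind)
    def f3(ind):  # uses x3
        return state["x3"]["penalty"] * max(ind)
    return [f1, f2, f3]

n_islands = 2
island_states = [copy.deepcopy(shared_state) for _ in range(n_islands)]
island_fitness = [make_island_fitness(s) for s in island_states]

# Each island can now evolve in its own thread/process without sharing mutable objects.
print([f([0.5, 1.0, 1.5]) for f in island_fitness[0]])
```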
6 Experiments All experiments are done with the following settings: • Number of generations is 1000 • Number of individuals is 70 • Each test is run 3 times and the final results are average numbers.
6.1 Experiments with Deterministic Problems Deterministic problems are ones we know the solution, or the optimal value, beforehand by using different methods such as Dynamic programming. In other words, it is possible to find optimal values of those by other methods and we can use those values to verify whether our proposed method, denoted as P-NSGA-II, and the traditional NSGA-II reach the optimal point. In details, we do experiments with Travelling salesman problem and Knapsack problem. Knapsack problem (KNAP) is a famous candidate in combinatorial optimization family. It is described as follows.
Towards Parallel NSGA-II ...
197
Table 2 Results in running times (seconds) and optimal values (no measurement) of traditional NSGA-II, P-NSGA-II (2 threads) and P-NSGA-II (4 threads) on 2 problems with different data sets. Algorithms TSP5 Time
TSP17
TSP42
TSP48
KNAP
Value
Time
Value
Time
Value
Time
Value
Time
Value
Traditional 23.17 NSGA-II
19
14.32
2569
29.90
699
29.62
111560 39.87
193458
P-NSGA-II (2 threads)
3.41
19
8.83
2891
10.00
699
18.55
112570
7.23
193458
P-NSGA-II (4 threads)
1.04
19
6.30
2891
3.64
699
7.47
115918
2.92
193458
Given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible.
Travelling salesman problem (TSP) is another problem in combinatorial optimization family. It is described as follows. Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city.
This problem has many applications in logistic, planning and manufacturing microchips. Additionally, it is used to test optimization algorithms. There are many methods to solve it such as Heuristic algorithms. In this experiment, the optimal value of this problem is the lowest total distance of routes connect cities. Also, we make tests with 4 data sets. The number of cities in data sets are 5, 17, 42 and 48 respectively. As you may see that we do test this problem with more data sets than Knapsack problem. This is because Travelling salesman problem is a kind that there does not exist such a general algorithm to fulfillingly solve, which is unlike Knapsack problem that we have Dynamic programming. That characteristic of Travelling salesman problem theoretically makes it impossible to know the optimal values before using P-NSGA-II and traditional NSGA-II. However, we still categorize it as Deterministic because the data sets we use do provide corresponding optimal values. Table 2 demonstrates the running times and optimal values of 5 tests. Looking at TSP problems with 4 data sets, it can be seen that in terms of time, P-NSGA-II (2 threads) runs faster than Traditional NSGA-II from 1.6 to 6 times. When the number of threads doubles, the running times of P-NSGA-II are even faster with nearly 2 and 4 times in TSP17 and TSP42. TSP42, KNAP and TSP5 witness significant increases with 8, 13 and 22 times respectively. When it comes to optimal value, TSP5, TSP 42 and KNAP reach the same value after running. Interestingly, these values are trully optimal ones. However, in TSP17 and TSP48, not only Traditional NSGA-II but P-NSGA-II do not reach optimal values. From what has been analysed, it can be easily seen that problems which cannot find out their corresponding optimal values witness the slower in the increase of running time.
198
L. H. Hoang et al.
Fig. 10 Example of data for PO problems. Each row represents an entry of a house. Each house has 39 attributes in the begining but during the reducing step, they decrease to 5
6.2 Experiments with Non-deterministic Problems Non-deterministic problem is a group including ones that there is no general method to find truly optimal value. Heuristic algorithms especially Genetic algorithms are widely used but the results are not ensures the optimality. In this section, we test P-NSGA-II with Price Optimization problem — a practical problem which is the motivation for the breed of P-NSGA-II. Price Optimization problem (PO) is defined as follow: There is a set of houses. Each house has 39 attributes such as house price, house area, geolocation... but we use a special formula to reduce to 5 value attributes only. We are given a set of rules to sort houses with respect to those attributes. We are also given a formula to calculate the score of each house. The formula has inputs are 5 attributes of a house and adjustable parameters. The question is what are parameters of the formula such that for each house, the rank sorted by its score is closed exactly with the rank sorted by the given rule.
Figure 10 demonstrates a piece of data used in PO problems. Each row represents a house. Each house has several initial attributes. Later, they are reduced to 5 served for PO problems. In fitness function, Mean Square Error is used as the fitness value. It calculates the square of total square of differences between two ranks of houses. In experiments, we test PO with 3 different data sets containing nearly 500, 600 and 4000 houses respectively. As in Table 3, the results are more optimistic compared with ones of Table 2. In all 3 tests, P-NSGA-II all reach very closed value with Traditional NSGA-II while the running times are significantly lower. For example, with P-NSGA-II running on 2 threads, the running times are from 1.5 to 1.7 times lower. Moving to 4 threads, P-NSGA-II runs 3.4 to 4.2 times faster in comparison with Traditional NSGA-II. It is empirically concluded that P-NSGA-II can reach the result of NSGA-II with lower in time.
Towards Parallel NSGA-II ...
199
Table 3 Results in running times (seconds) and optimal values (no measurement) of traditional NSGA-II, P-NSGA-II (2 threads) and P-NSGA-II (4 threads) on 3 different data sets of PO problem Algorithms PO500 PO600 PO4000 Time Value Time Value Time Value Traditional NSGA-II 55.94 P-NSGA-II (2 threads) 32.77 P-NSGA-II (4 threads) 13.45
1976.01 1976.62 1976.63
62.91 38.42 15.79
2828.32 2830.99 2831.48
381.65 241.98 112.65
60067.74 60085.29 60081.87
7 Conclusion In our paper, we have implemented Parallel - Non-dominated Sorting Genetic Algorithm (or P-NSGA-II) using a novel fitness redistribution strategy in solving deterministic and non-deterministic problems, it is clear that P-NSGA-II is a more efficient approach with equivalent result but much faster run time. As mentioned in Sect. 4, the run-time of P-NSGA-II algorithm is drastically reduced from the traditional approach, ranging from 3 to 20 times better which depends on the complexity of the problems. Most obvious results are shown in Table 2 and Table 3, those problems represent some of the most common problems that can use Genetic Algorithm as solutions. With the increment in executing speed and sustainable result, it can by far replace the old method NSGA-II approach. In our experiments, however, because of limitation of CPU core in our personal computers, we cannot fully implement our algorithm with more than 4 threads. But if it is extended, we are highly confident that P-NSGA-II can fully utilize system resources. Although critical section’s conflict problem is still present if the problem is poorly build or shared data is arbitrarily accessed, it is improvable in future version along with wider range of threads tested for the optimal number. Here is the code for this paper https://github.com/tommyjohn1001/parallel-nsga2. Acknowledgements We want to say thank to Atomic company for creating such a good opportunity to studying and publishing this research. They also provided data for experimenting the algorithm.
References 1. Hassani, A., Treijs, J.: An overview of standard and parallel genetic algorithms (2009) 2. Deb, K.: An introduction to genetic algorithms. Sadhana 24, 293–315 (1999). https://doi.org/ 10.1007/BF02823145 3. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6, 182–197 (2002). https://doi.org/10.1109/4235. 996017 4. Srinivas, N., Deb, K.: Muiltiobjective optimization using nondominated sorting in genetic algorithms. Evol. Comput. 2, 221–248 (1994). https://doi.org/10.1162/evco.1994.2.3.221
200
L. H. Hoang et al.
5. Guo, Z., Yang, J., Wu, F.-Y., Tan, C., Zhao, H.: Application of an adaptive multi-population parallel genetic algorithm with constraints in electromagnetic tomography with incomplete projections. Appl. Sci. 9, 2611 (2019). https://doi.org/10.3390/app9132611 6. Gonçalves, J., Resende, M.: A multi-population genetic algorithm for a constrained twodimensional orthogonal packing problem. J. Comb. Optim. 22, 180–201 (2011). https://doi. org/10.1007/s10878-009-9282-1 7. Shi, X., Long, W., Li, Y., Deng, D.: Multi-population genetic algorithm with ER network for solving flexible job shop scheduling problems. PLoS ONE 15, e0233759 (2020). https://doi. org/10.1371/journal.pone.0233759 8. Srinivas, N., Deb, K.: Muiltiobjective optimization using nondominated sorting in genetic algorithms. Evol. Comput. (EC) 2, 221–248 (1994). https://doi.org/10.1162/evco.1994.2.3. 221 9. Deb, K., Zope, P., Jain, A.: Distributed computing of pareto-optimal solutions with evolutionary algorithms. In: Lecture Notes in Computer Science, pp. 534–549 (2003). https://doi.org/10. 1007/3-540-36970-8_38 10. Branke, J., Schmeck, H., Deb, K., Reddy, S.M.: Parallelizing multi-objective evolutionary algorithms: cone separation. In: Proceedings of the 2004 Congress on Evolutionary Computation (IEEE Cat No. 04TH8753) (2004). https://doi.org/10.1109/cec.2004.1331135 11. Hiroyasu, T., Miki, M., Watanabe, S.: New model of parallel genetic algorithm in multiobjective optimization problems - divided range multi-objective genetic algorithm. In: Proceedings of the Congress on Evolutionary Computation, vol. 1, pp. 333–340 (2000). https:// doi.org/10.1109/CEC.2000.870314 12. Lanˇcinskas, A., Žilinskas, J.: Approaches to parallelize pareto ranking in NSGA-II algorithm. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wa´sniewski, J. (eds.) Parallel Processing and Applied Mathematics, PPAM 2011. Lecture Notes in Computer Science, vol. 7204. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31500-8_38 13. Durillo, J., Nebro, A., Luna, F., Alba, E.: A study of master-slave approaches to parallelize NSGA-II. In: Nature Inspired Distributed Computing (NIDISC) Workshop of the (IPDPS), pp. 1–8 (2008). https://doi.org/10.1109/IPDPS.2008.4536375 14. Knysh, D., Kureichik, V.: Parallel genetic algorithms: a survey and problem state of the art. J. Comput. Syst. Sci. Int. 49, 579–589 (2010). https://doi.org/10.1134/S1064230710040088 15. Sutar, S.R., Bichkar, R.S.: Parallel genetic algorithm for high school timetabling. Int. J. Comput. Appl. 170, 1–5 (2017). https://doi.org/10.5120/ijca2017914851 16. Yu, X., Gen, M.: Introduction to Evolutionary Algorithms. Springer, London (2010) 17. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge (1996)
A Radial Basis Neural Network Approximation with Extended Precision for Solving Partial Differential Equations Thi Thuy Van Le, Khoa Le-Cao, and Hieu Duc-Tran
Abstract In this paper, a three-nodes integrated radial basis function networks (RBFNs) method with extended precision is reported for numerical solutions of partial differential problems. RBFNs can be considered as a universal approximation scheme, and have emerged as a powerful approximation tool and become one of the main fields of research in the numerical analysis [1]. Derivative approximations of variable fields are computed through the radial basis functions. They have the properties of universal approximation and mesh-free discretisation. Substantial enhancements in the solution accuracy, matrix condition number, and high convergence rate are achieved. Keywords Radial basis functions neural network · Partial differential problems · Numerical analysis
1 Introduction Popular discretisation methods in PDEs numerical computation include spectral methods, finite-difference (FD), finite-volume (FV), finite-element (FE), and meshfree methods based on radial/integrated radial basis functions (RBF/IRBF methods). This work is involved with the improvement of a computational framework that achieves both simultaneous accuracy and efficiency for solving PDEs. The proposed framework is underpinned by the exploitation of strengths of several recent advanced numerical tools, namely (i) the high-order accuracy of compact and IRBF approximations [2, 3]; (ii) the sparse system matrices of local three-point-stencils; (iii) the ability to produce higher accuracy in the flat regime of kernel RBF [4, 5]; (iv) and the flexible of the high-order approximation technique for problems defined on T. T. V. Le (B) · H. Duc-Tran Institute of Applied Mechanics and Informatics, Vietnam Academy of Science and Technology (VAST), Ho Chi Minh City, Vietnam e-mail: [email protected] K. Le-Cao Department of Mechanical Engineering, National University of Singapore, Singapore, Singapore © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_17
201
202
T. T. V. Le et al.
non-rectangular domains [6, 7]. It is expected that the outcome of the work will offer advantages for solving PDEs which govern many practical applications.
2 Proposed Numerical Procedure 2.1 IRBFN High-Order Approximations A smooth function and its derivatives of orders up to l can be approximated by an IRBFN-l scheme [2]. The IRBFN construction of the approximations for a network q node xi requires a set of local points {xi }i=1 plus two boundary points x H 1 and x H 2 . Second-order derivative of f is disintegrated to RBFs; expressions for first-order derivative and the field variable itself are then accomplished through the integration process. Noted that IBBF approximation is also applicable for time discretisation [8]. ∂ 2 f (x) = w g (x) = wi I¨i (x), i i ∂x2 i=1 i=1
(1)
∂ f (x) ˙ = wi Ii (x) + c1 , ∂x i=1
(2)
m
m
m
f (x) =
m
wi Ii (x) + c1 x + c2 ,
(3)
i=1 m where m = q + 2 is the total number of RBFs on the local region, {gi (x)}i=1 the m ¨ } {w the network weights to be determined, I (x) = radial basis functions, i i i=1 gi (x)d x d x and c1 , c2 the integral congi (x), I˙ i (x) = gi (x)d x, Ii (x) = stants. These constants are used to add boundary conditions to the conversion from the RBF weight space to physical space. For more details please refer to [2].
2.2 Flat Kernel Integrated RBF Flat kernel RBF was employed in [4, 5]. Here, the influence domain is a three-node stencil [xi−1 , xi , xi+1 ] (Fig. 1). The three-node stencil is shifted along the grid line (i.e. i runs from 1 to q). The construction of the present local-flat-IRBF approximations for a grid node xi involves only three nodes [xi−1 , xi , xi+1 ], rather than all the RBF network nodes. The inversion matrices is now a series of small matrices (one for each three-node RBFs). To improve the accuracy, second derivatives of the field variable obtained from the previous calculation step (iterative solver) are incorporated into the three-node stencil. The method of local-flat-IRBFs, therefore, results in a significant decrease in computational expense in solving system matrices
A Radial Basis Neural Network Approximation ...
203
in comparison with global RBFN methods. The proposed technique applies a form of multiquadric function as follows I¨i (x) =
(x − ei )2 + ai2 ,
a2 (x − ei ) A + i B, I˙ i (x) = 2 2 2 a 2 (x − ei ) −ai (x − ei )2 A+ i Ii (x) = + B, 3 6 2 where A =
(x − ei )2 + ai2 ; and B = ln (x − ei ) + (x − ei )2 + ai2 ; ai and ei
are the width and the centre of the ith RBF node, respectively. A set of collocation points is chosen to be a set of centers. We choose the width according to ai = βdi , where β-a scalar and di the shortest distance between the centre ei and its neighbours. The MQ function is being gradually flat when ai → ∞. Evaluation of (2) and (3) at collocation points xi−1 , xi and xi+1 results in ⎞ ⎛ ⎞ ⎛ f (xi−1 ) wi−1 I ⎜ wi ⎟ ⎜ f (xi ) ⎟ ⎟ ⎟ ⎜ B ⎜ ⎜ wi+1 ⎟ ⎜ f (xi+1 ) ⎟ = (4) ⎜ ⎟ ⎜ ∂ 2 f (xi−1 ) ⎟ ⎝ ⎠ ⎠ ⎝ c 1 ∂x2 C ∂ 2 f (xi+1 ) c2 2 ∂x
where ⎡
⎤ Ii−1 (xi−1 ) Ii (xi−1 ) Ii+1 (xi−1 ) xi−1 1 I = ⎣ Ii−1 (xi ) Ii (xi ) Ii+1 (xi ) xi 1 ⎦ , Ii−1 (xi+1 ) Ii (xi+1 ) Ii+1 (xi+1 ) xi+1 1 I¨i−1 (xi−1 ) I¨i (xi−1 ) I¨i+1 (xi−1 ) 1 0 , B= ¨ Ii−1 (xi+1 ) I¨i (xi+1 ) I¨i+1 (xi+1 ) 1 0
The unknown network weights w can be solved by the system (4) ⎛
⎞ ⎛ ⎞ f (xi−1 ) wi−1 ⎜ wi ⎟ ⎜ f (xi ) ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ wi+1 ⎟ = C −1 ⎜ f (xi+1 ) ⎟ ⎜ ⎟ ⎜ ∂ 2 f (xi−1 ) ⎟ ⎝ c1 ⎠ ⎝ ⎠ ∂x2 2 ∂ f (x ) i+1 c2 2 ∂x
where C −1 can be found by inverting the matrix C.
(5)
204
T. T. V. Le et al.
Fig. 1 A 3 nodes RBF on a grid line. It is noted that the grid line includes interior points xi and boundary points x H 1 , x H 2
The first and second derivatives of f at the nodal point xi are calculated from (2) and (5) ⎛ ⎞ f (xi−1 ) ⎜ f (xi ) ⎟ ⎜ ⎟ . . ∂ f (xi ) . −1 ⎜ f (x i+1 ) ⎟ (6) = I i−1 (xi ) I i (xi ) I i+1 (xi ) 1 0 C ⎜ 2 ⎟, ∂x ⎝ ∂ f (x2i−1 ) ⎠ ∂x ∂ 2 f (xi+1 ) ∂x2
and (1) ⎛
⎞ f (xi−1 ) ⎜ f (xi ) ⎟ ⎜ ⎟ .. .. ∂ 2 f (xi ) .. −1 ⎜ f (x i+1 ) ⎟ C = (x ) (x ) (x ) 0 0 I I I i−1 i i i i+1 i ⎜ ⎟. 2 ∂x2 ⎝ ∂ f (x2i−1 ) ⎠
(7)
∂x ∂ 2 f (xi+1 ) ∂x2
Consider a grid line in Fig. 1, for a special point xi lying next to a boundary point x H 1 or x H 2 , one has f (xi−1 ) = f (x H 1 ) if xi ≡ x1 , and f (xi+1 ) = f (x H 2 ) if xi ≡ xq . RHSs of (6) and (7) can be decomposed into two terms: one associated with nodal variable values and the other with nodal derivative values (denoted by kˆ1x for Eq. (6) and kˆ2x for Eq. (7) ). Assume that the values f (x H 1 ), f (x H 2 ), ∂ f 2 (xi−1 )/∂ x 2 , and ∂ f 2 (xi+1 )/∂ x 2 are given. The approximated expression for f and its derivative are written in compact forms ∂f 1x f + k1x , =D ∂x
(8)
A Radial Basis Neural Network Approximation ...
and
where
205
2f ∂ 2x =D f + k2x , ∂x2
(9)
⎛ ∂ f (x1 ) ⎞ ∂x
⎜ ∂ f (x2 ) ⎟ ∂f ⎜ ∂x ⎟ = ⎜ . ⎟, ⎝ .. ⎠ ∂x ∂ f (xq ) ∂x
⎛ ∂ 2 f (x ) ⎞ 1
⎜ ∂ 2∂fx(x2 ) ⎟ 2f ⎜ ∂x2 ⎟ ∂ ⎜ . ⎟, = ⎜ . ⎟ ∂x2 ⎝ . ⎠ 2
∂ 2 f (xq ) ∂x2
1x and D 2x are the product of two matrices on the right hand side the matrices D of Eqs. (6)–(7), and the vectors kˆ1x and kˆ2x are related to boundary conditions and second derivative of f (compact component). The three-nodes IRBF expressions for derivatives are already derived as functions of the nodal function of f and take in to account the given value of boundary conditions. Consequently, one just has to put them to the governing equation. Through applying the governing equation to all inner RBF node, a square matrix representing algebraic equation framework is generated. As reported in FD, FV, and FE methods, the presented IRBF approximation matrices are also generated by a similar assembly process. Numerical results show that the conversion matrices for both local and compact RBF are ill-conditioned when the RBF width is large, so some special treatments are needed. There are some techniques employed for flat RBF. For example, using Contour-Pade algorithm to construct the RBF interpolation, when the basis function are flat, the numerical solution is stable [4]. Different from Fornbergs approaches, here we simply use extended precision (e.g. function VPA or variable precision arithmetic in MATLAB)-a straightforward way to handle ill-conditioned problems. It is expected that the usage of higher precision will enhance the stability of RBF numerical solution. The employment of function VPA to raise the number of significant decimal digits from 32 to 50 is applied for building up the conversion matrix as well as calculating its inverse, but there is a requirement of computational cost. However, it is acknowledged that by specifying the stencil on a unit length, the conversion matrix needs to be computed only once and the output can be scaled to any size of the grid.
206
T. T. V. Le et al.
3 Numerical Examples 3.1 Flat IRBF Solution of ODE A typical second order ODE has the form of 399 −5x du 2 )e , = −10025 cos(100x − arctan 2 dx 40
(10)
The analytical solution is u e (x) = (2 tan(50x)e−5x )/(tan(50x)2 + 1)
(11)
The proposed IRBF method is applied to solve (10) and the obtained results are compared with (11). It is found that the value of condition number of the matrix A is very small and keeps similar over a wide variety of β, however the condition number of C increases quickly at a rate of 4.6 and then becomes ill-conditioned at a high value of β (Fig. 2). To handle this problem, employing VPA approach, extended precision is employed as a treatment to solve this equation with 0 ≤ x ≤ 1, and grid size N x = 1021. As seen in the Fig. 3, the 3-nodes IRBF solution is robust at high values of β as well as the optimum value of β is explicitly observed.
Fig. 2 ODE: Condition number of A (system matrix) and C (conversion matrix) depend on RBF width β
A Radial Basis Neural Network Approximation ...
207
3.2 Flat IRBF Solution of PDE Considering second-order PDE as follows. ∂2 f ∂2 f + = −18π 2 sin(3π x) sin(3π y) ∂x2 ∂ y2
(12)
As β rises, the increasing rate is approximately 3.08 for condition number of C and 0.00 for that of A matrix (Fig. 4).
Fig. 3 ODE: Numerical solution performance obtained by 32 digits (double precision) and 50 digits (extended precision)
Fig. 4 PDE: Condition number of A (system matrix) and C (conversion matrix) depend on RBF width β
208
T. T. V. Le et al.
Fig. 5 PDE: Investigation of β and N e 81 × 81
The presented approach is thus evaluated by means of a solution on a two dimensional domain. The test problem is governed by the equation (12) and has Dirichlet boundary conditions. The interested area is the domain within the 1 × 1rectangle. The analytical solution to this example is sin(3π x) × sin(3π y) from which the Dirichlet boundary conditions can be determined in an analytical manner. A wide range of β, namely {2, 4, · · · , 100}, is then used to analyse the convergence rate of the numerical solution. The solution is achieved with a tolerance of 1 × 10−9 . Results concerning the error N e and β are given in Fig. 5. It can be seen that the local-Flat-IRBFN solution is robust even at high value of β, also the optimum value of β is explicitly described. Condition numbers of the system matrix, a very important property for the direct solver (i.e. inverse of system matrices), is relatively low (e.g. 104 for a grid of 111 × 111). With a local approximation property, the present method results in a sparse system matrix (Fig. 6). Comparison of the computational accuracy between the 1D-IRBF, Higher-order compact finite difference methods (HOCFD) [10], couple compact IRBF (CIRBF) [9] and the present flat (IRBF) is shown in Table 1, where the grid increases as {21 × 21, 31 × 31, ...}. It can be seen that the Flat kernel IRBF provides the most accurate results. For example, it can reach a low error 1.8 × 10−5 at grid (31 × 31). In order to get the same accuracy, 1D-IRBF needs a grid of 81 × 81; HOCFD method, CCIRBF need a grid of 41 × 41. The mesh convergence of 1D-IRBF, HOCFD, CCIRBF and the present flat IRBF is illustrated in Fig. 7. As shown in Fig. 7 the flat kernel IRBF is most accurate.
A Radial Basis Neural Network Approximation ... Fig. 6 PDE: Local-flatIRBFN yields symmetric and sparse system matrices
209
0 100 200 300 400 500 600 700
0
100
200
300 400 500 nz = 3808
600
700
Table 1 PDE: Accuracy obtained by the other RBFNs (1D-IRBF, HOCFD [10], CCIRBF [9]) and the proposed flat-IRBF methods (noted that a(−b) equivalents to a × 10−b ). Grid RMS 1D-IRBF HOCFD CCIRBF Present flat IRBF 21 × 21 31 × 31 41 × 41 51 × 51 61 × 61 71 × 71 81 × 81 91 × 91 101 × 101 111 × 111
1.2311(−3) 3.6879(−4) 1.5624(−4) 7.9915(−5) 4.6060(−5) 2.8837(−5) 1.9185(−5) 1.3375(−5) 9.6748(−6) 7.2123(−6)
3.3492(−4) 5.6674(−5) 1.4594(−5) 4.7148(−6) 1.9227(−6) 9.2935(−7) 4.6935(−7) 3.0597(−7) 1.5204(−7) 1.4662(−7)
2.5405(−4) 4.2362(−5) 1.0997(−5) 3.7709(−6) 1.5371(−6) 7.1799(−7) 3.8210(−7) 2.0317(−7) 1.3230(−7) 7.8442(−8)
9.6500(−5) 1.8700(−5) 5.7300(−6) 2.2300(−6) 1.0000(−6) 4.9300(−7) 2.5600(−7) 1.3600(−7) 7.1600(−8) 3.5300(−8)
210
T. T. V. Le et al.
Fig. 7 PDE: Mesh convergence of 1D-IRBF, HOCFD, CCIRBF and Present flat IRBF
4 Conclusion A 3-nodes IRBF method with flat kernels has been developed, and the proposed method is validated successfully through analytic solutions of second-order parabolic equations. It can be shown that this method can achieve sparse system matrix (95%) and high level of accuracy. When RBF width is at very large values, interpolation matrices become ill-conditioned, so some special treatments were positively proposed. Here, the extended precision approach is adopted. Improved accuracy and a reliable solution are accomplished with extended precision (at the cost of increasing computational resource). The proposed method employed with the extended precision approach is capable of generating a consistent solution over broad RBF width ranges. Acknowledgement This research is supported by Computational Engineering and Science Research Centre (CESRC), University of Southern Queensland, and Institute of Applied Mechanics and Informatics (IAMI), HCMC Vietnam Academy of Science and Technology (VAST).
References 1. Haykin, S.: Multilayer perceptrons. In: Neural Networks: A Comprehensive Foundation, pp. 156-255 (1999) 2. Mai-Duy, N., Tran-Cong, T.: Numerical solution of differential equations using multiquadric radial basis function networks. Neural Netw. 14(2), 185–199 (2001) 3. Mai-Duy, N., Le-Cao, K., Tran-Cong, T.: A Cartesian grid technique based on one-dimensional integrated radial basis function networks for natural convection in concentric annuli. Int. J. Numer. Meth. Fluids 57, 1709–1730 (2008)
A Radial Basis Neural Network Approximation ...
211
4. Fornberg, B., Wright, G.: Stable computation of multiquadric interpolants for all values of the shape parameter. Comput. Math. Appl. 48(5), 853–867 (2004) 5. Fornberg, B., Piret, C.: A stable algorithm for flat radial basis functions on a sphere. SIAM J. Sci. Comput. 30(1), 60–80 (2007) 6. Le-Cao, K., Mai-Duy, N., Tran, C.D., Tran-Cong, T.: Numerical study of stream-function formulation governing flows in multiply-connected domains by integrated RBFs and Cartesian grids. Comput. Fluids 44(1), 32–42 (2011) 7. Le-Cao, K., Mai-Duy, N., Tran-Cong, T.: An effective integrated-RBFN Cartesian-grid discretisation to the stream function-vorticity-temperature formulation in non-rectangular domains. Numer. Heat Transf. Part B 55, 480–502 (2009) 8. Le, T.T.V., Mai-Duy, N., Le-Cao, K., Tran-Cong, T.: A time discretisation scheme based on integrated radial basis functions (IRBFs) for heat transfer and fluid flow problems. Numer. Heat Transf. Part B (2018) 9. Tien, C.M.T., Thai-Quang, N., Mai-Duy, N., Tran, C.D., Tran-Cong, T.: A three-point coupled compact integrated RBF scheme for second-order differential problems. CMES Compu. Model. Eng. Sci. 104(6), 425–469 (2015) 10. Tian, Z., Liang, X., Yu, P.: A higher order compact finite difference algorithm for solving the incompressible Navier Stokes equations. Int. J. Numer. Meth. Eng. 88(6), 511–532 (2011)
Why Some Power Laws Are Possible and Some Are Not Edgar Daniel Rodriguez Velasquez, Vladik Kreinovich, Olga Kosheleva, and Hoang Phuong Nguyen
Abstract Many dependencies between quantities are described by power laws, in which y is proportional to x raised to some power a. In some application areas, in different situations, we observe all possible pairs (A, a) of the coefficient of proportionality A and of the exponent a. In other application areas, however, not all combinations (A, a) are possible: once we fix the coefficient A, it uniquely determines the exponent a. In such case, the dependence of a on A is usually described by an empirical logarithmic formula. In this paper, we show that natural scale-invariance ideas lead to a theoretical explanation for this empirical formula.
1 Formulation of the Problem Power Laws Are Ubiquitous. In many application areas, the dependence between two quantities x and y is described by the formula y = A · xa
(1)
E. D. Rodriguez Velasquez Department of Civil Engineering, Universidad de Piura in Peru (UDEP), Av. Ramón Mugica 131, Piura, Peru e-mail: [email protected]; [email protected] Department of Civil Engineering, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA V. Kreinovich (B) · O. Kosheleva University of Texas at El Paso, El Paso, TX 79968, USA e-mail: [email protected] O. Kosheleva e-mail: [email protected] H. P. Nguyen Division Informatics, Math-Informatics Faculty, Thang Long University, Nghiem Xuan Yem Road, Hoang Mai District, Hanoi, Vietnam © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_18
213
214
E. D. Rodriguez Velasquez et al.
for constants a and A. Such dependencies are known as power laws. Power laws are truly ubiquitous. Let us just give a few examples: • power laws describe how the aerodynamic resistance force depends on the plane’s velocity, • they describe how the perceived signal depends on the intensity of the signal that we hear and see, • they describe how the mass of celestial structures—ranging from small star clusters to galaxies to clusters of galaxies—depends on the structure’s radius, etc.; see, e.g., [3, 7]. Sometimes, Not All Power Laws Are Possible. The parameters A and a have to be determined from the experiment. In some application areas, all pairs (A, a) are possible. In some other applications areas, however, not all such pairs are possible. Sometimes, a is fixed, and A can take all possible values. In other application areas, we have different values of A—but for each A, we can only one have one specific value of a. One such example can be found in transportation engineering: it describes the dependence of number y of cycles until fatigue failure on the initial strain x; see, e.g., [2, 4–6, 8]. In many such situations, the value of a corresponding to A is determined by the following empirical formula [2, 4–6, 8]: a = c0 + c1 · ln(A).
(2)
Comment. The case when the value a is fixed can be viewed as a particular case of this empirical formula, corresponding to c1 = 0. Resulting Challenge. How can we explain the formula (2)? What We Do in This Paper. In this paper, we provide a theoretical explanation for this formula. To come up with this explanation, we recall the reason why power laws are ubiquitous in the first place—because they correspond to scale-invariant dependencies. We then use the scale-invariance idea to explain the ubiquity of the formula (2).
2 Power Laws and Scale Invariance: A Brief Reminder Scaling. The main purpose of data processing is to deal with physical quantities. However, in practice, we only deal with the numerical values of these quantities. What is the difference? The difference is that to get a numerical value, we need to select a measuring unit for measuring the quantity. If we replace the original measuring unit with a new one which is λ times smaller, then all numerical values are multiplied by λ: x → X = λ · x. For example, if we move from meters to centimeters,
Why Some Power Laws Are Possible and Some Are Not
215
all the numerical values will be re-scaled: multiplied by 100, e.g., 1.7 m becomes 1.7 · 100 = 170 cm. Scale-Invariance. In many application areas, there is no fixed measuring unit, the choice of the measuring unit is rather arbitrary. In such situations, it is reasonable to require that the dependence y = f (x) between the quantities x and y not depend on the choice of the unit. Of course, this does not mean that y = f (x) imply y = f (X ) = f (λ · x) for the exact same function f (x)—that would mean that f (λ · x) = f (x) for all x and λ, i.e., that f (x) is a constant and thus, that there is no dependence. What we need to do to keep the same dependence is to accordingly re-scale y, to Y = μ · y for some μ depending on λ. For example, the area y of a square is equal to the square of its size y = x 2 . This formula is true if we use meters to measure length and square meters to measure area. The same formula holds if we use centimeters instead of meters—but then, we should use square centimeters instead of square meters. In this case, λ = 100 corresponds to μ = 10000. So, we arrive at the following definition of scale-invariance: for every λ > 0 there exists a value μ > 0 for which, for every x and y, the relation y = f (x) implies that Y = f (X ) for X = λ · x and Y = μ · y. Scale-Invariance and Power Laws. It is easy to check that every power law is scale-invariant. Indeed, it is sufficient to take μ = λa . Then, from y = A · x a we get Y = μ · y = λa · y = λa · A · x a = A · (λ · x)a = A · X a , i.e., indeed Y = f (X ). It turns out that, vice versa, the only continuous scale-invariance dependencies are power laws; see, e.g., [1]. For differentiable functions f (x), this can be easily proven. Indeed, by definition, scale-invariance means that μ(λ) · f (x) = f (λ · x).
(3)
Since the function f (x) is differentiable, the function μ(λ) =
f (λ · x) f (x)
is also differentiable, as the ratio of two differentiable functions. Since both functions f (x) and μ(λ) are differentiable, we can differentiate both sides of the equality (3) with respect to λ: μ (λ) · f (x) = x · f (λ · x), where f , as usual, means the derivative. In particular, for λ = 1, we get μ0 · f (x) = def x · f (x), where we denoted μ0 = μ (1), i.e.,
216
E. D. Rodriguez Velasquez et al.
μ0 · f = x ·
df . dx
We can separate the variables x and f is we divide both sides by x · f and multiply both sides by d x, then we get df dx = μ0 · . f x Integrating both sides, we get ln( f ) = μ0 · ln(x) + c, where c is the integration constant. Thus, for f = exp(ln( f )), we get f (x) = exp(μ0 · ln(x) + c) = A · x a , def
def
where we denoted A = exp(c) and a = μ0 .
3 Main Idea and Resulting Explanation Main Idea. Since, in principle, for the corresponding application areas, we can have different values A and a, this means that the value of the quantity y is not uniquely determined by the value of the quantity x, there must be some other quantity z that influences y. In other words, we should have y = F(x, z).
(4)
for some function F(x, z). Different situations—i.e., different pairs (A, a)—are characterized by different values of the auxiliary quantity z. Main Assumption. The very fact that for each fixed z, the dependence of y on x is described by a power law means that when the value of z is fixed, the dependence of y on x is scale-invariant. It is therefore reasonable to conclude that, vice versa, for each fixed value x, the dependence of y on z is also scale-invariant. This Assumption Leads to the Desired Explanation of the Above Empirical Formula. Let us show that this assumption indeed explains the formula (2). Indeed, the fact that for each z, the dependence of y on x is described by the power law, with coefficients A and a depending on z, can be described as F(x, z) = A(z) · x a(z) .
(5)
Similarly, the fact that the dependence of y on z is scale-invariant means that for each x, the dependence of y on z can also described by the power law, with the coefficients depending on x: (6) F(x, z) = B(x) · z b(x) ,
Why Some Power Laws Are Possible and Some Are Not
217
for appropriate coefficients B(x) and b(x). By equating two different expressions (5) and (6) for F(x, z), we conclude that A(z) · x a(z) = B(x) · z b(x)
(7)
for all x and z. In particular, for x = 1, the formula (7) implies that A(z) = B(1) · z b(1) .
(8)
Similarly, for z = 1, the formula (7) implies that B(x) = A(1) · x a(1) .
(9)
Substituting expressions (8) and (9) into the formula (7), we conclude that B(1) · z b(1) · x a(z) = A(1) · x a(1) · z b(x) .
(10)
In particular, for x = e, we get B(1) · z b(1) · ea(z) = A(1) · ea(1) · z b(e) , hence exp(a(z)) =
A(1) · exp(a(1)) b(e)−b(1) ·z . B(1)
(11)
From the formula (8), we conclude that z b(1) =
A , B(1)
and thus, z=
A1/b(1) . B(1)1/b(1)
(12)
Substituting the expression (12) into the formula (11), we conclude that exp(a) =
1 A(1) · exp(a(1)) · · A(b(e)−b(1))/b(1) , B(1) B(1)(b(e)−b(1))/b(1)
i.e., that exp(a) = C0 · Ac1 for some values C0 and c1 . Taking logarithms of both sides, we now get the desired dependence a = c0 + c1 · ln(A), where we denoted def c0 = ln(C0 ). So, we indeed have the desired derivation.
218
E. D. Rodriguez Velasquez et al.
Acknowledgments This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References 1. Aczel, J., Dhombres, J.: Functional Equations in Several Variables. Cambridge University Press, Cambridge (1989) 2. Cooper, K.E., Pell, P.S.: The effect of mix variables on the fatigue strength of bituminous materials. Report LR 663, UK Transport and Road Research Laboratory (1974) 3. Feynman, R., Leighton, R., Sands, M.: The Feynman Lectures on Physics. Addison Wesley, Boston (2005) 4. Myre, J.: Fatigue of asphalt pavements. In: Proceedings of the Third International Conference on Bearing Capacity of Roads and Airfields. Norwegian Institute of Technology (1990) 5. Priest, A.L., Timm, D.H.: Methodology and calibration of fatigue transfer functions for mechanistic-empirical flexible pavement design. NCAT report 06-03, US National Center for Asphalt Technology (NCAT), Auburn, Alabama, December 2006 6. Rauhut, J.B., Lytton, R.L., Darter, M.I.: Pavement damage functions for cost allocations, vol. 2, descriptions of detailed studies. Report FHWA-RD-84-019, Federal Highway Administration (FHWA), Washington, D.C. (1984) 7. Thorne, K.S., Blandford, R.D.: Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics. Princeton University Press, Princeton (2017) 8. Timm, D.H., Newcomb, D.E., Birgisson, B.: Mechanistic-empirical flexible pavement thickness design: the Minnesota method, Minnesota Department of Transportation, Staff paper MN/RC-P99-10, St. Paul, Minnesota (1999)
How to Estimate the Stiffness of a Multi-layer Road Based on Properties of Layers: Symmetry-Based Explanation for Odemark’s Equation Edgar Daniel Rodriguez Velasquez, Vladik Kreinovich, Olga Kosheleva, and Hoang Phuong Nguyen Abstract When we design a road, we would like to check that the current design provides the pavement with sufficient stiffness to withstand traffic loads and climatic conditions. For this purpose, we need to estimate the stiffness of the road based on stiffness and thickness of its different layers. There exists a semi-empirical formula for this estimation. In this paper, we show that this formula can be explained by natural scale-invariance requirements.
1 Formulation of the Problem Need to Estimate Stiffness of Multi-layer Roads. Most roads consist of several layers: • First, there is a layer of soil—if needed, stabilized by adding lime, cement, etc. • Then there is a layer—usually compacted—of crushed rocks. • Finally, an asphalt or concrete layer is placed on top. The road has to have a certain stiffness, i.e., a certain value of the modulus characterizing this stiffness. It is therefore desirable to estimate the stiffness of the designed E. D. Rodriguez Velasquez Department of Civil Engineering, Universidad de Piura in Peru (UDEP), Av. Ramón Mugica 131, Piura, Peru e-mail: [email protected]; [email protected] Department of Civil Engineering, University of Texas at El Paso, 500 W. University El Paso, El Paso, TX 79968, USA V. Kreinovich (B) · O. Kosheleva University of Texas at El Paso, El Paso, TX 79968, USA e-mail: [email protected] O. Kosheleva e-mail: [email protected] H. P. Nguyen Division Informatics, Math-Informatics Faculty, Thang Long University, Nghiem Xuan Yem Road Hoang Mai District, Hanoi, Vietnam © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_19
219
220
E. D. Rodriguez Velasquez et al.
road with the layers of given thickness. In other words, we need to be able to solve the following problem: • we know the modulus E i and the thickness h i of each layer; • based on this information, we need to estimate the overall modulus E of the road. Odemark’s Equation. One of the methods for solving this problem was proposed in 1949 by N. Odemark [4]; his formula is ⎛ ⎜ E =⎝
i
√ ⎞3 h i · 3 Ei ⎟ ⎠ . hi
(1)
i
This formula is still in use; see, e.g., [3, 6]. Comment. The formula (1) corresponds to the usual case when different layers have similar Poisson ratios. For situations when the Poisson ratios of different layers are significantly different, there are more accurate versions of this formula. How Can We Explain This Formula. Odemark’s formula is based on a simplified mechanical model of the pavement, where many important factors are ignored to make a simple formula possible. In principle, several different simplifications are possible; the formula produced by this particular simplification has been confirmed by empirical data. How can we explain this formula? What We Do in This Paper. In this paper, we provide a theoretical explanation for this formula, an explanation based on the ideas of symmetry – namely, on the ideas of scale-invariance.
2 Scale-Invariance: Reminder To measure a physical quantity, we need to select a measuring unit. In some cases, there is a physically natural unit—e.g., in the micro-world, we can use the electric charge of an electron as a natural measuring unit for electric charges. However, in many other situations, there is no such fixed unit. In such cases, it is reasonable to require that the dependence between the physical properties remains the same—i.e., described by the same formula—if we change the measuring unit. If we replace the original measuring unit with a unit which is λ times smaller, then all numerical values of the quantity will be multiplied by λ: x → λ · x. This transformation is known as re-scaling, and invariance with respect to this transformation is known as scale-invariance. Scale invariance is ubiquitous in physics; see, e.g., [2, 5].
How to Estimate the Stiffness of a Multi-Layer Road ...
221
3 Towards an Explanation Analysis of the Problem. Let us first consider the simplified case when all the layers have the same thickness. The overall stiffness E is the “average” stiffness, i.e., the stiffness that the road would have if all its layers have the same stiffness E. Let us denote the overall effect of n layers with stiffness E 1 , . . . , E n by E1 ∗ . . . ∗ En , for an appropriate combination operation a ∗ b. In these terms, the stiffness of the nlayer road in which each layer has stiffness E is described by the formula E ∗ . . . ∗ E. Thus, the desired overall effect E can be described by the formula E ∗ . . . ∗ E = E1 ∗ . . . ∗ En .
(2)
The air layer with 0 stiffness does not contribute to the overall stiffness, so we should have a ∗ 0 = a. If we have layers of different thickness h i , then we can divide each of these layers into parts of the same thickness, and apply the same formula (2), i.e., we get E ∗ . . . ∗ E (h 1 + . . . + h n times) = E 1 ∗ . . . ∗ E 1 (h 1 times) ∗ . . . ∗ E n ∗ . . . ∗ E n (h n times).
(3)
Natural Properties of the Combination Operation a ∗ b. In the first approximation, we can ignore the dependence on the order, and assume that a ∗ b = b ∗ a, i.e., assume that the combination operation is commutative. It is also reasonable to assume that the result of applying this operation to a 3-layer road does not depend on which layer we start with, i.e., that we should have a ∗ b ∗ c = (a ∗ b) ∗ c = a ∗ (b ∗ c). In other words, the combination operation should be associative. If we made one the layers stiffer, the stiffness of the road should increase. So, the combination operation should be strictly monotonic: if a < a , then a ∗ b < a ∗ b. Small changes in E i should lead to small changes in the overall stiffness. In mathematical terms, this means that the combination operation should be continuous. Finally, we require that the combination operation be scale-invariant, i.e., that if a ∗ b = c, then, for every λ, we should have the same relation for re-scaled values λ · a, λ · b, and λ · c: (λ · a) ∗ (λ · b) = λ · c. (4)
222
E. D. Rodriguez Velasquez et al.
Main Result. We will show that every commutative, associative, strictly monotonic, continuous, and scale-invariant combination operation for which a ∗ 0 = a has the form (5) a ∗ b = (a p + b p )1/ p for some p > 0. Discussion. In other words, a ∗ b = c is equivalent to a p + b p = c p , and, more generally, that a ∗ . . . ∗ b = c means that a p + . . . + b p = c p . In view of the formula (3), this means that
p hi · E p = h i · Ei , i
i
hence
⎞ p 1/ p h i · Ei ⎟ ⎠ . hi
⎛ ⎜ E =⎝
i
i
For p = 1/3, we get exactly Odemark’s formula!
4 Proof of the Main Result 1◦ . Let us first prove that the operation a ∗ b has the form f ( f −1 (a) + f −1 (b)) for some monotonic function f (a). In other words, we want to prove that a ∗ b = c is equivalent to (6) f −1 (a) + f −1 (b) = f −1 (c), or, equivalently, that f (a) ∗ f (b) = f (c) is equivalent to a + b = c, i.e., that f (a + b) = f (a) ∗ f (b).
(7)
def
Indeed, let us take f (1) = 1. Then, for every natural number m, we take def
f (m) = 1 ∗ . . . ∗ 1 (m times). In this case indeed, f (m) ∗ f (m ) = 1 ∗ . . . ∗ 1 (m times) ∗ 1 ∗ . . . ∗ 1 (m times) = 1 ∗ . . . ∗ 1 (m + m times) = f (m + m ), i.e., we have the desired property (7).
(8)
How to Estimate the Stiffness of a Multi-Layer Road ...
223
Due to monotonicity, for each natural number n, we have 0 ∗ . . . ∗ 0 (n times) = 0 < 0 ∗ . . . ∗ 0 ∗ 1 = 1, and 1 = 0 ∗ . . . ∗ 0 ∗ 1 < 1 ∗ . . . ∗ 1 (n times). From 0 ∗ . . . ∗ 0 (n times) < 1 < 1 ∗ . . . ∗ 1 (n times) and continuity of the combination operation, we conclude that there exists a value vn for which vn ∗ . . . ∗ vn (n times) = 1. 1 = vn . f n
We will then take
We will then define f
m n
1 1 ∗ ... ∗ f (m times). = f n n
def
One can check that for thus defined function f (a), we indeed always have the formula (7) for rational values a and b, and by continuity, we can extend the function f (a) to all non-negative real values a. 2◦ . Let us now prove that the inverse function f −1 (a) is a power function – and thus, its inverse is also a power function. Indeed, in terms of the formula (6), scale-invariance means that if the formula (6) is satisfied, then we have f −1 (λ · a) + f −1 (λ · b) = f −1 (λ · c). def
def
def
(9)
Let us denote p = f −1 (a), q = f −1 (b), r = f −1 (c), so that a = f ( p), b = f (q), def and c = f (r ). Let us also denote tλ (x) = f −1 (λ · f (x)), so that tλ ( p) = f −1 (λ · f ( p)) = f −1 (λ · a), tλ (q) = f −1 (λ · f (q)) = f −1 (λ · b), and
tλ (r ) = f −1 (λ · f (r )) = f −1 (λ · c).
224
E. D. Rodriguez Velasquez et al.
In this form, scale-invariance takes the following form: if p + q = r , then tλ ( p) + tλ (q) = tλ (r ). In other words, we have tλ ( p + q) = tλ ( p) + tλ (q) for all p and q. For integer values p = n, we thus have 1 1 1 + . . . + tλ (n times) = n · tλ , tλ (1) = tλ n n n 1 1 = · tλ (1). tλ n n
Thus
Similarly, for every m, we have tλ
m n
= tλ
1 1 1 m + . . . + tλ (m times) = m · tλ = · tλ (1). n n n n
In other words, we conclude that tλ (x) = x · tλ (1)
(10)
for all rational x. By continuity, we can conclude that this property holds for all real values as well. By definition of tλ (x), the equality (10) means that f −1 (λ · f (x)) = tλ (1) · x, i.e., that for y = f (x), for which x = f −1 (y), we have f −1 (λ · y) = tλ (1) · f −1 (y).
(11)
It is known (see, e.g., [1]) that every continuous solution to this functional equation has the form f −1 (x) = A · x a for some A and a. Thus, we get the desired formula for the combination operation a ∗ b = f ( f −1 (a) + f −1 (b)). The result is proven. Comment. The result from [1] can be easily proven, if instead of continuity, we make a stronger assumption that the combination operation – and thus, the function f (a) – is differentiable. Indeed, in this case, tλ (1) is a differentiable function of λ, as a ratio of two differentiable functions. Thus, we can differentiate both sides of the equality (11) by λ and take λ = 1; then, we get x · F (x) = c · F, def
(12)
where F(x) = f −1 (x), F (x) means the derivative, and c is the derivative of the expression tλ (1) when λ = 1. The formula (12) can be rewritten as x·
dF = c · F, dx
How to Estimate the Stiffness of a Multi-Layer Road ...
225
i.e., equivalently, dx dF =c· . F x Integrating both parts, we get ln(F) = c · ln(x) + C, where C is the integration constant. Applying exp(z) to both sides, we get the desired power law. Acknowledgment This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science) and HRD-1242122 (Cyber-ShARE Center of Excellence).
References 1. Aczel, J., Dhombres, J.: Functional Equations in Several Variables. Cambridge University Press, Cambridge (1989) 2. Feynman, R., Leighton, R., Sands, M.: The Feynman Lectures on Physics. Addison Wesley, Boston (2005) 3. Guzzarlapudi, S.D., Adigopula, V.K., Kumar, R.: Comparative study of flexible pavement layers moduli backcalculation using approximate and static approach. In: Thomas, B.S., Vyas, R., Goswami, P.K., Cseteny, L.J. (eds.) Proceedings of the International Conference on Recent Trends in Engineering and Material Sciences (ICEMS’2016), Jaipur, India, 17–19 March 2016, Published in Material Today Proceedings, 2017, vol. 4, no. 9, pp. 9812–9816 (2016) 4. Odemark, N.: Investigations as to the Elastic Properties of Soils and Design of Pavements According to the Theory of Elasticity. Staten Vaeginstitut, Stockholm (1949) 5. Thorne, K.S., Blandford, R.D.: Modern Classical Physics: Optics, Fluids, Plasmas, Elasticity, Relativity, and Statistical Physics. Princeton University Press, Princeton (2017) 6. US Federal Highway Administration (FHWA): Mechanistic Evaluation of Test Data from LTTP Flexible Pavement Test Sections, vol. 1, Publication FHWA-RD-98-012, April 1998
Need for Diversity in Elected Decision-Making Bodies: Economics-Related Analysis Nguyen Ngoc Thach, Olga Kosheleva, and Vladik Kreinovich
Abstract On a qualitative level, everyone understands the need to have diversity in elected decision-making bodies, so that the viewpoint of each group be properly taken into account. However, when only the usual economic criteria are used in this election—e.g., in the election of company’s board—the resulting bodies often underrepresent some groups (e.g., women). A frequent way to remedy this situation is to artificially enforce diversity instead of strictly following purely economic criteria. In this paper, we show the current seeming contradiction between economics and diversity is caused by the imperfection of the used economic models: in an accurate economics-related decision making model, optimization directly implies diversity.
1 Diversity and Economics-Related Decision Making: Formulation of the Problem Need for Elected Bodies. In a small community or a small company, decisions can be made by all people getting together. This is how decisions are usually made in a university’s department—by having a faculty meeting, so that each faculty member has a chance to express his or her opinion, and these opinions are taken into account when making a decision. However, for a larger group—e.g., for all the university’s faculty—there are already so many folks that it is not possible to give everyone a chance to talk. In such situations, a usual idea is to elect a decision-making body.
N. Ngoc Thach Institute for Research Science and Banking Technology, Banking University Ho Chi Minh City, 39 Ham Nghi Street, District 1, Ho Chi Minh City 71010, Vietnam e-mail: [email protected] O. Kosheleva · V. Kreinovich (B) University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA e-mail: [email protected] O. Kosheleva e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_20
227
228
N. Ngoc Thach et al.
Cities elect city councils, countries elect parliaments, shareholders elect a company’s board, university faculty elect a faculty senate, etc. Diversity in Elected Bodies Is Important. Populations are diverse, we have people of different ethnicity, different gender, different ages, etc. These people have somewhat different agenda, somewhat different preferences. It is desirable that the opinions of each group are taken into account when decisions are made. For this purpose, it is desirable that all these groups are properly represented in the elected body. Even with the most democratic election procedures, however, some groups are under-represented. For example, women are under-represented on the boards of most companies and in most countries’ parliaments. How can improve this situation? Usual Approach: Enforce Diversity by Limiting Democracy. The fact that some groups are under-represented in democratically elected decision-making bodies is usually interpreted as the need to enforce diversity by limiting democracy. For example, some countries and some companies have a quota on female representation—and on representation on other under-represented groups. The Problem with the Usual Approach. In many cases, elections are based on economics-related criteria. For example, when shareholders elect board members, their main objective is to maintain the economic prosperity of the company. So, naturally, they elect candidates who have shown to be successful in economic leadership. For cities and countries, this is also largely true: usually, leaders who lead to economic prosperity are re-elected, while leaders under whom economy tanked are not re-elected. From this viewpoint, it seems clear what we want from an elected body: there are economic criteria that we want to impose. The need to enforce diversity disrupts this straightforward idea. It is no longer clear what should we optimize, how should we combine traditional economic criteria with this new diversity requirements. But Is Diversity Indeed Inconsistent with Economics? Many folks argue—in our opinion, convincingly—that diversity actually helps economy. Their arguments are very straightforward: economics is complicated and very competitive. To make economy successful, we need to use all the talent we have. If in some country, citizens, e.g., consistently ignore females and only elect male board members and male CEOs, they are not using half of the country’s talent—and, as a result, this country will eventually lose competition with countries that utilize all their talent. From this viewpoint, diversity is not only consistent with economics—it should follow from the economic considerations. How Can We Translate this Informal Argument into a Precise Model? Informally, the above argument makes sense, but the existing economic considerations still lead to under-representation of different groups. How can we translate the above informal argument into a precise model? This is what we attempt to do in this paper.
Need for Diversity in Elected Decision-Making Bodies ...
229
2 Accurate Economics-Related Decision Making Model and How Its Optimization Implies Diversity Individual Decision Making According to Decision Theory. The traditional decision theory describes how a rational person should make decisions. Reasonable rationality criteria lead to the conclusions: • that preferences of a rational decision maker can be described by a special function u(x) called utility function, and • that out of several alternatives a, the rational decision maker should select the one with the largest value of expected utility. def
u(a) = E a [u(x)] =
p j (a) · u(x j ),
(1)
where x_j are possible consequences of making the decision a and p_j(a) is the probability of the consequence x_j; see, e.g., [1–3, 6–8].

Utility Is Defined Modulo a Linear Transformation. Utilities are determined modulo a linear transformation u → a · u + b. Usually, when we make a decision, there is a status quo situation whose utility can be taken as 0. If we use this status quo situation as a starting point, then the only remaining transformations are transformations of the type u → a · u.

Group Decision Making. What if a group of n people needs to make a decision? For each participant i, and for each alternative a, we can determine the expected utility u_i(a) of this participant corresponding to the alternative a. So, each alternative is characterized by a tuple U(a) = (u_1(a), ..., u_n(a)). Based on these tuples, we need to decide which alternative is better for the group. Since the utility of each participant i is defined modulo a linear transformation u_i → a_i · u_i, it is reasonable to require that the comparison between two tuples U(a) and U(b) not change if we apply such transformations. It turns out that this natural requirement uniquely determines group decision making—namely, we should select an alternative for which the product

$$\prod_{i=1}^{n} u_i(a) \qquad (2)$$
of expected utilities is the largest; see, e.g., [1–8]. This criterion was first formulated by the Nobelist John Nash in [4]. It is therefore known as Nash’s bargaining solution. Analysis of the Situation. Before we go into a more serious analysis, let us first mention that there is a computational problem related to the direct use of the formula (2). Indeed, the population size n is usually large—since, as we have mentioned, the
very need for an elected body only appears when n is large. In the computer, a product of a large number of values very quickly leads either to a number which is too small to be represented in a computer, or to a number which is too large. For example, in a city of 1 million people, if u_i = 2, we get the value 2^{1,000,000}, which is too large, and if u_i = 1/2, we get the value 2^{−1,000,000}, which is too small. The usual way to avoid this computational problem is to use logarithms, since the logarithm of a product is equal to the sum of the logarithms. From this viewpoint, maximizing the expression (2) is equivalent to maximizing its logarithm

$$\sum_{i=1}^{n} \ln(u_i(a)). \qquad (3)$$

Adding millions of numbers may also lead to computational problems, so an even better idea is to divide the expression (3) by n and thus, to maximize the average instead of the sum:

$$\frac{1}{n} \cdot \sum_{i=1}^{n} \ln(u_i(a)). \qquad (4)$$

In these terms, each person i is characterized by a unique tuple L_i formed by the values

$$L_i(a) \stackrel{\text{def}}{=} \ln(u_i(a)) \qquad (5)$$

corresponding to different alternatives a. Since the population size n is large, we can say that we have a probability distribution ρ(L) on the set of all such tuples L—just like we can say that there is a probability distribution of people by age, by height, or by weight. In terms of the probability distribution, the average (4) of the values (5) can be described as the expected value

$$\bar{L}(a) = \int \rho(L) \cdot L(a)\, dL. \qquad (6)$$
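To illustrate this computational point, here is a minimal numerical sketch (ours, not taken from the paper): it compares the direct product (2) with the averaged-log criterion (4) on hypothetical utility values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expected utilities u_i(a) for n people and 3 alternatives.
n = 100_000
utilities = rng.uniform(0.5, 2.0, size=(3, n))   # rows: alternatives a, columns: people i

# Direct use of formula (2): the product overflows (or underflows) for large n.
direct = np.prod(utilities, axis=1)              # becomes inf or 0.0 in float64

# Criterion (4): average of logarithms, numerically stable.
avg_log = np.mean(np.log(utilities), axis=1)

print("direct products:", direct)                # useless for comparing alternatives
print("average log-utilities:", avg_log)
print("alternative selected by Nash's criterion:", np.argmax(avg_log))
```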
What Happens If We Select a Decision-Making Body. The formula (6) describes the ideal decision making, when the opinion of each person is explicitly taken into account. As we have mentioned, for large n, this is not realistically possible. Instead, we elect a decision-making body, and this body makes decisions. In the ideal world, decisions of this body also follow Nash's bargaining solution—i.e., equivalently, this body selects an alternative that maximizes the expected value

$$\bar{L}_B(a) = \int \rho_B(L) \cdot L(a)\, dL, \qquad (7)$$
where the probability measure ρ B (L) describes the distribution of tuples L among the members of the elected body.
What We Want. We want to make sure that the decisions of the elected body reflect the opinions of the people. In other words, we want to make sure that decisions based on the value (7) coincide with (or at least are close to) decisions based on the value (6). Thus, for every alternative a, the values (6) and (7) of the corresponding criteria must coincide—or at least be close to each other.

This Leads to Diversity. The only way to guarantee that the values (6) and (7) always coincide (or are close) is to make sure that the corresponding probability measures ρ(L) and ρ_B(L) coincide (or are close). In other words, for each group of people characterized by special values of the tuple L, the proportion of this group's representatives in the elected body (as described by the probability measure ρ_B(L)) should be close to the proportion of this group in the population as a whole (as described by the probability measure ρ(L)). This is exactly what perfect diversity looks like. So indeed, for an accurate economics-related description of decision making, optimization leads to diversity.

Acknowledgment This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), HRD-1834620 (CAHSI Includes), and HRD-1242122 (Cyber-ShARE Center of Excellence).
References

1. Fishburn, P.C.: Utility Theory for Decision Making. Wiley, New York (1969)
2. Kreinovich, V.: Decision making under interval uncertainty (and beyond). In: Guo, P., Pedrycz, W. (eds.) Human-Centric Decision-Making Models for Social Sciences, pp. 163–193. Springer (2014)
3. Luce, R.D., Raiffa, H.: Games and Decisions: Introduction and Critical Survey. Dover, New York (1989)
4. Nash, J.: The bargaining problem. Econometrica 18(2), 155–162 (1950)
5. Nguyen, H.P., Bokati, L., Kreinovich, V.: New (simplified) derivation of Nash's bargaining solution. J. Adv. Comput. Intell. Intell. Inform. (JACIII)
6. Nguyen, H.T., Kosheleva, O., Kreinovich, V.: Decision making beyond Arrow's 'impossibility theorem', with the analysis of effects of collusion and mutual attraction. Int. J. Intell. Syst. 24(1), 27–47 (2009)
7. Nguyen, H.T., Kreinovich, V., Wu, B., Xiang, G.: Computing Statistics under Interval and Fuzzy Uncertainty. Springer, Heidelberg (2012)
8. Raiffa, H.: Decision Analysis. McGraw-Hill, Columbus (1997)
Fuzzy Transform for Fuzzy Fredholm Integral Equation Irina Perfilieva and Pham Thi Minh Tam
Abstract This work pursues two goals: to solve a fuzzy Fredholm integral equation of the second kind and to propose a new numerical method inspired by the theory of fuzzy (F-) transforms. This approach allows us to transform the fuzzy Fredholm integral equation into a system of algebraic equations. The solution to this algebraic system determines an approximate solution to the original problem. The conditions of the existence and uniqueness of an exact solution are proved. Keywords Fredholm integral equation · F-transform · Fuzzy valued function
1 Introduction

Fuzzy-valued functions are attracting researchers from the AI community due to the growing role of multidimensional systems in data and, in particular, in image analysis; see [1, 2, 5–7, 11, 12]. The latter can be explained by the fact that the rendering of images using fuzzy functions is less sensitive to arbitrary distortion than the bitmap representation [6]. The use of these functions gives a new impetus to the development of traditional mathematical models of various dynamical systems, especially those affected by ill-posedness, response with a time delay, and other instability factors [1]. Due to their dynamic nature, these models are based on differential or integral equations, so their analysis has been adapted to new spaces of fuzzy objects. Traditionally, dynamical systems are analyzed in spaces of real or complex-valued functions, which leads to the consideration of functions with fuzzy values defined on real numbers. The latter are called fuzzy numbers [4, 13], and their definition is more specific than the definition of a fuzzy set.

I. Perfilieva (B) · P. T. M. Tam
Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, 30. dubna 22, Ostrava, Czech Republic
e-mail: [email protected]
P. T. M. Tam
e-mail: [email protected]
The proposed contribution is focused on fuzzy Fredholm integral equations of the second kind

$$y(t) = f(t) + \lambda \int_a^b k(t, s)\, y(s)\, ds,$$
where f is a given fuzzy-valued function, y is an unknown fuzzy-valued function, and the other parameters are real-valued. In general, Fredholm integral equations belong to the most important ones because of their numerous applications in physics and biology, optimal control systems, communication theory, mathematical economics, etc. Let us remark that due to fuzziness, these equations are more complicated than their classical prototypes and are solved numerically. The known methods are divided into two groups: iterative computation [3, 14] and approximation of a solution by simple functions [11, 12]. Our research is in the second group. The main difference between the methods from [11, 12] and our innovation is the construction of a set of approximations (of the solution and of the other functions involved) on overlapping subdomains that make up the so-called fuzzy partition of the entire area. This allows using elementary approximating functions and obtaining the required quality of approximation by increasing the number of subdomains. In more detail, we replace the involved fuzzy-valued functions by real vector functions and reduce the original problem to a system of integral equations over ordinary functions. Then we propose a numerical solution based on the theory of fuzzy (F-) transforms. This theory successfully combines fuzzy and conventional methods and extends the latter with modern artificial intelligence techniques.
1.1 Fuzzy Numbers and Fuzzy Arithmetic Operations

Definition 1 A fuzzy number u on ℝ in parametric form is an ordered pair $u = (\underline{u}, \overline{u})$ of two real functions $\underline{u}, \overline{u} : [0, 1] \to \mathbb{R}$ that satisfy the following requirements:
1. $\underline{u}$ is bounded, monotonically increasing, left continuous on (0, 1] and right continuous at 0,
2. $\overline{u}$ is bounded, monotonically decreasing, left continuous on (0, 1] and right continuous at 0,
3. $\underline{u}(1) \le \overline{u}(1)$.

For arbitrary fuzzy numbers $u = (\underline{u}, \overline{u})$, $v = (\underline{v}, \overline{v})$ and real k ∈ ℝ, addition u + v and multiplication ku are defined pointwise by:
• $\underline{(u + v)}(r) = \underline{u}(r) + \underline{v}(r)$, 0 ≤ r ≤ 1,
• $\overline{(u + v)}(r) = \overline{u}(r) + \overline{v}(r)$, 0 ≤ r ≤ 1,
• $\underline{ku}(r) = k\underline{u}(r)$, $\overline{ku}(r) = k\overline{u}(r)$, for k ≥ 0, 0 ≤ r ≤ 1,
• $\underline{ku}(r) = k\overline{u}(r)$, $\overline{ku}(r) = k\underline{u}(r)$, for k ≤ 0, 0 ≤ r ≤ 1.
It is worth noting that an ordinary (not fuzzy) real number c ∈ ℝ is a particular case of a fuzzy one, i.e., c = (c, c), and the representing pair (c, c) consists of two equal constant functions. Then, we have $\underline{c}(r) = \overline{c}(r) = c$. The set of all fuzzy numbers on ℝ is denoted by E. Below, we introduce the distance D on E.

Definition 2 For arbitrary fuzzy numbers u, v, the distance D(u, v) is given by

$$D(u, v) = \sup_{0 \le r \le 1} \max\{|\underline{u}(r) - \underline{v}(r)|,\ |\overline{u}(r) - \overline{v}(r)|\}.$$
Definition 3 Let f : [a, b] → E. For every partition P = {x_0, ..., x_n} of [a, b], where a = x_0 ≤ x_1 ≤ ... ≤ x_n = b, and for arbitrary ξ_i ∈ [x_{i−1}, x_i], 1 ≤ i ≤ n, let

$$R_P = \sum_{i=1}^{n} f(\xi_i)(x_i - x_{i-1}), \qquad \Delta_P := \max\{|x_i - x_{i-1}|,\ i = 1, \ldots, n\}.$$

Assume that there exists a fuzzy number I with the following property: for arbitrary ε > 0, there exists γ > 0, such that for any partition P with Δ_P < γ, we have D(I, R_P) < ε. We say that I is a definite integral of f over [a, b] and denote it as

$$I = \int_a^b f(t)\, dt.$$
If a fuzzy-number-valued function f is continuous in the metric D, its definite integral exists and

$$\int_a^b f(t)\, dt = \left( \int_a^b \underline{f}(t, r)\, dt,\ \int_a^b \overline{f}(t, r)\, dt \right),$$

where $(\underline{f}(t, r), \overline{f}(t, r))$ is the parametric form of f(t). Further, we will work with fuzzy-number-valued functions and call them fnv-functions.
1.2 Fuzzy Fredholm Integral Equations

We consider the fuzzy Fredholm integral equation (FFIE) of the second kind

$$y(t) = f(t) + \lambda \int_a^b k(t, s)\, y(s)\, ds, \qquad (1)$$
where we assume: λ = 1, a = 0, b = T > 0, the real-valued kernel k(t, s) is given on the domain D = [0, T] × [0, T], and the fnv-function f is given on [0, T] with values in E. The goal is to find an fnv-function y : [0, T] → E. Based on [11], the fuzzy Fredholm equation (1) can be rewritten as follows:

$$\underline{y}(t, r) = \underline{f}(t, r) + \int_0^T \left( k_+(t, s)\, \underline{y}(s, r) - k_-(t, s)\, \overline{y}(s, r) \right) ds,$$
$$\overline{y}(t, r) = \overline{f}(t, r) + \int_0^T \left( k_+(t, s)\, \overline{y}(s, r) - k_-(t, s)\, \underline{y}(s, r) \right) ds, \qquad (2)$$
where

$$k_+(t, s) = \begin{cases} k(t, s), & \text{if } k(t, s) \ge 0, \\ 0, & \text{otherwise}, \end{cases} \qquad k_-(t, s) = \begin{cases} -k(t, s), & \text{if } k(t, s) \le 0, \\ 0, & \text{otherwise}. \end{cases}$$
The system (2) can now be represented in terms of real-valued vector functions and written as

$$y(t, r) = f(t, r) + \int_0^T k(t, s)\, y(s, r)\, ds, \qquad (3)$$

where $y(t, r) = [\underline{y}(t, r), \overline{y}(t, r)]^T$, $f(t, r) = [\underline{f}(t, r), \overline{f}(t, r)]^T$ and

$$k(t, s) = \begin{pmatrix} k_+(t, s) & -k_-(t, s) \\ -k_-(t, s) & k_+(t, s) \end{pmatrix}.$$
1.3 F-Transform of Functions of Two Variables

In this section, we use the notion of the direct and inverse F-transforms introduced in [8], for two-variable functions.

Definition 4 (Fuzzy partition) Let n > 2, and let $a = x_0 < x_1 < \cdots < x_n = b$ be fixed nodes within [a, b] ⊆ ℝ. Fuzzy sets A_0, ..., A_n : [a, b] → [0, 1], identified with their membership functions defined on [a, b], establish an h-uniform fuzzy partition of [a, b], if for each k = 1, ..., n, the following conditions are fulfilled:
1. h = (b − a)/n, and x_k = x_{k−1} + h,
2. A_k(x) = 0 if x ∈ [a, x_{k−1}] ∪ [x_{k+1}, b],
3. A_k(x) is continuous,
4. A_k(x) > 0 if x ∈ (x_{k−1}, x_{k+1}),
5. for all k = 1, ..., n − 1, A_k(x) strictly increases on [x_{k−1}, x_k] and strictly decreases on [x_k, x_{k+1}],
6. for all k = 1, ..., n − 1, and for all x ∈ [0, h], A_k(x_k − x) = A_k(x_k + x),
7. for all k = 1, ..., n − 1 and for all x ∈ [x_{k−1}, x_{k+1}], A_k(x) = A_{k−1}(x − h),
8. A_0(x) strictly decreases on [x_0, x_1] and A_n(x) strictly increases on [x_{n−1}, x_n].

Remark: It is easy to see that if a fuzzy partition A_0, ..., A_n of [a, b] is h-uniform, then there exists an even function A : [−1, 1] → [0, 1] such that for all k = 1, ..., n − 1,

$$A_k(x) = A\!\left(\frac{x - x_k}{h}\right), \quad x \in [x_{k-1}, x_{k+1}],$$

and

$$A_0(x) = A\!\left(\frac{x - x_0}{h}\right), \ x \in [x_0, x_1], \qquad A_n(x) = A\!\left(\frac{x - x_n}{h}\right), \ x \in [x_{n-1}, x_n].$$
We call A a generating function of a uniform fuzzy partition.

Definition 5 (F-transform) Let f ∈ C[a, b], and let x_0, ..., x_n be h-equidistant nodes in [a, b], such that a = x_0, b = x_n. Let A_0(x), A_1(x), ..., A_n(x) be an h-uniform partition of [a, b]. Then (see [8]), the expression $\sum_{k=0}^{n} c_k A_k(x)$, where the coefficients

$$c_0 = \frac{2}{h}\int_{x_0}^{x_1} f(x) A_0(x)\, dx, \qquad c_n = \frac{2}{h}\int_{x_{n-1}}^{x_n} f(x) A_n(x)\, dx,$$
$$c_k = \frac{1}{h}\int_{x_{k-1}}^{x_{k+1}} f(x) A_k(x)\, dx \quad \text{for all } 1 \le k \le n-1,$$

are the F-transform components of f, is known as the inverse F-transform of f, and the corresponding function approximates f, so that

$$f(x) \approx \sum_{k=0}^{n} c_k A_k(x). \qquad (4)$$
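As a minimal numerical sketch of Definition 5 (ours, not from the paper), the following code computes the components c_k and the inverse F-transform (4), assuming the triangular generating function A(t) = max(0, 1 − |t|); the test function and the number of basic functions are arbitrary choices.

```python
import numpy as np

def f_transform_1d(f, a, b, n, num_points=4000):
    """Direct F-transform components c_0..c_n of f on [a, b] (Definition 5),
    using the triangular generating function A(t) = max(0, 1 - |t|)."""
    h = (b - a) / n
    nodes = a + h * np.arange(n + 1)
    x = np.linspace(a, b, num_points)
    dx = x[1] - x[0]
    fx = f(x)
    # Basic functions A_k evaluated on the grid (columns k = 0..n).
    basis = np.maximum(0.0, 1.0 - np.abs((x[:, None] - nodes[None, :]) / h))
    c = (basis * fx[:, None]).sum(axis=0) * dx / h   # (1/h) * integral of f*A_k
    c[0] *= 2.0    # boundary components use the factor 2/h
    c[-1] *= 2.0
    return nodes, c, basis

# Inverse F-transform (4): f(x) ~ sum_k c_k A_k(x).
a, b, n = 0.0, 2 * np.pi, 20
nodes, c, basis = f_transform_1d(np.sin, a, b, n)
x = np.linspace(a, b, 4000)
reconstruction = basis @ c
print("max reconstruction error:", np.max(np.abs(reconstruction - np.sin(x))))
```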
In [8], we estimated the quality of approximation of functions f ∈ C[a, b] and showed that at the nodes of an h-uniform partition it is of order O(h²). Below we follow [8] to extend the F-transform to functions of two variables. Let f(x, y) ∈ C([a, b] × [c, d]), n ≥ 2, and let the rectangle [a, b] × [c, d] ⊂ ℝ² have h₁-equidistant nodes x_0, x_1, ..., x_n and h₂-equidistant nodes y_0, y_1, ..., y_m, where a = x_0, x_n = b and c = y_0, y_m = d.
Let A_0(x), A_1(x), ..., A_n(x) be an h₁-uniform fuzzy partition with respect to x, and B_0(y), B_1(y), ..., B_m(y) be an h₂-uniform fuzzy partition with respect to y. Then the inverse F-transform $\hat{f}_{n,m}$ of f approximates f and is expressed by
$$\hat{f}_{n,m}(x, y) = \sum_{k=0}^{n} \sum_{l=0}^{m} c_{kl}\, A_k(x) B_l(y), \qquad (5)$$

where

$$c_{kl} = \frac{1}{h_1 h_2}\int_{x_{k-1}}^{x_{k+1}}\!\int_{y_{l-1}}^{y_{l+1}} f(x, y) A_k(x) B_l(y)\, dy\, dx, \quad 1 \le k \le n-1,\ 1 \le l \le m-1,$$
$$c_{k0} = \frac{2}{h_1 h_2}\int_{x_{k-1}}^{x_{k+1}}\!\int_{y_0}^{y_1} f(x, y) A_k(x) B_0(y)\, dy\, dx, \qquad c_{km} = \frac{2}{h_1 h_2}\int_{x_{k-1}}^{x_{k+1}}\!\int_{y_{m-1}}^{y_m} f(x, y) A_k(x) B_m(y)\, dy\, dx, \quad 1 \le k \le n-1,$$
$$c_{0l} = \frac{2}{h_1 h_2}\int_{x_0}^{x_1}\!\int_{y_{l-1}}^{y_{l+1}} f(x, y) A_0(x) B_l(y)\, dy\, dx, \qquad c_{nl} = \frac{2}{h_1 h_2}\int_{x_{n-1}}^{x_n}\!\int_{y_{l-1}}^{y_{l+1}} f(x, y) A_n(x) B_l(y)\, dy\, dx, \quad 1 \le l \le m-1,$$
$$c_{00} = \frac{4}{h_1 h_2}\int_{x_0}^{x_1}\!\int_{y_0}^{y_1} f(x, y) A_0(x) B_0(y)\, dy\, dx, \qquad c_{0m} = \frac{4}{h_1 h_2}\int_{x_0}^{x_1}\!\int_{y_{m-1}}^{y_m} f(x, y) A_0(x) B_m(y)\, dy\, dx,$$
$$c_{n0} = \frac{4}{h_1 h_2}\int_{x_{n-1}}^{x_n}\!\int_{y_0}^{y_1} f(x, y) A_n(x) B_0(y)\, dy\, dx, \qquad c_{nm} = \frac{4}{h_1 h_2}\int_{x_{n-1}}^{x_n}\!\int_{y_{m-1}}^{y_m} f(x, y) A_n(x) B_m(y)\, dy\, dx.$$
We denote

$$\varphi(x) = [A_0(x), A_1(x), A_2(x), \ldots, A_n(x)]^T, \qquad (6)$$
$$\psi(y) = [B_0(y), B_1(y), B_2(y), \ldots, B_m(y)]^T, \qquad (7)$$

and F = (c_{kl}) as a real matrix of size (n + 1) × (m + 1). Then by (5),

$$f(x, y) \approx \varphi^T(x)\, F\, \psi(y). \qquad (8)$$
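The matrix form (8) can be illustrated with a short numerical sketch (ours, under the assumption of triangular basic functions in both directions; the test function and grid sizes are arbitrary):

```python
import numpy as np

def tri_basis(x, nodes, h):
    """Matrix of triangular basic functions A_k(x), one column per node."""
    return np.maximum(0.0, 1.0 - np.abs((x[:, None] - nodes[None, :]) / h))

def f_transform_2d(f, a, b, c, d, n, m, q=400):
    """Components c_kl of (5); boundary rows/columns get the factors 2 and 4."""
    h1, h2 = (b - a) / n, (d - c) / m
    xn, ym = a + h1 * np.arange(n + 1), c + h2 * np.arange(m + 1)
    x, y = np.linspace(a, b, q), np.linspace(c, d, q)
    dx, dy = x[1] - x[0], y[1] - y[0]
    A, B = tri_basis(x, xn, h1), tri_basis(y, ym, h2)
    fxy = f(x[:, None], y[None, :])
    F = (A.T @ fxy @ B) * dx * dy / (h1 * h2)          # raw double integrals
    wx = np.where((np.arange(n + 1) % n) == 0, 2.0, 1.0)
    wy = np.where((np.arange(m + 1) % m) == 0, 2.0, 1.0)
    return xn, ym, F * wx[:, None] * wy[None, :]

# Reconstruction (8): f(x, y) ~ phi(x)^T F psi(y).
f = lambda x, y: np.sin(x) * np.cos(y)
xn, ym, F = f_transform_2d(f, 0.0, np.pi, 0.0, np.pi, 15, 15)
x0, y0 = 1.0, 2.0
phi = tri_basis(np.array([x0]), xn, xn[1] - xn[0])[0]
psi = tri_basis(np.array([y0]), ym, ym[1] - ym[0])[0]
print(phi @ F @ psi, "vs exact", f(x0, y0))
```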
2 Function Approximation

Focusing on solving the system (2), we first derive an approximate version of it, replacing all involved functions $y(t, r) = [\underline{y}(t, r), \overline{y}(t, r)]$, $f(t, r) = [\underline{f}(t, r), \overline{f}(t, r)]$ and $k_+(t, s)$, $k_-(t, s)$ by their corresponding inverse F-transforms. Since the domain in (2) is [0, T], we choose h₁-equidistant nodes t_0, t_1, ..., t_n, where 0 = t_0, t_n = T. Let A_0(t), A_1(t), ..., A_n(t) be an h₁-uniform fuzzy partition of [0, T] with generating function A : [−1, 1] → [0, 1] and the corresponding (n + 1)-dimensional vector φ as in (6). Similarly, we choose h₂-equidistant nodes r_0, r_1, ..., r_m in [0, 1], where 0 = r_0, r_m = 1, and establish an h₂-uniform fuzzy partition B_0(r), B_1(r), ..., B_m(r) of [0, 1] with generating function B : [−1, 1] → [0, 1] and the corresponding (m + 1)-dimensional vector ψ as in (7). Using (8), we obtain the following approximations for y, f, k₊, k₋:

$$y(t, r) \approx [\varphi^T(t)\, \underline{Y}\, \psi(r),\ \varphi^T(t)\, \overline{Y}\, \psi(r)] \quad \text{on } [0, T] \times [0, 1],$$
$$f(t, r) \approx [\varphi^T(t)\, \underline{F}\, \psi(r),\ \varphi^T(t)\, \overline{F}\, \psi(r)] \quad \text{on } [0, T] \times [0, 1],$$
$$k_+(t, s) \approx \varphi^T(t)\, K_1\, \varphi(s) \ \text{ and } \ k_-(t, s) \approx \varphi^T(t)\, K_2\, \varphi(s) \quad \text{on } [0, T] \times [0, T], \qquad (9)$$

where $\underline{Y}, \overline{Y}, \underline{F}, \overline{F}$ are (n + 1) × (m + 1) real matrices and K₁, K₂ are (n + 1) × (n + 1) real matrices.
2.1 Some Auxiliary Properties of φ and ψ

Theorem 1 Let φ be defined as in (6). Let the (n + 1) × (n + 1) matrix P be defined by

$$P := \int_0^T \varphi(t)\, \varphi^T(t)\, dt.$$

Then, the matrix components are specified as follows:

$$p_{i,i} = h_1 \alpha \ \text{ for all } 1 \le i \le n-1, \qquad p_{0,0} = p_{n,n} = h_1 \frac{\alpha}{2},$$
$$p_{i,i+1} = h_1 \beta \ \text{ for all } 0 \le i \le n-1, \qquad p_{i,i-1} = h_1 \beta \ \text{ for all } 1 \le i \le n, \qquad p_{i,j} = 0 \ \text{ otherwise}, \qquad (10)$$
where $\alpha = \int_{-1}^{1} A^2(t)\, dt$ and $\beta = \int_{0}^{1} A(t) A(1-t)\, dt$, and A is a generating function. Therefore, we have

$$P = \begin{pmatrix}
h_1 \frac{\alpha}{2} & h_1 \beta & 0 & 0 & \cdots & 0 & 0 \\
h_1 \beta & h_1 \alpha & h_1 \beta & 0 & \cdots & 0 & 0 \\
0 & h_1 \beta & h_1 \alpha & h_1 \beta & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & h_1 \alpha & h_1 \beta \\
0 & 0 & 0 & 0 & \cdots & h_1 \beta & h_1 \frac{\alpha}{2}
\end{pmatrix}.$$
Proof By Definition 4, for all t ∈ [0, t_{i−1}] ∪ [t_{i+1}, T], A_i(t) = 0, and $A_i(t) = A\!\left(\frac{t - t_i}{h_1}\right)$. Therefore,

$$p_{i,i} = \int_0^T A_i^2(t)\, dt = \int_{x_{i-1}}^{x_{i+1}} A_i^2(t)\, dt = h_1 \int_{-1}^{1} A^2(t)\, dt = h_1 \alpha \ \text{ for all } 1 \le i \le n-1,$$
$$p_{0,0} = \int_0^T A_0^2(t)\, dt = \int_{x_0}^{x_1} A_0^2(t)\, dt = h_1 \int_{0}^{1} A^2(t)\, dt = h_1 \frac{\alpha}{2},$$
$$p_{n,n} = \int_0^T A_n^2(t)\, dt = \int_{x_{n-1}}^{x_n} A_n^2(t)\, dt = h_1 \int_{-1}^{0} A^2(t)\, dt = h_1 \frac{\alpha}{2},$$
$$p_{i,i+1} = \int_0^T A_i(t) A_{i+1}(t)\, dt = \int_{x_i}^{x_{i+1}} A_i(t) A_{i+1}(t)\, dt = h_1 \int_0^1 A(t) A(1-t)\, dt = h_1 \beta \ \text{ for all } 0 \le i \le n-1,$$
$$p_{i,i-1} = \int_0^T A_i(t) A_{i-1}(t)\, dt = \int_{x_{i-1}}^{x_i} A_i(t) A_{i-1}(t)\, dt = h_1 \int_0^1 A(t) A(1-t)\, dt = h_1 \beta \ \text{ for all } 1 \le i \le n.$$
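Theorem 1 can be checked numerically; the sketch below (ours, not from the paper) uses the triangular generating function, for which α = 2/3 and β = 1/6, and compares the closed-form P from (10) with a direct discretization of the integral.

```python
import numpy as np

T, n = 1.0, 8
h1 = T / n
nodes = h1 * np.arange(n + 1)
alpha, beta = 2.0 / 3.0, 1.0 / 6.0   # values for A(t) = max(0, 1 - |t|)

# Closed-form P from (10): tridiagonal, with alpha/2 in the two corner entries.
P = h1 * (np.diag(np.full(n + 1, alpha)) +
          np.diag(np.full(n, beta), 1) + np.diag(np.full(n, beta), -1))
P[0, 0] = P[n, n] = h1 * alpha / 2

# Direct numerical evaluation of P = int_0^T phi(t) phi(t)^T dt.
t = np.linspace(0.0, T, 20001)
phi = np.maximum(0.0, 1.0 - np.abs((t[:, None] - nodes[None, :]) / h1))
P_num = (phi.T @ phi) * (t[1] - t[0])

print("max difference:", np.max(np.abs(P - P_num)))   # small discretization error
```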
For k = 0, ..., n, we denote ω_k : [0, T] → ℝ,

$$\omega_k(t) = \frac{t - t_k}{h_1} - \frac{1}{2}. \qquad (11)$$
For l = 0, ..., m, we denote ζ_l : [0, 1] → ℝ,

$$\zeta_l(r) = \frac{r - r_l}{h_2} - \frac{1}{2}. \qquad (12)$$
Lemma 1 Assume that A_0, ..., A_n is a fuzzy partition of [0, T] and ω_k is as in (11). Then

$$\int_{t_k}^{t_{k+1}} \omega_k(t)\, A_k(t)\, A_{k+1}(t)\, dt = 0 \ \text{ for all } 0 \le k \le n-1.$$
Proof We have

$$I = \int_{t_k}^{t_{k+1}} \left( \frac{t - t_k}{h_1} - \frac{1}{2} \right) A\!\left( \frac{t - t_k}{h_1} \right) A\!\left( 1 - \frac{t - t_k}{h_1} \right) dt = h_1 \int_{-1/2}^{1/2} t\, A\!\left( t + \frac{1}{2} \right) A\!\left( \frac{1}{2} - t \right) dt = h_1 \int_{-1/2}^{1/2} t\, g(t)\, dt,$$

where $g(t) = A\!\left(t + \frac{1}{2}\right) A\!\left(\frac{1}{2} - t\right)$. We have

$$g(-t) = A\!\left(-t + \frac{1}{2}\right) A\!\left(\frac{1}{2} + t\right) = g(t).$$

Then g(t) is an even function, so t·g(t) is odd. Hence, I = 0.

Lemma 2 Let $\omega_0(t) = \frac{t}{h_1} - \frac{1}{2}$. Then

$$I = \int_0^T \omega_0(t)\, A_0^2(t)\, dt \ne 0.$$
Proof We have

$$I = \int_0^T \omega_0(t)\, A_0^2(t)\, dt = \int_0^{h_1} \left( \frac{t}{h_1} - \frac{1}{2} \right) A^2\!\left( \frac{t}{h_1} \right) dt = h_1 \int_{-1/2}^{1/2} t\, A^2\!\left( t + \frac{1}{2} \right) dt$$
$$= h_1 \left( \int_{-1/2}^{0} t\, A^2\!\left( t + \frac{1}{2} \right) dt + \int_{0}^{1/2} t\, A^2\!\left( t + \frac{1}{2} \right) dt \right) = h_1 \int_0^{1/2} t \left( A^2\!\left( t + \frac{1}{2} \right) - A^2\!\left( \frac{1}{2} - t \right) \right) dt.$$

Since A strictly decreases on [0, 1], A² also decreases on [0, 1]. Hence, for all t ∈ (0, 1/2),

$$A^2\!\left( t + \frac{1}{2} \right) < A^2\!\left( \frac{1}{2} - t \right).$$

Then I < 0.

Corollary 1 Under the assumption of Lemma 2,

$$\int_0^1 \left( t - \frac{1}{2} \right) A^2(t)\, dt < 0.$$

Lemma 3 Let $\omega_n(t) = \frac{t - T}{h_1} - \frac{1}{2}$, where t ∈ [0, T]. Then

$$I = \int_0^T \omega_n(t)\, A_n^2(t)\, dt \ne 0.$$
Proof We have

$$I = \int_0^T \omega_n(t)\, A_n^2(t)\, dt = \int_{T - h_1}^{T} \left( \frac{t - T}{h_1} - \frac{1}{2} \right) A^2\!\left( \frac{t - T}{h_1} \right) dt = h_1 \int_{-1}^{0} \left( t - \frac{1}{2} \right) A^2(t)\, dt.$$

For all t ∈ (−1, 0), we have t − 1/2 < 0. Therefore, (t − 1/2) A²(t) < 0. Hence,

$$I = h_1 \int_{-1}^{0} \left( t - \frac{1}{2} \right) A^2(t)\, dt < 0.$$

Corollary 2 Under the assumption of Lemma 3,

$$\int_{-1}^{0} \left( t - \frac{1}{2} \right) A^2(t)\, dt < 0.$$

Lemma 4 Let $\omega_k(t) = \frac{t - t_k}{h_1} - \frac{1}{2}$, k = 1, ..., n − 1. Then

$$\int_0^T \omega_k(t)\, A_k^2(t)\, dt \ne 0.$$

Proof We have

$$\int_0^T \omega_k(t)\, A_k^2(t)\, dt = \int_{x_{k-1}}^{x_{k+1}} \left( \frac{t - t_k}{h_1} - \frac{1}{2} \right) A_k^2(t)\, dt = h_1 \int_{-1}^{1} \left( t - \frac{1}{2} \right) A^2(t)\, dt$$
$$= h_1 \left( \int_{-1}^{0} \left( t - \frac{1}{2} \right) A^2(t)\, dt + \int_{0}^{1} \left( t - \frac{1}{2} \right) A^2(t)\, dt \right) = I_1 + I_2,$$

where $I_1 = h_1 \int_{-1}^{0} \left( t - \frac{1}{2} \right) A^2(t)\, dt$ and $I_2 = h_1 \int_{0}^{1} \left( t - \frac{1}{2} \right) A^2(t)\, dt$.

By Corollary 2, we have I₁ < 0. By Corollary 1, we have I₂ < 0. Hence, the sum is strictly negative and the lemma follows.
Theorem 2 Let $\omega = (\omega_0 A_0, \omega_1 A_1, \omega_2 A_2, \ldots, \omega_n A_n)^T$, where ω_k is taken from (11). Then $Q = \int_0^T \omega(t)\, \varphi^T(t)\, dt$ is a lower triangular matrix with non-zero diagonal.

Proof We have

$$Q = \begin{pmatrix}
\int_0^T \omega_0 A_0^2\, dt & \int_0^T \omega_0 A_0 A_1\, dt & \cdots & \int_0^T \omega_0 A_0 A_n\, dt \\
\int_0^T \omega_1 A_1 A_0\, dt & \int_0^T \omega_1 A_1^2\, dt & \cdots & \int_0^T \omega_1 A_1 A_n\, dt \\
\vdots & \vdots & \ddots & \vdots \\
\int_0^T \omega_n A_n A_0\, dt & \int_0^T \omega_n A_n A_1\, dt & \cdots & \int_0^T \omega_n A_n^2\, dt
\end{pmatrix}.$$
Because for all t ∈ [0, t_{k−1}] ∪ [t_{k+1}, T], A_k(t) = 0, for all i, j ∈ {0, 1, 2, ..., n} such that |j − i| ≥ 2 we have

$$\int_0^T \omega_i A_i A_j\, dt = \int_0^T \omega_j A_j A_i\, dt = 0.$$

By Lemma 1, for all k = 0, ..., n − 1, we have

$$\int_0^T \omega_k A_k A_{k+1}\, dt = \int_{t_k}^{t_{k+1}} \omega_k A_k A_{k+1}\, dt = 0.$$

By Lemma 2, Lemma 3 and Lemma 4, for all 0 ≤ k ≤ n, we have

$$\int_0^T \omega_k A_k^2(t)\, dt \ne 0.$$

Therefore, Q is a lower triangular matrix and its determinant

$$|Q| = \prod_{k=0}^{n} \int_0^T \omega_k A_k^2\, dt \ne 0.$$
Therefore, Q is invertible.

Lemma 5 Assume that B_0, ..., B_m is a fuzzy partition of [0, 1] and ζ_l is as in (12). Then

$$\int_{r_l}^{r_{l+1}} \zeta_l(r)\, B_l(r)\, B_{l+1}(r)\, dr = 0 \ \text{ for all } 0 \le l \le m-1.$$
Lemma 6 Let $\zeta_0(r) = \frac{r}{h_2} - \frac{1}{2}$. Then

$$I = \int_0^1 \zeta_0(r)\, B_0^2(r)\, dr \ne 0.$$

Lemma 7 Let $\zeta_m(r) = \frac{r - 1}{h_2} - \frac{1}{2}$. Then

$$I = \int_0^1 \zeta_m(r)\, B_m^2(r)\, dr \ne 0.$$

Lemma 8 For all l = 1, ..., m − 1, let $\zeta_l(r) = \frac{r - r_l}{h_2} - \frac{1}{2}$. Then

$$\int_0^1 \zeta_l(r)\, B_l^2(r)\, dr \ne 0.$$
Theorem 3 Let $\zeta = (\zeta_0 B_0, \zeta_1 B_1, \zeta_2 B_2, \ldots, \zeta_m B_m)^T$, where ζ_l is taken from (12). Then $\hat{Q} = \int_0^1 \psi(r)\, \zeta^T(r)\, dr$ is an upper triangular matrix and invertible.

The proofs of Theorem 3 and Lemmas 5, 6, 7 and 8 are similar to the proofs of Theorem 2 and Lemmas 1, 2, 3 and 4.
3 General Scheme of the Proposed Method

Let us substitute approximations (9) into (2) and obtain

$$\begin{pmatrix} \varphi^T(t)\, \underline{Y}\, \psi(r) \\ \varphi^T(t)\, \overline{Y}\, \psi(r) \end{pmatrix} = \begin{pmatrix} \varphi^T(t)\, \underline{F}\, \psi(r) \\ \varphi^T(t)\, \overline{F}\, \psi(r) \end{pmatrix} + \int_0^T \begin{pmatrix} \varphi^T(t) K_1 \varphi(s) & -\varphi^T(t) K_2 \varphi(s) \\ -\varphi^T(t) K_2 \varphi(s) & \varphi^T(t) K_1 \varphi(s) \end{pmatrix} \begin{pmatrix} \varphi^T(s)\, \underline{Y}\, \psi(r) \\ \varphi^T(s)\, \overline{Y}\, \psi(r) \end{pmatrix} ds. \qquad (13)$$

For the first row of (13), we have

$$\varphi^T(t)\, \underline{Y}\, \psi(r) = \varphi^T(t)\, \underline{F}\, \psi(r) + \int_0^T \left( \varphi^T(t) K_1 \varphi(s)\, \varphi^T(s)\, \underline{Y}\, \psi(r) - \varphi^T(t) K_2 \varphi(s)\, \varphi^T(s)\, \overline{Y}\, \psi(r) \right) ds$$
$$= \varphi^T(t) \left( \underline{F} + K_1 \left( \int_0^T \varphi(s) \varphi^T(s)\, ds \right) \underline{Y} - K_2 \left( \int_0^T \varphi(s) \varphi^T(s)\, ds \right) \overline{Y} \right) \psi(r),$$

and by (10),

$$\varphi^T(t)\, \underline{Y}\, \psi(r) = \varphi^T(t) \left( \underline{F} + (K_1 P \underline{Y} - K_2 P \overline{Y}) \right) \psi(r). \qquad (14)$$

Multiplying (14) by ω(t) from the left and by ζ^T(r) from the right, we obtain

$$\omega(t)\, \varphi^T(t)\, \underline{Y}\, \psi(r)\, \zeta^T(r) = \omega(t)\, \varphi^T(t) \left( \underline{F} + (K_1 P \underline{Y} - K_2 P \overline{Y}) \right) \psi(r)\, \zeta^T(r).$$

Then, integrating with respect to t and r, we have

$$\left( \int_0^T \omega(t)\, \varphi^T(t)\, dt \right) \underline{Y} \left( \int_0^1 \psi(r)\, \zeta^T(r)\, dr \right) = \left( \int_0^T \omega(t)\, \varphi^T(t)\, dt \right) \left( \underline{F} + (K_1 P \underline{Y} - K_2 P \overline{Y}) \right) \left( \int_0^1 \psi(r)\, \zeta^T(r)\, dr \right)$$
$$\Leftrightarrow\ Q\, \underline{Y}\, \hat{Q} = Q \left( \underline{F} + (K_1 P \underline{Y} - K_2 P \overline{Y}) \right) \hat{Q}. \qquad (15)$$

By Theorem 2 and Theorem 3, we know that Q^{−1} and $\hat{Q}^{-1}$ exist. Multiplying (15) by Q^{−1} from the left and by $\hat{Q}^{-1}$ from the right, we obtain

$$\underline{Y} = \underline{F} + (K_1 P \underline{Y} - K_2 P \overline{Y}).$$
Repeating the same procedure with the second row of (13), we have

$$\begin{pmatrix} \underline{Y} \\ \overline{Y} \end{pmatrix} = \begin{pmatrix} \underline{F} \\ \overline{F} \end{pmatrix} + \begin{pmatrix} K_1 P \underline{Y} - K_2 P \overline{Y} \\ -K_2 P \underline{Y} + K_1 P \overline{Y} \end{pmatrix} = \begin{pmatrix} \underline{F} \\ \overline{F} \end{pmatrix} + \begin{pmatrix} K_1 P & -K_2 P \\ -K_2 P & K_1 P \end{pmatrix} \begin{pmatrix} \underline{Y} \\ \overline{Y} \end{pmatrix}.$$

Thus,

$$\begin{pmatrix} I - K_1 P & K_2 P \\ K_2 P & I - K_1 P \end{pmatrix} \begin{pmatrix} \underline{Y} \\ \overline{Y} \end{pmatrix} = \begin{pmatrix} \underline{F} \\ \overline{F} \end{pmatrix}. \qquad (16)$$
In the last section, we will prove that (16) is solvable.
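Once the component matrices are available, the block system (16) is an ordinary linear system. The following sketch (ours, not from the paper) shows how it could be assembled and solved with numpy; the matrices P, K₁, K₂ and the right-hand sides are hypothetical placeholders used only to illustrate the shapes.

```python
import numpy as np

def solve_block_system(K1, K2, P, F_lower, F_upper):
    """Solve (16): [[I - K1 P, K2 P], [K2 P, I - K1 P]] [Y_lower; Y_upper] = [F_lower; F_upper]."""
    n1 = P.shape[0]
    I = np.eye(n1)
    A = np.block([[I - K1 @ P, K2 @ P],
                  [K2 @ P, I - K1 @ P]])
    rhs = np.vstack([F_lower, F_upper])
    sol = np.linalg.solve(A, rhs)
    return sol[:n1], sol[n1:]          # Y_lower, Y_upper

# Toy example with hypothetical component matrices.
rng = np.random.default_rng(1)
n1, m1 = 6, 4
P = 0.1 * np.eye(n1)
K1, K2 = 0.2 * rng.random((n1, n1)), 0.1 * rng.random((n1, n1))
F_lo, F_up = rng.random((n1, m1)), rng.random((n1, m1))
Y_lo, Y_up = solve_block_system(K1, K2, P, F_lo, F_up)
print(Y_lo.shape, Y_up.shape)
```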
4 Existence of a Unique Solution Recall that the original fuzzy Fredholm equation (1) was reduced to the system (2) over real vector functions, and then, after approximation by the inverse F-transforms of all involved functions, it was, secondly, reduced to the linear system (16) in terms of the F-transform components. In this section, we analyze the solvability of the final system (16) and do this in two steps. In Sect. 4.1, we formulate conditions that guarantee solvability of (2). Then, we discuss the solvability of the system (16).
4.1 Existence of a Unique Solution to Eq. (1)

Let g ∈ C[0, T] satisfy the Dini–Lipschitz condition, i.e.,

$$\gamma(\delta, g) \log(\delta) \to 0, \ \text{ provided that } \delta \to 0, \qquad (17)$$

where γ(δ, g) is the modulus of continuity of g with respect to δ. In the previous section, we observed that the solution of the fuzzy Fredholm integral equation (1) satisfies the system (3) of (non-fuzzy) Fredholm integral equations of the second kind, i.e.,

$$(I - R)\, y(t, r) = f(t, r), \qquad (18)$$
where I : (C_{DL}[0, T])² → (C_{DL}[0, T])² is the identity operator and R : (C_{DL}[0, T])² → (C_{DL}[0, T])² is a linear operator defined by

$$R(y(t, r)) := \int_0^T k(t, s)\, y(s, r)\, ds. \qquad (19)$$

The vector space (C_{DL}[0, T])² stands for the space of Dini–Lipschitz continuous vector functions, where

$$(C_{DL}[0, T])^2 = \{ g = [g_1, g_2] \mid g_1, g_2 \in C[0, T],\ g_1, g_2 \text{ satisfy (17)} \}$$

with $\|g\| = \max\{\|g_1\|_\infty, \|g_2\|_\infty\}$ for $g = [g_1, g_2] \in (C_{DL}[0, T])^2$. By [11], we know that if R is bounded and $\|R\| < 1$, then (I − R)^{−1} exists and is bounded. This assumption guarantees the existence of a unique solution to (18) and therefore to (3), and respectively to (1).
4.2 Existence of a Fuzzy Approximate Solution

Below, we analyze the solvability of the system (16). We denote

$$R_n(y(t, r)) := \int_0^T \begin{pmatrix} \varphi^T(t) K_1 \varphi(s) & -\varphi^T(t) K_2 \varphi(s) \\ -\varphi^T(t) K_2 \varphi(s) & \varphi^T(t) K_1 \varphi(s) \end{pmatrix} \begin{pmatrix} \underline{y}(s, r) \\ \overline{y}(s, r) \end{pmatrix} ds,$$

where K₁, K₂ are (n + 1) × (n + 1) matrices of F-transform components of the corresponding kernels k₊, k₋, see (9). The operator R_n is an approximate version of (19). As a first step, we examine the following approximate form of Eq. (18):

$$(I - R_n)\, y(t, r) = \varphi^T F \psi, \qquad (20)$$
where $F = [\underline{F}, \overline{F}]$, and by (9), $\varphi^T F \psi = [\varphi^T(t)\underline{F}\psi(r),\ \varphi^T(t)\overline{F}\psi(r)] \approx f(t, r)$. Let us recall the following general theorem.

Theorem 4 ([11]) Let R : X → X be a bounded linear operator in a Banach space X and let I − R be injective. Assume R_n is a sequence of bounded operators with

$$\|R - R_n\| \to 0 \qquad (21)$$
as n → ∞. Then for all sufficiently large n > n_0, the inverse operator (I − R_n)^{−1} exists and is bounded in accordance with

$$\|(I - R_n)^{-1}\| \le \frac{\|(I - R)^{-1}\|}{1 - \|(I - R)^{-1}\|\, \|R - R_n\|}. \qquad (22)$$
Let us apply Theorem 4 to our particular case and show that for every sufficiently large n > n_0, there exists an approximate solution to Eq. (20), determined by R_n. This means proving that the operators R_n in (20) fulfill assumption (21).

Lemma 9 Let k ∈ C([0, T]²) and f ∈ (C_{DL}([0, T] × [0, 1]))². Denote

$$M_{1,n} = \sup_{(s,t) \in [0,T]^2} |\varphi^T(t) K_1 \varphi(s) - k_+(s, t)|, \qquad M_{2,n} = \sup_{(s,t) \in [0,T]^2} |\varphi^T(t) K_2 \varphi(s) - k_-(s, t)|.$$

We claim that M_{1,n} → 0 and M_{2,n} → 0 as n → ∞.

Proof We remark that both M_{1,n} and M_{2,n} are defined correctly because k₊ and k₋ are in C([0, T]²). Then by Theorem 4.4 in [10], we know that for all ε > 0 there exists n_ε > 0 and a fuzzy partition A_0, ..., A_{n_ε} of [0, T] such that for all (s, t) ∈ [0, T]²,

$$|\varphi^T(t) K_1 \varphi(s) - k_+(s, t)| < \varepsilon, \qquad |\varphi^T(t) K_2 \varphi(s) - k_-(s, t)| < \varepsilon.$$

Therefore,

$$M_{1,n_\varepsilon} = \sup_{(s,t) \in [0,T]^2} |\varphi^T(t) K_1 \varphi(s) - k_+(s, t)| < \varepsilon, \qquad M_{2,n_\varepsilon} = \sup_{(s,t) \in [0,T]^2} |\varphi^T(t) K_2 \varphi(s) - k_-(s, t)| < \varepsilon.$$

We choose n_ε such that γ(T/n_ε, k₊) < ε/2, where γ is the modulus of continuity of the function k₊. Then by Theorem 4.5 in [10], we have that for all n ≥ n_ε and for all (s, t) ∈ [0, T]²,

$$|\varphi^T(t) K_1 \varphi(s) - k_+(s, t)| \le 2\gamma\!\left(\frac{T}{n}, k_+\right) \le 2\gamma\!\left(\frac{T}{n_\varepsilon}, k_+\right) < \varepsilon,$$
so that M_{1,n} < ε. Similarly, we have M_{2,n} < ε. To conclude, M_{1,n} and M_{2,n} converge to 0 as n → ∞. Using Lemma 9, we can easily prove assumption (21) of Theorem 4:

$$\|R - R_n\|_\infty = \sup_{\|y\|_\infty \le 1} \|(R - R_n) y\| = \sup_{\|y\|_\infty \le 1} \left\| \int_0^T \begin{pmatrix} \varphi^T(t) K_1 \varphi(s) - k_+ & -\varphi^T(t) K_2 \varphi(s) + k_- \\ -\varphi^T(t) K_2 \varphi(s) + k_- & \varphi^T(t) K_1 \varphi(s) - k_+ \end{pmatrix} \begin{pmatrix} \underline{y}(s, r) \\ \overline{y}(s, r) \end{pmatrix} ds \right\|_\infty. \qquad (23)$$
Considering the first row of (23), we have

$$\left\| \int_0^T \left( \varphi^T(t) K_1 \varphi(s) - k_+ \right) \underline{y}(s, r)\, ds - \int_0^T \left( \varphi^T(t) K_2 \varphi(s) - k_- \right) \overline{y}(s, r)\, ds \right\|_\infty \le M_{1,n} T \|y\|_\infty + M_{2,n} T \|y\|_\infty \le (M_{1,n} + M_{2,n}) T \|y\|_\infty.$$

Similarly, for the second row of (23), we have

$$\left\| \int_0^T \left( -\varphi^T(t) K_2 \varphi(s) + k_- \right) \underline{y}(s, r)\, ds + \int_0^T \left( \varphi^T(t) K_1 \varphi(s) - k_+ \right) \overline{y}(s, r)\, ds \right\|_\infty \le (M_{1,n} + M_{2,n}) T \|y\|_\infty.$$
Finally,

$$\|R - R_n\|_\infty \le (M_{1,n} + M_{2,n})\, T \to 0 \ \text{ as } n \to \infty,$$

and assumption (21) is confirmed. We continue the analysis of the solution to (16), using the same reasoning as in [11].

Theorem 5 Let k ∈ C([0, T]²), f ∈ (C_{DL}([0, T] × [0, 1]))² and $\|R\|_\infty < 1$. Then, the solution Y_n of the system (16) exists and it approximates the solution of (1) for a sufficiently large n > n_0.
5 Conclusion

In this paper, we proposed a new numerical method based on the theory of fuzzy (F-) transforms for solving the fuzzy Fredholm integral equation of the second kind. We analyzed the existence of both exact and approximate fuzzy solutions. Our further research will focus on the estimation of the rate of convergence of approximate solutions based on F-transforms of higher degrees.

Acknowledgment This work was partially supported by the project AI-Met4AI, CZ.02.1.01/0.0/0.0/17-049/0008414. The authors thank the anonymous reviewers for helpful comments, corrections, and improvements.
References

1. Agarwal, R.P., Baleanu, D., Nieto, J.J., Torres, D.F., Zhou, Y.: A survey on fuzzy fractional differential and optimal control nonlocal evolution equations. J. Comput. Appl. Math. 339, 3–29 (2018)
2. Alijani, Z., Kangro, U.: Collocation method for fuzzy Volterra integral equations of the second kind. Math. Model. Anal. 25, 146–166 (2020)
3. Bica, A.M., Ziari, S.: Iterative numerical method for solving fuzzy Volterra linear integral equations in two dimensions. Soft Comput. 21(5), 1097–1108 (2017)
4. Dubois, D., Prade, H.: Operations on fuzzy numbers. Int. J. Syst. Sci. 9, 613–626 (1978)
5. Haussecker, H., Tizhoosh, H.R.: Fuzzy image processing. In: Jähne, B., Haussecker, H., Geissler, P. (eds.) Handbook of Computer Vision and Applications, vol. 2, Academic Press (1999)
6. Hurtik, P., Madrid, N., Dyba, M.: Sensitivity analysis for image represented by fuzzy function. Soft Comput. 23, 1795–1807 (2019)
7. Khastan, A., Alijani, Z., Perfilieva, I.: Fuzzy transform to approximate solution of two-point boundary value problems. Math. Methods Appl. Sci. 40, 6147–6154 (2017)
8. Perfilieva, I.: Fuzzy transforms. Fuzzy Sets Syst. 157, 993–1023 (2006)
9. Perfilieva, I., Dankova, M., Bede, B.: Towards a higher degree F-transform. Fuzzy Sets Syst. 180, 3–19 (2011)
10. Novak, V., Perfilieva, I., Dvorak, A.: Insight into Fuzzy Modeling, 272 p. Wiley (2016)
11. Shiri, B., Perfilieva, I., Alijani, Z.: Classical approximation for fuzzy Fredholm integral equation. Fuzzy Sets Syst. 404, 159–177 (2021)
12. Tomasiello, S., Macias-Diaz, J.E., Khastan, A., Alijani, Z.: New sinusoidal basis functions and a neural network approach to solve nonlinear Volterra–Fredholm integral equations. Neural Comput. Appl. 31, 4865–4878 (2019)
13. Zadeh, L.A.: The concept of a linguistic variable and its application to approximate reasoning. Inf. Sci. 8, 199–249 (1975)
14. Zakeri, K.A., Ziari, S., Araghi, M.A.F., Perfilieva, I.: Efficient numerical solution to a bivariate nonlinear fuzzy Fredholm integral equation. IEEE Trans. Fuzzy Syst. 29(2), 442–454 (2021)
Constructing an Intelligent Navigation System for Autonomous Mobile Robot Based on Deep Reinforcement Learning Nguyen Thi Thanh Van, Ngo Manh Tien, Nguyen Manh Cuong, Ha Thi Kim Duyen, and Nguyen Duc Duy
Abstract Motion planning plays an essential role in motion control for autonomous mobile robots (AMRs). Once information about the operating environment and the robot's position has been obtained from a simultaneous localization and mapping (SLAM) system, a navigation system guarantees that the robot can autonomously and safely move to the desired position and simultaneously avoid any collisions. This paper presents an intelligent navigation system for unknown 2D environments based on deep reinforcement learning (DRL). Our work was constructed based on the Robot Operating System (ROS). The proposed method's efficiency and accuracy are shown in Gazebo simulation results and in results obtained with the physical robot.
N. T. T. Van (B)
Faculty of Electrical and Electronic Engineering, Phenikaa University, Hanoi, Vietnam
e-mail: [email protected]
N. M. Tien · N. M. Cuong
Institute of Physics, Vietnam Academy of Science and Technology, Hanoi, Vietnam
e-mail: [email protected]
H. T. K. Duyen · N. D. Duy
Hanoi University of Industry, Hanoi, Vietnam
e-mail: [email protected]
N. D. Duy
e-mail: [email protected]

1 Introduction

In intelligent robot systems, the autonomous mobile robot is an area that attracts significant attention from the scientific community because of its essential role in daily life and in automatic lines in industrial factories. Robots can move flexibly within a particular area and perform predetermined tasks in place of human workers, with benefits for the economy, labor productivity, and even human health. For AGVs (Automated Guided Vehicles) widely used in factories or
warehouses, robots that can automatically move to a target position and avoid obstacles along the way are increasingly being studied and are considered an alternative to the traditional navigation method based on magnetic tracks. Moreover, as the robot has to operate in an environment that continually changes due to human activity, it must move flexibly and adapt to the operating environment.

AGVs have a significant drawback: they require an available path to follow, such as a marked line, a magnetic stripe network glued to the floor, or wall-mounted reflectors. Therefore, when the production process or the product changes, the navigation system for the AGV must also be changed. This leads to downtime and additional costs. When AGVs meet obstacles, they stop until the obstacles are removed. Unlike AGVs, AMRs can avoid stationary or moving obstacles and re-route automatically when needed. The AMR's path changes automatically without human intervention. This feature makes operations more flexible and reduces the total cost of ownership. AMRs can navigate safely without using magnetic stripes or wall-mounted beacons. First, an AMR creates a base map using suitable sensors, which continuously detect its surroundings. As processes change, the AMR can easily create new routes or be reprogrammed for new tasks.

Typically, a mobile robot's intelligent navigation system consists of two main parts: a simultaneous localization and mapping (SLAM) system and a motion planning system. Complete autonomous mobile robot systems are designed based on the Robot Operating System (ROS), as shown in [1–3]. ROS is well suited for designing and programming consistent robotic systems, especially for combining, calibrating, and transmitting data from sensors to a central controller. In [1], a complete autonomous robot control system was designed on ROS. An AMR system used in warehouses was also constructed on the ROS platform [2]. Furthermore, practical AMR systems with fully built hardware and software connections using ROS have been studied [3]. Using algorithms from probabilistic robotics, an intelligent navigation system can be designed based on the environment map and the robot location obtained from SLAM algorithms, as shown in [4–7]. The SLAM system relies on odometry and data from vision sensors such as lidar to locate the robot and map the environment. The SLAM problem stems from mobile robot navigation through unknown areas, with no prior information about the environment. After conducting SLAM, environmental information is provided to the robot as a grid map of the environment that supports motion planning. Therefore, the robot can operate in unknown environments by running the SLAM system and a motion planning system simultaneously, as shown in [7]. However, combining these two systems imposes a computational burden, especially in spacious environments: SLAM requires large amounts of storage for map data, and when the robot is placed in a new operating environment, the SLAM system needs to start from the beginning.

Designing intelligent navigation systems based on deep reinforcement learning networks has been researched to improve robot performance instead of simultaneously using
SLAM and motion planning in unknown environments. A deep reinforcement learning network for a four-wheeled omnidirectional mobile robot with discrete output values was presented in [8]. However, with these discrete values, the velocity commands are constant and fixed during the movement process. Thus, in many cases, the robot cannot perform smooth and flexible movements, especially in dynamic environments. Additionally, deep reinforcement learning networks for navigation and motion planning have been designed for nonholonomic autonomous mobile robots in [9–12]. The results have shown the effectiveness of this approach compared to previous methods. In particular, in [11], a deep reinforcement learning network was constructed and verified for the case where the robot does not know the environment in advance, and tested on a physical model.

From the above analysis, this paper proposes an intelligent navigation system for AMR robots in an unknown flat environment based on deep reinforcement learning, with the following main contributions: (1) the output of the deep reinforcement learning network takes the dynamic constraints of the AMR into account and produces continuous values for the reference trajectory of the robot's kinematic model; (2) the input of the deep reinforcement learning network combines lidar signals and a 3D camera so that the robot can avoid obstacles of different sizes; (3) the AMR does not need prior information about the operating environment, and the robot can move automatically with the trained network model in an unknown 2D flat environment.

There are five main parts to this paper. After the introduction, the robot kinematic model is presented in Sect. 2. Next, the intelligent navigation system based on deep reinforcement learning is described in Sect. 3. The evaluation of the proposed method's effectiveness, with simulation and experiment results based on ROS, is shown in Sect. 4. Finally, Sect. 5 serves as the conclusion and outlines future work.
2 Kinematic Model of AMR Robot

The nonholonomic AMR robot was constructed with two actively driven wheels placed at the center and four load-bearing wheels at the corners of the robot to increase load capacity and balance the robot during operation. Robot movements are divided into two types of motion: translational motion and rotational motion. The robot kinematic model is used to compute the robot odometry and update the robot's current position. The intelligent navigation system can then calculate the reference trajectory for the robot from the change of state variables. The kinematic model is built on the robot model, as shown in Fig. 1. First, the state vector describing the robot position in the coordinate system is denoted by $q = [x\ y\ \theta]^T$, where x and y are the robot's coordinates along the Ox and Oy axes and θ is the robot's rotation angle. Besides, a vector $v = [v_x\ \omega]^T$ contains the velocity values in the body frame attached to the robot's center, as shown in Fig. 1.
Fig. 1 Model of an AMR robot
Considering the relative geometry in Fig. 1, the equation describing the relationship between the robot position in the global coordinate frame and the robot velocity is the kinematic equation of the AMR:

$$\dot{q} = \begin{pmatrix} \sin\theta & 0 \\ \cos\theta & 0 \\ 0 & 1 \end{pmatrix} v = H \cdot v, \qquad (1)$$

where H is a transition matrix. Next, v_r and v_l are defined as the right and left wheel speeds, respectively; ω_r and ω_l are the corresponding angular velocities. The robot velocity is calculated from the wheel speeds by the expression

$$v_x = \frac{v_r + v_l}{2} = R\, \frac{\omega_r + \omega_l}{2}, \qquad \omega = \frac{v_r - v_l}{d} = \frac{R (\omega_r - \omega_l)}{d}, \qquad (2)$$

where d is the distance between the two drive wheels and R is the wheel radius.
Constructing an Intelligent Navigation System …
255
Fig. 2 Laser scan signal from lidar sensor combined with the signal from the 3D camera
First, the lidar sensor generates laser scan values measuring the distance between the robot and environmental obstacles, which are sampled into 12 values and the scanning angle, as shown in Fig. 2. The signal is defined by T z t = z 1 z 2 ... z 12
(3)
The lidar values are distance values in the 2D plane at a specific elevation. However, during the movement, the obstacles below or above the range of the lidar will not be considered. Thus, signals from the 3D camera are used to return depth values taking into account obstructions of different sizes and heights. Measured values from the camera are normalized to laser scan signal form and sampled into four values with ◦ each angle equal to 15 , defined by T ct = c1 c2 c3 c4
(4)
An additional camera sensor is used to reduce the network input vector’s size, and the camera’s laser scan values will be combined with the corresponding lidar position values, as shown in Fig. 2. T The robot’s target position is defined by Pt = xdt ydt , and this value will be converted to the polar coordinate system when taken into account in the network. The T velocity value set for the robot at the time (t − 1) is vt−1 = vxt−1 ωt−1 . Finally, we have a network input vector of the form T St = Y Pt vt−1
(5)
Where Y is the vector value formed by combining laser scan signal between lidar and camera and normalized value in the range (0, 1).
256
N. T. T. Van et al.
3.2 Construct a Network Structure In this section, the neural network structure is constructed for the deep reinforcement learning based on the actor-critic model in [12], with the network’s output expanded for the computation of continuous outputs. From (5), St is also defined as the state vector for the deep reinforcement learning network, with the agent being the AMR T robot performing the actions vt = vxt ωt , which is also the output of the network. As shown in Fig. 3, every time a robot performs a deliberate action from the network, the environmental states are changed, and the robot will receive the corresponding reward point. The network will be trained to maximize the reward values. The proposed network structure is shown in Fig. 4, which consists of two networks: Actor and Critic networks.
Fig. 3 Principle of the deep reinforcement learning network
Fig. 4 Structure of the deep reinforcement learning network
Constructing an Intelligent Navigation System …
257
Actor-network performs the computation of the reference velocity values, which are also the actions the robot performs to interact with the environment and change the states at the time St . The network structure contains five fully connected layers, and then these values are fed into the first fully connected layer with the Sigmoid function to calculate the desired value for the translational velocity vx . It is also fed into the second fully connected layer with the Tanh function to calculate the angular velocity’s desired value ω. The robot’s translational velocity is dynamically constrained to within (0, 1) (m/s) by the Sigmoid function with the limit that the robot’s backward motion is not taken into account. Besides, the angular velocity is constrained by the Tanh function in the range (−1, 1) (rad/s). Therefore, velocity values are not only calculated as continuous values, but the kinematic constraints in terms of velocity limit have also been considered in the construction of the deep reinforcement learning structure. Additionally, the Critic network plays as an evaluation network based on the value returned by the Actor-network and the corresponding state values for that output value. The Critic network contains six fully connected layers and in which the computed actions from the Actor-network are put into the third layer. The last fully connected layer is the layer with a linear function to compute the Q-value. Through computational layers, the Q-value is used to evaluate the performance of the deep reinforcement learning network. The reward function used for the Critic network is proposed as follows ⎧ rn i f t < τd , t < τ ⎪ ⎪ ⎨ rlinear i f t < τd r (St , vt ) = ⎪ r i f min(Yt ) < τcol ⎪ ⎩ col λ(t − t−1 )
(6)
where rn is the positive reward point. When the distance between the robot’s current position and the target position t is less than the threshold value t, and the difference between the robot’s current rotation angle from the target position t is less than the threshold τd . rlinear is the positive reward point value when the distance from the robot’s current position to the target position is less than the threshold value. In contrast, rcol is a negative reward value when the robot is likely to collide with an environmental obstacle when one of the values is extracted from the observed value from the sensor Yt . In the last case, the reward point value will be calculated by λ(t − t−1 ) with λ as the parameter. This calculating will encourage the robot to tend to move closer towards the target position.
4 Simulation The intelligent navigation system with a unified robot system is built and executed on the ROS operating system both in the simulation environment and in the actual AMR robot model.
258
N. T. T. Van et al.
4.1 Simulation Result Robot simulation model and virtual environment for robot training are built on Gazebo software under the ROS operating system. The robot is built with a Rplidar sensor placed on top with a 3D camera in front. The blue area in Fig. 5 represents the laser scan signals from the sensors. The learning rate for the critic and actor-network is set to 0.0001, where the hyper-parameter for the selected reward function is a small positive value. With the two environments, the critic network’s reward value is shown in Fig. 6, with the red line being the reward point in the 1-st environment and the blue line being the reward in the 2-nd environment. Figure 7 shows how the reward value changes. In the beginning, the robot proceeds to randomly select the actions, so the robot goes far from the destination and collides with obstacles, and the bonus points earned will be negative. After about 500 epochs, the reward point value starts to tend to be positive since the neural network is now trained so that the robot can perform the task. After
Fig. 5 ARM model with Rplidar and Astra camera based on ROS (left) and simulation environment used to train neural networks and verify results (right)
Fig. 6 Environment 1 (left) and environment 2 (right)
Constructing an Intelligent Navigation System …
259
Fig. 7 Reward value
Fig. 8 Visualization Interface
thousands of epochs, the states of the medium are gradually determined. Therefore, the reward saw an uptrend. Once the robot’s knowledge reaches a certain threshold, the robot is capable of moving smoothly to the desired target, and that leads to an increase in the positive value of the points as well as verifying the effectiveness of the method.
260
N. T. T. Van et al.
4.2 Experiment Result The robot’s motion tracking software interface is built on the ROS, as shown in Fig. 8, with the image extracted from the camera in the lower-left corner. The tested robot moves with randomly arranged obstacles in the test environment. The results show that the robot can move to any position within the test environment range while avoiding obstacles while moving. Moreover, the intelligent navigation system for AMR robots can work effectively in a real environment where the robot has not previously known the information about the map or environmental obstacles.
5 Conclusion This paper has proposed a navigation system for AMR robots based on an enhanced deep learning method. The sensors’ data, including lidar and 3D cameras, are normalized and used as input to the neural network. A neural network consists of two subnets built and trained on the ROS operating system. With the combination of sensors for environmental recognition and trained deep learning networks, the robot can move to the target position in a real environment where the robot has no information before.
References 1. Köseoˇglu, M., Çelik, O.M., Pekta¸s, Ö.: Design of an autonomous mobile robot based on ROS. In: IDAP 2017 - International Artificial Intelligence and Data Processing Symposium (2017). https://doi.org/10.1109/IDAP.2017.8090199 2. Zhang, H., Watanabe, K.: ROS based framework for autonomous driving of AGVs. conf.ejikei.org. Published online 2019 3. Beinschob, P., Reinke, C.: Graph SLAM based mapping for AGV localization in large-scale warehouses. In: 2015 IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), pp. 245–248 (2015). https://doi.org/10.1109/ICCP.2015.7312637 4. Chen, Y., Wu, Y., Xing, H.: A complete solution for AGV SLAM integrated with navigation in modern warehouse environment. In: 2017 Chinese Automation Congress (CAC), pp. 6418– 6423 (2017). https://doi.org/10.1109/CAC.2017.8243934 5. Schueftan, D.S., Colorado, M.J., Mondragon Bernal, I.F.: Indoor mapping using SLAM for applications in Flexible Manufacturing Systems. In: 2015 IEEE 2nd Colombian Conference on Automatic Control (CCAC), pp. 1–6 (2015). https://doi.org/10.1109/CCAC.2015.7345226 6. Quang, H.D., Manh, T.N., Manh, C.N., et al.: Mapping and navigation with four-wheeled omnidirectional mobile robot based on robot operating system. In: 2019 International Conference on Mechatronics, Robotics and Systems Engineering (MoRSE), pp. 54–59 (2019). https://doi. org/10.1109/MoRSE48060.2019.8998714 7. Thanh, V.N.T., Manh, T.N., Manh, C.N., et al.: Autonomous navigation for omnidirectional robot based on deep reinforcement learning. Int. J. Mech. Eng. Robot. Res. 9(8), 1134–1139 (2020). https://doi.org/10.18178/ijmerr.9.8.1134-1139
Constructing an Intelligent Navigation System …
261
8. Tai, L., Paolo, G., Liu, M.: Virtual-to-real deep reinforcement learning: continuous control of mobile robots for mapless navigation. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 31–36 (2017). https://doi.org/10.1109/IROS.2017.8202134
9. Ota, K., Sasaki, Y., Jha, D.K., Yoshiyasu, Y., Kanezaki, A.: Efficient exploration in constrained environments with goal-oriented reference path. Published online 2020. http://arxiv.org/abs/2003.01641
10. Lillicrap, T.P., Hunt, J.J., Pritzel, A., et al.: Continuous control with deep reinforcement learning. In: 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings (2016)
11. Walenta, R., Schellekens, T., Ferrein, A., Schiffer, S.: A decentralised system approach for controlling AGVs with ROS. In: 2017 IEEE AFRICON: Science, Technology and Innovation for Africa, AFRICON 2017 (2017). https://doi.org/10.1109/AFRCON.2017.8095693
12. Niroui, F., Zhang, K., Kashino, Z., Nejat, G.: Deep reinforcement learning robot for search and rescue applications: exploration in unknown cluttered environments. IEEE Robot. Autom. Lett. 4(2), 610–617 (2019). https://doi.org/10.1109/LRA.2019.2891991
13. Kanezaki, A., Nitta, J., Sasaki, Y.: GOSELO: goal-directed obstacle and self-location map for robot navigation using reactive neural networks. IEEE Robot. Autom. Lett. 3(2), 696–703 (2018). https://doi.org/10.1109/LRA.2017.2783400
One-Class Support Vector Machine and LDA Topic Model Integration—Evidence for AI Patents Anton Thielmann, Christoph Weisser, and Astrid Krenz
Abstract The present contribution suggests a two-step classification rule for unsupervised document classification, using one-class Support Vector Machines and Latent Dirichlet Allocation Topic Modeling. The integration of both algorithms allows the usage of labelled, but independent training data, not stemming from the data set to be classified. The manual labelling when trying to classify a specific class from an unlabelled data set can thus be circumvented. By choosing appropriate document representations and parameters in the one-class Support Vector Machine, the differences between the independent training class and the data set to be classified become negligible. The method is applied to a large data set on patents for the European Union.
A. Thielmann (B)
Centre for Statistics, Georg-August-Universität Göttingen, Humboldtallee 3, 37073 Göttingen, Germany
e-mail: [email protected]
C. Weisser
Centre for Statistics and Campus-Institut Data Science (CIDAS), Georg-August-Universität Göttingen, Humboldtallee 3, 37073 Göttingen, Germany
e-mail: [email protected]
A. Krenz
Digital Futures at Work (Digit) Research Centre, University of Sussex, Jubilee Building, Brighton BN1 9SN, UK
e-mail: [email protected]

1 Introduction

A common problem in document classification is the inscrutable amount of often unlabelled data. Unsupervised document classification algorithms (see for example [1, 14, 27]) are often only used in order to get a broad overview of the topics in
these large data sets. A more thorough classification of documents often involves manually creating a labelled training data set. The present paper circumvents the manual labelling by making use of one-class Support Vector Machines (SVM) [24] and Latent Dirichlet Allocation (LDA) topic modeling [1] using a large data set on patent applications from the European Patent Office (EPO). Patent classification is an important task in assessing technological progress as well as defining innovative capability [5]. The international patent corpus is an enormous source which already contains more than 100 million patents. The patent data comes with a set of industrial classification codes, the so-called Cooperative Patent Classification (CPC). For the official patent data, the classification is mostly done manually, and there exists by now a set of approximately 250,000 predefined classification entries [8]. A problem arises for new patent applications from entirely new technology areas: there are simply no CPC codes available, and as such these patents cannot easily be assigned to a classification code. This applies to patents in the area of artificial intelligence (AI), for example. The classification problem gets more severe since the number of patents in the field of AI is increasing [31]. The primary goal of this paper is to circumvent some of the problems that come along with unsupervised document classification and to classify patents that cover the topic of AI. This is mainly achieved by taking advantage of the similarity between scientific papers and patent documents. The applied methods are arranged in an innovative manner to yield an order that allows topic models to contribute to classification problems beyond a mere descriptive function, resulting in a two-step classification rule. The approach can easily be transferred to other unsupervised document classification problems.
2 Related Literature This paper especially draws from the literature on natural language processing (NLP) and patent classification. The approaches to patent classification are manifold and range from simple keyword search [37] and subsequent classification to the application of neural networks [9, 15]. However, most of these classification approaches either make use of an already labelled data set [9, 15, 30] or have experts scan the data set and label it manually [34]. When it comes to unsupervised patent classification, the literature is more focused on either keyword approaches or topic modeling [28]. Other methods used in unsupervised document classification as k-means or sequential clustering [3, 27] are mostly used in hierarchical classification algorithms [6, 12] where unsupervised methods are combined with SVM or k-Nearest Neighbour (kNN) algorithms to obtain accurate classification results. The present classification thus applies two methods, the use of one-class SVM on a web-scraped data set and LDA topic modeling. To the best of our knowledge, one-class document classification has so far not been used in patent classification. This could be due to the complexity of the data, the large number of subclasses (250,000 CPC codes, and 70,000 International Patent
Classification (IPC) codes) and the possibility of multiple labelling. However, similar to the present approach, Fujino and Isozaki [10] made use of the similarity between scientific papers and patent abstracts, labelling scientific papers with IPC codes. The successful application of one-class SVM in document classification has already been shown by Manevitz and Yousef [17] and is used in a very similar way in the present approach. The present approach makes use of these two successful methods and combines them with LDA topic modeling [1].
3 Data The data set at hand is the so-called global PATSTAT [21] data set from the EPO, which contains bibliographical and legal event patent data from leading industrialized and developing countries. It comprises more than 100 million patents which are extracted from the EPO's various databases. In order to get a more suitably sized data set, the complete data set was filtered in the following way: Only patents of type 'A' (the pure patents) from 1980 to today are included. Excluded are all patents for which no address of a patent holder is known, or which do not originate from an EU country,1 or which are not written in English. This guarantees that we are working with data for the European Union. From the remaining roughly 540,000 patents, all abstracts are taken into account, preprocessed, and analyzed. The preprocessing of the text data follows the common text preprocessing in NLP [32, 35]. All words are put in lowercase letters and tokenized. All numbers and symbols are removed, and stopword removal [26] is applied using spaCy's built-in dictionary [11], extended by patent-specific words such as "method" or "patent". The remaining words are lemmatized, resulting in patent abstracts of 54 words on average.

1 Considering 28 EU countries, i.e. Austria, Belgium, Bulgaria, Czech Republic, Croatia, Cyprus, Denmark, Estonia, Finland, France, Germany, Great Britain, Greece, Hungary, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Netherlands, Poland, Portugal, Romania, Slovakia, Slovenia, Spain, Sweden.
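The preprocessing chain described above could be sketched roughly as follows. This is only a minimal illustration, assuming spaCy's small English model; the two extra stopwords are just the examples named in the text, not the full patent-specific list.

```python
# Rough sketch of the abstract preprocessing described above (not the authors' code).
# Assumes spaCy's small English model is installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
for word in ("method", "patent"):        # patent-specific stopwords (examples from the text)
    nlp.vocab[word].is_stop = True

def preprocess(abstract):
    """Lowercase, tokenize, drop numbers/symbols/stopwords, and lemmatize."""
    doc = nlp(abstract.lower())
    return [tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop]

print(preprocess("A method for controlling 3 robotic arms with a neural network."))
```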
4 Method In this section, the one-class SVM [24] for document classification, using term frequency-inverse document frequency (tf-idf) [23] vectors as input, and the LDA topic model [1] are described. We also describe how the training data is generated using abstracts of scientific papers. The general idea is to create a classification rule that classifies a majority of all positive documents as positive, while false positive classifications are negligible. Subsequently, the falsely positive classified documents are identified using LDA topic models. Before going into detail for each step, the algorithm as seen in Fig. 1 is summarized: First, the training data is generated by simple web-scraping, which means that the relevant data is extracted from a defined website [19]. The similarity between scientific paper abstracts and patent abstracts is used, and abstracts of scientific papers covering the relevant topic are scraped. Second, a one-class SVM is trained on the single-class training data, which consists of the scraped scientific papers. Third, the model predictions on the original data set are obtained. Subsequently, the so-classified patents are analyzed with the help of LDA topic models. The relevant topics are identified with the help of visual representations [4, 25]. Depending on the perceived quality of the one-class SVM classification and the LDA topic models, the patents related to the relevant topic will be identified.

Fig. 1 Prediction procedure
4.1 One-Class Support Vector Machines The obvious advantage of one-class classification algorithms is the fact that only one class needs to be synthesized. Schölkopf et al. [24] introduced such an algorithm, which extends the classical SVM [2, 33] algorithm and only incorporates a single class in training. Considering $x_1, \ldots, x_\ell \in X$ as training data belonging to a single class and $\ell \in \mathbb{N}$ being the number of observations, the classical SVM optimization problem is extended only slightly, to obtain a decision function $f(x)$ which is positive on $S$ (with $S$ as a subset of the feature space $H$) and negative on the complement $\bar{S}$, so that

$$f(x) = \begin{cases} +1 & \text{if } x \in S \\ -1 & \text{if } x \in \bar{S} \end{cases} \qquad (1)$$

Informally, the probability that a test point from the data set's probability distribution lies outside of $S$ is bounded by some a priori specified value $\nu \in (0, 1]$ [17, 24]. More formally, the optimization problem becomes:

$$\min_{w \in H,\ \xi \in \mathbb{R}_+^{\ell},\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu \ell} \sum_{i=1}^{\ell} \xi_i - \rho \qquad (2)$$

subject to: $(w \cdot \Phi(x_i)) \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, \ell$

where $\Phi$ is a kernel map $X \to H$, such that $\Phi$ can be computed by evaluating a simple kernel function, $w$ is the normal vector to the hyperplane, $\rho$ is an offset parameterizing the hyperplane in the feature space, and $\xi_i$ are nonzero slack variables. The tradeoff between the decision function

$$f(x) = \mathrm{sgn}\big((w \cdot \Phi(x)) - \rho\big) \qquad (3)$$

being positive for most of the training data $x_1, \ldots, x_\ell \in X$ and $\|w\|$ being small is controlled by $\nu$, which is firstly an upper bound on the fraction of outliers [24] and secondly a lower bound on the fraction of support vectors in relation to the total number of training data. In the present case, as we are using training data not originating from the original data set, finding the optimal $\nu$ is of crucial importance. The idea is to set $\nu$ very low and thus "create" a larger subset $S$ of the feature space $H$ in order to avoid overfitting problems and to obtain a classification rule that is applicable to a more diverse data set, thus integrating a training class that does not stem from the data set to be classified. Using the one-class SVM for document classification, the documents need to be represented in a more suitable way. Similar to Manevitz and Yousef [17], we used more than one document representation, namely a binary representation and a tf-idf representation [23], with

$$\text{tf-idf}(word) = frequency(word) \cdot \left[\log \frac{k}{K(word)} + 1\right] \qquad (4)$$

where $k$ is the total number of words in the dictionary and $K(word)$ gives the total number of documents the word appears in.
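For illustration, the first classification step might be set up in scikit-learn roughly as below, using the kernel, the L2-normalized tf-idf representation, and the very low ν = 0.005 reported in Sect. 5. This is a hedged sketch, not the authors' implementation; the two small lists only stand in for the scraped paper abstracts and the preprocessed patent abstracts.

```python
# Rough sketch of the one-class SVM step (illustrative, not the authors' code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

paper_abstracts = [                                   # placeholder scraped AI paper abstracts
    "neural network learn classification task",
    "artificial intelligence agent reinforcement learning",
    "language model text processing algorithm",
]
patent_abstracts = [                                  # placeholder preprocessed patent abstracts
    "neural network controller for robotic arm",
    "chemical compound for treating inflammation",
]

vectorizer = TfidfVectorizer(norm="l2")               # L2-normalized tf-idf representation
X_train = vectorizer.fit_transform(paper_abstracts)   # single (positive) training class
X_patents = vectorizer.transform(patent_abstracts)

ocsvm = OneClassSVM(kernel="rbf", nu=0.005)           # kernel and nu as reported in Sect. 5
ocsvm.fit(X_train)

pred = ocsvm.predict(X_patents)                       # +1 = candidate "AI patent", -1 = outlier
ai_candidates = [a for a, p in zip(patent_abstracts, pred) if p == 1]
print(ai_candidates)
```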
4.2 Topic Modeling To ensure the integration of the unrelated training data and thus justify the first step of the classification, we make use of LDA topic models [1]. LDA topic modeling is an unsupervised machine learning technique that detects word and phrase patterns, defined as topics, in a set of documents. The general idea is that topics are characterized by a distribution over words, no matter the positional occurrence of these words. Thus, as defined by Blei et al. [1], documents are represented by a random mixture over these latent topics. Basically, we look for the joint posterior probability of a distribution of topics for each document, N topics for each document, and a distribution of words for each topic, given the corpus of all documents. Formally, it can be stated as:

$$p(\theta, z, w \mid \alpha, \beta) \qquad (5)$$

with the corpus $D = \{w_1, \ldots, w_M\}$ consisting of $M$ documents, while a document $w$ is denoted as a sequence of words $w = (w_1, \ldots, w_N)$, $\theta$ representing the probability of the $i$-th document to contain the $j$-th topic, $\alpha$ representing the distribution-related parameters of a Dirichlet distribution, and $\beta$ representing the corpus-wide word probability matrix. Thus, the joint probability is equal to:

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (6)$$

with $z_n$, $n = 1, \ldots, N$, being the document topic variables associated with the corresponding words. In order to get a general understanding of the data and the topics that the data set covers, topic models could have already been applied to the original data set right from the beginning. However, by applying the topic models only to those documents classified as positive by the classifier, which is what our approach in this paper does, the resulting topics are much easier to interpret.
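A minimal sketch of the unigram LDA step (10 topics, as used in Sects. 4.3 and 5) is given below, using gensim as one possible implementation; the tokenized documents shown are placeholders, and the visual inspection mentioned in the text could be done with pyLDAvis, the Python port of LDAvis.

```python
# Illustrative unigram LDA sketch with gensim (not the authors' code).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

tokenized_docs = [                                   # placeholder preprocessed abstracts
    ["neural", "network", "train", "classification"],
    ["language", "process", "model", "text"],
    ["algorithm", "intelligence", "artificial", "learn"],
]

dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# 10 topics, simple unigram model, fixed seed for reproducibility
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_topics=10, num_words=5):
    print(topic_id, words)
```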
4.3 Training Class Since the patent data is not labelled in such a way that it is possible to create a training class that adequately represents AI, a different, more complex approach is required. To generate a valid data set as training data without requiring experts to scan and read a large part of the 540,000 patent abstracts, the similarity between scientific papers and patents is used, in a similar manner as done by Fujino and Isozaki [10]. Both document types have in common that a short summary of the full text, the abstract, is given at the beginning of the document. Scientific papers, however, often come with so-called "keywords", which thus implicitly assign a topic to the paper. As such, labelled training data was scraped from IEEE Xplore [13]. 1008 papers which had
the keyword "artificial intelligence" were scraped as the training class. One should keep in mind that other relevant keywords, such as "machine learning", could also be web-scraped, and one could try out different websites to scrape the training data from, which is something that could be done in future analyses. For our paper, however, we focused on the scraping of the main AI keyword. In order to confirm our expectation that the web-scraping of scientific AI papers was successful, LDA topic models were fitted on the corpus of the scientific AI papers. We kept the models simple and fixed the number of topics to 10, not using bigram or trigram models [36], but the simpler unigram LDA model [1]. To evaluate and interpret the model, we looked at LDAvis plots [25]. The results are very satisfying, with some of the most salient terms being "artificial", "network", "intelligence" or "algorithm". This can be further confirmed when looking at a wordcloud corresponding to one randomly selected topic generated by the LDA topic model (see Fig. 2).

Fig. 2 Wordcloud LDA topic model—web-scraping for scientific AI papers
5 Results Similar to Manevitz and Yousef [17], the one-class SVM yielded the best results when using the radial basis function (rbf) as kernel function. However, unlike Manevitz and Yousef [17], the best results in terms of document representation are achieved with the tf-idf representation, using Scikit-learn's [22] built-in vectorizer with L2 normalization, which accounts for different document lengths. The one-class SVM, using the rbf as kernel function and tf-idf document representations, classified 1272 patents as "AI patents". Relative to the roughly 540,000 original patents, this means that about 0.24% of all patents are classified as AI patents. We set ν very low, at 0.005, as we found that the web-scraping resulted in a very accurate training class which does not include a lot of outliers. Setting ν at 0.5, for example, yielded, as expected, significantly worse classification results. Furthermore, the main objective in the first classification level is to avoid relevant patents being classified as non-AI. The results of the topic modeling, again using simple unigram models with a fixed number of 10 topics, reveal that the patents
share a large resemblance with the scraped scientific AI papers, but do not include words such as "artificial" or "intelligence" (see Fig. 3). Using the LDAvis representation [25] to conduct further analyses on AI subcategories and heterogeneity, the topic models revealed two topics that clearly describe sub-fields of AI, namely NLP and neural networks. They include the prominent terms "neural" and "network" as well as "language" and "process" (see Fig. 4). Depending on the identified topics, as well as the underlying topic distributions, the final classification rule regarding the sub-set and heterogeneity analyses depends on the researcher and the research question. Choosing a very conservative approach and only selecting the patents whose most dominant topic is one of the two defined sub-fields of AI, we were able to identify 126 patents that make use of neural networks and 80 patents that make use of NLP. If, for example, we select those patents that have a prevalence of at least 15% for the "neural network" topic, we find 207 patents covering the topic of neural networks.

Fig. 3 Wordclouds for topics for the classified AI patents

Fig. 4 Wordclouds of two sub-fields of AI
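As a rough illustration of the prevalence rule just described, and continuing the names (`lda`, `dictionary`, `tokenized_docs`) from the gensim sketch in Sect. 4.2, the selection could look like the following; the topic index used for the "neural network" topic is purely hypothetical and would in practice be identified via LDAvis or wordcloud inspection.

```python
# Illustrative second-step selection: keep patents whose "neural network" topic
# prevalence is at least 15% (topic index 3 is a hypothetical example).
THRESHOLD = 0.15
NEURAL_NET_TOPIC = 3

selected = []
for tokens in tokenized_docs:          # here: tokenized abstracts classified as AI in step one
    bow = dictionary.doc2bow(tokens)
    doc_topics = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    if doc_topics.get(NEURAL_NET_TOPIC, 0.0) >= THRESHOLD:
        selected.append(tokens)
print(len(selected), "patents pass the 15% prevalence threshold")
```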
6 Conclusion The implementation of a one-class SVM and LDA topic modeling on an independent data set of European patents yields very good classification results. With this methodological approach, the problems of unsupervised document classification, namely the identification and classification of sparsely represented classes, can be circumvented without the need for manual labelling. The presented method is especially useful when a reliable extraction of a negative training class is not possible. A further extension could be to use the classification results from the LDA topic model to
train a different, more powerful classifier on the original data, such as Naive Bayes [18, 29], Random Forest [38], Artificial Neural Networks [20], or classical SVMs [7, 16], including a negative class during training. The idea of the current approach would then shift towards generating a suitable training data set and subsequently using classification algorithms to classify the documents.
References
1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
2. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152 (1992)
3. Buchta, C., Kober, M., Feinerer, I., Hornik, K.: Spherical k-means clustering. J. Stat. Softw. 50(10), 1–22 (2012)
4. Chaney, A.J.B., Blei, D.M.: Visualizing topic models. In: Sixth International AAAI Conference on Weblogs and Social Media (2012)
5. Chen, Y., Yang, Z., Shu, F., Hu, Z., Meyer, M., Bhattacharya, S.: A patent-based evaluation of technological innovation capability in eight economic regions in PR China. World Patent Inf. 31(2), 104–110 (2009)
6. Chen, Y.L., Chang, Y.C.: A three-phase method for patent classification. Inf. Process. Manage. 48(6), 1017–1030 (2012)
7. Colas, F., Brazdil, P.: Comparison of SVM and some older classification algorithms in text classification tasks. In: IFIP International Conference on Artificial Intelligence in Theory and Practice, pp. 169–178. Springer (2006)
8. EPO (2020). https://www.epo.org/index.html
9. Fall, C.J., Törcsvári, A., Benzineb, K., Karetka, G.: Automated categorization in the international patent classification. In: ACM SIGIR Forum, vol. 37, pp. 10–25. ACM, New York (2003)
10. Fujino, A., Isozaki, H.: Multi-label classification using logistic regression models for NTCIR-7 patent mining task. In: NTCIR. Citeseer (2008)
11. Honnibal, M., Montani, I.: spaCy 2: natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing (2017, to appear)
12. Hu, J., Li, S., Yao, Y., Yu, L., Yang, G., Hu, J.: Patent keyword extraction algorithm based on distributed representation for patent classification. Entropy 20(2), 104 (2018)
13. IEEE Xplore: Digital Library (2020). https://ieeexplore.ieee.org/Xplore/home.jsp
14. Ko, Y., Seo, J.: Automatic text categorization by unsupervised learning. In: COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics (2000)
15. Li, S., Hu, J., Cui, Y., Hu, J.: DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics 117(2), 721–744 (2018)
16. Liang, J.Z.: SVM multi-classifier and web document classification. In: Proceedings of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 04EX826), vol. 3, pp. 1347–1351. IEEE (2004)
17. Manevitz, L.M., Yousef, M.: One-class SVMs for document classification. J. Mach. Learn. Res. 2, 139–154 (2001)
18. McCallum, A., Nigam, K., et al.: A comparison of event models for Naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, vol. 752, pp. 41–48 (1998)
19. Mitchell, R.: Web Scraping with Python: Collecting more Data from the Modern Web. O'Reilly Media, Inc, Newton (2018)
20. Moraes, R., Valiati, J.F., Neto, W.P.G.: Document-level sentiment classification: an empirical comparison between SVM and ANN. Expert Syst. Appl. 40(2), 621–633 (2013)
21. Patstat Data Catalog (EPO) 2019 Autumn Edition. EPO World-wide Patent Statistical Database
22. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
23. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)
24. Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C.: Estimating the support of a high-dimensional distribution. Neural Comput. 13(7), 1443–1471 (2001)
25. Sievert, C., Shirley, K.: LDAvis: a method for visualizing and interpreting topics. In: Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pp. 63–70 (2014)
26. Silva, C., Ribeiro, B.: The importance of stop word removal on recall values in text categorization. In: Proceedings of the International Joint Conference on Neural Networks, vol. 3, pp. 1661–1666. IEEE (2003)
27. Slonim, N., Friedman, N., Tishby, N.: Unsupervised document classification using sequential information maximization. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 129–136 (2002)
28. Suominen, A., Toivanen, H., Seppänen, M.: Firms' knowledge profiles: mapping patent data with unsupervised learning. Technol. Forecast. Soc. Chang. 115, 131–142 (2017)
29. Ting, S., Ip, W., Tsang, A.H.: Is Naive Bayes a good classifier for document classification. Int. J. Softw. Eng. Appl. 5(3), 37–46 (2011)
30. Tran, T., Kavuluru, R.: Supervised approaches to assign cooperative patent classification (CPC) codes to patents. In: Ghosh, A., Pal, R., Prasath, R. (eds.) Mining Intelligence and Knowledge Exploration, pp. 22–34. Springer, Cham (2017)
31. Tseng, C.Y., Ting, P.H.: Patent analysis for technology development of artificial intelligence: a country-level comparative study. Innovation 15(4), 463–475 (2013)
32. Uysal, A.K., Gunal, S.: The impact of preprocessing on text classification. Inf. Process. Manage. 50(1), 104–112 (2014)
33. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
34. Venugopalan, S., Rai, V.: Topic-based classification and pattern identification in patents. Technol. Forecast. Soc. Chang. 94, 236–250 (2015)
35. Vijayarani, S., Ilamathi, M.J., Nithya, M.: Preprocessing techniques for text mining - an overview. Int. J. Comput. Sci. Commun. Netw. 5(1), 7–16 (2015)
36. Wang, S.I., Manning, C.D.: Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 90–94 (2012)
37. WIPO: Technology Trends 2019: Artificial Intelligence. World Intellectual Property Organization, Geneva (2019)
38. Xu, B., Guo, X., Ye, Y., Cheng, J.: An improved random forest classifier for text categorization. J. Comput. 7(12), 2913–2920 (2012)
HDBSCAN: Evaluating the Performance of Hierarchical Clustering for Big Data

Tat-Huy Tran, Tuan-Dung Cao, and Thi-Thu-Huyen Tran
Abstract Fast technology development in recent years has led to a rapid increase in data generation. Most of the data are raw and need to be classified and processed into clean, usable data. However, classifying a huge volume of data is not an easy task, and it requires massive human effort to label the data. To assist humans and shorten this labelling process, data clustering is one of the most used methods in the first step of data cleaning. Among popular data clustering algorithms, Hierarchical Clustering is one of the best at separating noise and recognizing data patterns. In this paper, we conducted intensive experiments to verify the effectiveness of Hierarchical Clustering on multiple datasets using Apache Spark. We evaluated Hierarchical Clustering by comparing its performance on datasets of different sizes; their volumes are provided in the experiment section.
1 Introduction Over the past decade, data has grown on a large scale across different industries. Along with that, the era of data mining and artificial intelligence has opened up: more and more applications based on artificial intelligence are developed, and the demand for labeled data is growing. It has been shown that simple models trained with huge amounts of data are often better than sophisticated models trained with small data. This has led to data labeling techniques that use clustering as an unsupervised method [15]. Clustering algorithms have long been powerful tools for analyzing and exploring patterns of similarity in datasets. Density-based clustering is a clustering method which divides the dataset into high-density regions and low-density regions. The data points which are not in high-density regions are marked as noise. Among
clustering algorithms, hierarchical density-based clustering is chosen because it does not require a user-defined threshold and provides a description of the intrinsic data structure. However, for large data, the disadvantage is that the demand for processing resources makes it difficult to obtain results, and probably impossible on a personal computer, as the time complexity of hierarchical clustering has a quadratic lower bound in the number of data points [16]. To overcome the time complexity of hierarchical density-based clustering, we implement a parallel version of it. Apache Spark is a set of cluster computing tools based on in-memory (RAM) computation and optimized query execution, which increases processing speed and allows computation across distributed systems. Its programming model requires splitting the data across the nodes of the cluster to form a resilient distributed dataset (RDD), an immutable collection of elements that can be operated on in parallel. Hierarchical clustering needs to make pairwise comparisons between data points, so running it in parallel effectively is non-trivial. In this paper, we implement and evaluate a parallel version of single-linkage HDBSCAN over Spark to handle large and diverse volumes of data. The experiments show the feasibility of the algorithm on big datasets and which factors need to be considered when running it. Several performance metrics are collected: (1) the total execution time, (2) the merge factor, and (3) the speedup. The total execution time considers the run time of the whole program for different numbers of cores; Spark is greatly scalable as the number of cores increases. The merge factor indicates the number of spanning trees merged in each iteration. The speedup is the metric used to illustrate the behaviour of the algorithm when the size of the cluster scales up and the amount of data increases. The rest of the paper is organized as follows. In Sect. 2, we briefly introduce related works. Section 3 presents theories and concepts related to the approach and configuration. Section 4 covers the experiments and evaluation results for the different metrics. Section 5 gives a brief conclusion and outlines future work.
2 Related Works According to Rokach [18], clustering algorithms can be grouped into the following types: Hierarchical, Partitioning, Density-based, Model-based, Grid-based, and Soft-computing. Most of these algorithms have disadvantages in dealing with the randomness and heterogeneity of unexplored data: partitioning and soft-computing methods require a pre-defined number of clusters; model-based methods can only detect known groups of data; grid-based methods use a finite number of partitions to do the clustering. Only density-based and hierarchical methods are suitable. Density-based methods are more dynamic: they assume the data is a mixture of several distributions, so they can discover clusters in arbitrary data. Meanwhile, hierarchical methods are good at representing the data characteristics in the form of a tree, which is then divided into clusters by cutting the tree at the desired level. However, these algorithms are not designed to run on a massive volume
of data within limited resources, such as a personal computer, or their parallel versions are not available. MapReduce is a programming model for processing large datasets. The idea of this model is to divide the problem into smaller parts, process those small parts in parallel and independently on distributed computers, and then synthesize the obtained results to get the final result. Apache Spark [2] is an open-source cluster computing framework that allows building predictive models quickly, with calculations performed on a group of computers; it can operate on the entire data at once without having to extract samples for test calculations. Spark's high processing speed is due to the computation being performed on many different machines concurrently, with calculations done in memory, i.e., entirely in RAM. Methods that apply MapReduce mainly aim to parallelize the clustering algorithm on small subsets of the data and then aggregate the partial results into the final result. For instance, Shen Wang and Haimonti Dutta developed PARABLE [3] (or PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm), a two-step algorithm. First, it creates local clusters at server nodes using the map and reduce steps. Then, it aligns the local dendrograms together to aggregate the results. Specifically, it randomly splits the large data set into several smaller partitions on the mapper. Each partition is distributed to a reducer, on which the sequential hierarchical clustering algorithm is executed. After this local clustering step, the dendrograms obtained at each reducer are aligned together. Other parallel and distributed hierarchical clustering algorithms have been examined in several studies, such as [5–8], which were aimed at the Single-LINKage (SLINK) clustering algorithm [11]; single linkage can be seen as a special case of HDBSCAN [9, 10]. SHRINK (SHaRed-memory SLINK) [4] is a shared-memory algorithm for hierarchical clustering that uses the single-linkage method to combine clusters bottom-up over overlapping subproblems. The main parallelization strategy of SHRINK is to divide the original dataset into overlapping subsets, calculate the hierarchical dendrogram for each subset using the state-of-the-art SHC algorithm SLINK [11], and reconstruct the dendrogram for the full dataset by combining the solutions of the subsets. MapReduce implementations of partitioning density-based clustering algorithms have also been proposed, particularly for DBSCAN [12–14]. However, DBSCAN has some limitations: the choice of the density threshold is both difficult and critical; the algorithm may not be able to distinguish between clusters of very different densities; and a flat clustering solution cannot describe hierarchical relationships that may exist between nested clusters at different density levels [15]. Fortunately, HDBSCAN does not have these limitations. In this paper, we present and evaluate the performance of hierarchical clustering for big data.
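To make the RDD/MapReduce idea described above concrete, a tiny PySpark example is sketched below (a word count rather than clustering). It is only an illustration of the programming model, not part of the paper's implementation; the local master setting is an assumption for running it on a single machine.

```python
# Tiny PySpark illustration of the MapReduce/RDD programming model (word counting).
from pyspark import SparkContext

sc = SparkContext("local[*]", "mapreduce-sketch")           # local master, assumed for the demo
lines = sc.parallelize(["a b a", "b c", "a c c"])           # an RDD partitioned across workers

counts = (lines.flatMap(lambda line: line.split())          # map: emit one record per word
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))            # reduce: aggregate partial results

print(counts.collect())
sc.stop()
```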
3 Efficient Spark HDBSCAN In this section, we present the parallel HDBSCAN algorithms using Spark.
3.1 Choosing the Configuration Along with the selection of an appropriate distance measure for the data points, the selection of the linkage (association) criterion between clusters also affects the final result. The linkage criterion between sets of observed data points defines how the distance between two sets of points is determined, and it is then used to build the hierarchical clustering tree (dendrogram). Common linkage criteria are single linkage, complete linkage, and average linkage. Single linkage takes the shortest distance between two points where each point belongs to a different cluster, complete linkage takes the farthest distance between data points of different clusters, and average linkage is the average distance between points belonging to different clusters. Figure 1 illustrates these criteria. Each criterion has its own strengths and weaknesses. The problem with single linkage is that two clusters can be lumped together if there is a single pair of closely spaced points between them, even though most of the remaining points of one cluster are far from the points of the other cluster compared to another nearby cluster. Complete linkage encounters the same issue but in the opposite direction, and only average linkage is less affected by these effects. However, in this work we still use single linkage to build the hierarchical clustering tree, because the number of data points is much larger than the number of clusters. Moreover, single linkage has a more favourable algorithmic complexity than the other types. Besides, single linkage is close to real-world decisions, where closely spaced data points are combined into one cluster. Many previous studies also selected single linkage as the criterion for the hierarchical clustering algorithm.
Fig. 1 Link criteria
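The three criteria illustrated in Fig. 1 can be made concrete with a small NumPy sketch that computes each linkage distance between two toy clusters; this is purely illustrative and not part of the paper's implementation.

```python
# Illustrative computation of the three linkage criteria between two small clusters
# (Euclidean distance).
import numpy as np

def pairwise_distances(A, B):
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])

d = pairwise_distances(A, B)
print("single linkage  :", d.min())    # closest pair across the clusters
print("complete linkage:", d.max())    # farthest pair across the clusters
print("average linkage :", d.mean())   # mean over all cross-cluster pairs
```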
3.2 Separation Problem To implement the problem efficiently, we divide the original problem into smaller subproblems, solve the individual subproblems, and then combine their solutions into a complete solution; this is also the idea of parallel computation. According to one study, the computation of a single-linkage hierarchical clustering tree is equivalent to finding the minimum spanning tree of a complete weighted graph whose vertices are the data points and whose edge weights are the distances between pairs of points. Thus, we reduce the hierarchical clustering problem to the problem of finding minimum spanning trees. Given the initial data set D, we divide it into two subsets D1 and D2, so the graph of set D is divided into 3 subgraphs: G(D1), G(D2), and G_B(D1, D2), where G_B(D1, D2) is a bipartite graph between the sets D1 and D2. In this way, any edge of the original graph lies in one of the 3 subgraphs; in other words, pooling the 3 subgraphs recovers the original graph. Therefore, we can split the original graph into many parts, yielding many subproblems. Specifically, if we divide the initial data set into s subsets, we obtain s complete subgraphs and C_s^2 bipartite subgraphs, one for each pair of subsets. For each subgraph, we apply a spanning tree algorithm and then combine the resulting sub-trees to get the solution of the original graph (Fig. 2).

Fig. 2 Parallel hierarchical clustering algorithm using Apache Spark
3.3 The Parallel HDBSCAN Algorithms Based on the above idea, we now present the divide-and-conquer algorithm for the hierarchical clustering problem. After dividing the original problem into many smaller subproblems, we apply an algorithm to find the minimum spanning tree of each of them. For weighted graphs, three commonly used techniques are Boruvka, Kruskal, and Prim. Boruvka's algorithm identifies the least-weighted edge incident to each vertex and then forms a contracted graph with the number of vertices halved; the algorithm therefore takes O(E log V), where E is the number of edges and V is the number of vertices. Kruskal's algorithm initializes a forest with each vertex as an individual tree and repeatedly selects, from the unused edges, the least-weighted edge that does not create a cycle, merging two trees at a time, until all vertices belong to a single tree. Both of these algorithms require the weights of all edges to be available so that the edge with the least weight can be selected in each iteration. In contrast, the Prim algorithm starts at an arbitrary vertex as the root of the minimum spanning tree and then grows the tree edge by edge until all vertices are included. At each iteration, only local information of one vertex is required. Moreover, on a complete weighted graph, the Prim algorithm takes only O(V^2) time and O(V) space, and it is therefore the best choice here for finding the minimum spanning tree. When analyzing the problem, we have two types of graphs to deal with: the complete weighted graph and the complete bipartite graph. For the first type, we start the algorithm at a vertex v0 in the list of vertices. While calculating the distance
from v0 to the other vertices, we record the least-weighted edge and send the corresponding edge to the reducer (in this way, we do not need to store all of the minimum spanning tree information). v0 is then removed from the list of vertices, and the other endpoint of the sent edge is selected as the vertex to consider next. This process is repeated until all vertices are added to the tree. Therefore, our algorithm keeps quadratic time complexity and linear space complexity. For the second type of graph, the bipartite graph, we distinguish a left set and a right set. Unlike the previous case, we need to store edge weights for each side. Initially, we select a vertex v0 in the left set and store the least-weighted edge (v0, vt) into the right set. In the next step, we continue with vertex vt, but we do not revisit the edge containing v0. Thus, the endpoint of the edge with the least weight becomes the vertex to be considered in the next iteration. The process is repeated until all vertices are reached. Therefore, the algorithm has a complexity of O(mn) in time and O(m + n) in space, where m and n are the sizes of the two vertex sets.
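As a rough illustration of the O(V^2)-time, O(V)-space Prim variant for the complete graph case described above, a minimal sequential sketch is given below; it is not the paper's Spark implementation and omits the reducer communication.

```python
# Minimal sketch of Prim's algorithm on a complete weighted graph over data points,
# relaxing all remaining vertices from the most recently added vertex (O(V^2) time).
import numpy as np

def prim_mst(points):
    n = len(points)
    in_tree = np.zeros(n, dtype=bool)
    best_dist = np.full(n, np.inf)            # cheapest known edge into the tree, per vertex
    best_from = np.full(n, -1, dtype=int)
    edges = []

    current = 0
    in_tree[current] = True
    for _ in range(n - 1):
        d = np.linalg.norm(points - points[current], axis=1)
        update = (~in_tree) & (d < best_dist)
        best_dist[update] = d[update]
        best_from[update] = current

        candidates = np.where(~in_tree, best_dist, np.inf)
        nxt = int(np.argmin(candidates))
        edges.append((int(best_from[nxt]), nxt, float(best_dist[nxt])))
        in_tree[nxt] = True
        current = nxt
    return edges                               # n-1 MST edges as (u, v, weight)

pts = np.random.RandomState(0).rand(6, 2)
print(prim_mst(pts))
```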
Next, we combine all the partial minimum spanning trees with the minimum spanning tree T obtained in the previous calculation step to get the final result. In the extreme case, all sub-spanning trees and T could be lumped together and processed at once. Therefore, we extend the tree aggregation procedure into multiple iterations by using a merge factor K, such that only K sub-spanning trees are combined at each iteration step. To efficiently aggregate the partial minimum spanning trees, we use a union-find data structure to keep track of which component each vertex belongs to. Recalling how the subgraphs are created, most contiguous subgraphs share half of their data points. Therefore, among every K successive subgraphs there is a large number of shared vertices, and by aggregating their K minimum spanning trees we can detect and eliminate incorrect edges at an early stage, minimizing the cost of the algorithm.
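The merge step described above can be sketched with a simple union-find structure: the edges of the partial spanning trees are processed in order of weight and an edge is kept only if it connects two different components (a Kruskal-style combination). The sketch below is illustrative only; the edge lists are placeholders.

```python
# Sketch of merging partial spanning trees with union-find (not the paper's Spark code).
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False                                   # edge would close a cycle
        self.parent[rb] = ra
        return True

def merge_partial_msts(edge_lists, n_vertices):
    edges = sorted((e for lst in edge_lists for e in lst), key=lambda e: e[2])
    uf = UnionFind(n_vertices)
    return [e for e in edges if uf.union(e[0], e[1])]      # drop cycle-forming edges

# Example with two partial trees over 4 vertices (weights are made up):
t1 = [(0, 1, 0.5), (1, 2, 1.0)]
t2 = [(2, 3, 0.7), (0, 3, 2.0)]
print(merge_partial_msts([t1, t2], 4))
```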
4 Experimental Evaluation 4.1 Experimental Setup The Hortonworks Data Platform (HDP) is popularly used by businesses to manage, store, process, and analyze big data. In our experiments, we run the algorithm on a server cluster with HDP 2.7 installed alongside Apache Spark 2.3.0. The server cluster consists of 12 64-bit machines, with their detailed configuration shown in Table 1.
4.2 Datasets To evaluate the performance of the HDBSCAN algorithm with the Spark implementation, we conducted a series of experiments on datasets of different sizes and dimensionalities. We collected three real datasets (HT Sensor, Poker, Skin) from different domains, available in the UCI data warehouse [1]. Besides, we created three Gaussian mixture datasets (Rand500k, Rand1m1, Rand1m5) using the MixSim R package [19]. The details of the datasets are presented in Table 2.
Table 1 Cluster setup
Architecture: 64-bit
Chip: Intel Xeon series
Memory: 1536 GB
Number of physical cores: 288
Table 2 Properties of the data sets

Data set  | Number of objects | Number of attributes | Number of clusters
HT sensor | 919438            | 11                   | 3
Poker     | 1025010           | 11                   | 10
Skin      | 245057            | 4                    | 2
Rand500k  | 500000            | 20                   | 10
Rand1m1   | 1100000           | 30                   | 20
Rand1m5   | 1500000           | 40                   | 30
4.3 Runtime Evaluation This evaluation aims to find out which factors affect the total execution time, beyond the complexity of the algorithm alone. In Fig. 3, the Skin dataset runs about 5 times faster than the HT Sensor dataset; the factor behind this result is that the number of data points of the Skin dataset is about 4 times lower than that of the HT Sensor dataset. Meanwhile, the reason the HT Sensor dataset runs nearly 2 times faster than Poker is different. Although the Poker dataset and the HT Sensor dataset have nearly the same number of data points and the same number of attributes, the number of clusters in the Poker dataset is larger than in the HT Sensor dataset. This shows that the number of clusters in a dataset has a huge impact on the total runtime: at the subgraph merging phase, a high number of clusters requires more sorting and merging work, even though the number of data points in each stage of the two datasets is nearly the same.
Fig. 3 The total runtime of different datasets followed by the merge factor K when run on 32 cores
Fig. 4 The speedup score on datasets using 32–128 cores
4.4 The Merge Factor K As we can see in Fig. 4, higher volumes of data achieve better parallelism when run with a larger K, whereas smaller datasets do well with a small K.
4.5 The Speedup Score To cope with a massive volume of data, the clustering algorithm should be scalable as the size of the cluster increases. In order to evaluate this characteristic, we use the speedup score introduced by Chen Jin [17]. The speedup score is $\text{Speedup} = \frac{p_0\, t_{p_0}}{t_p}$, where $p_0$ is the lowest number of cores in the test and $t_p$ is the execution time on $p$ cores. Figure 5 shows the results on the two datasets HT Sensor and Skin with different K factors. It shows that the size of the dataset affects the speedup score, and that a higher K implies a smaller number of reducers and a smaller degree of parallelism.
Fig. 5 The speedup score followed by merge factor K
5 Conclusions and Future Works Through this study, we conclude that when using distributed computing with the support of the Apache Spark open-source framework, execution speed can be significantly improved, and large datasets can be handled better by adding more computing cores. In the future, we will try to improve the performance of the parallel version of HDBSCAN by minimizing repetitive tasks, optimizing the cost of moving data between servers during computation, and exploring better approaches to increase the processing speed as well as the quality of the clustering.
References
1. Bache, K., Lichman, M.: UCI machine learning repository (2013). https://archive.ics.uci.edu/ml/datasets.php
2. https://spark.apache.org/docs/latest/
3. Wang, S., Dutta, H.: PARABLE: a PArallel RAndom-partition based HierarchicaL ClustEring algorithm for the MapReduce framework. Comput. J. 16, 30–34 (2011). https://doi.org/10.1093/comjnl/16.1.30
4. Hendrix, W., Ali Patwary, M.M., Agrawal, A., Liao, W., Choudhary, A.: Parallel hierarchical clustering on shared memory platforms. In: 2012 19th International Conference on High Performance Computing, Pune, pp. 1–9 (2012). https://doi.org/10.1109/HiPC.2012.6507511
5. Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Comput. 21, 1313–1325 (1995)
6. Tsai, H.R., Horng, S.J., Lee, S.S., Tsai, S.S., Kao, T.W.: Parallel hierarchical clustering algorithms on processor arrays with a reconfigurable bus system. Pattern Recogn. 30(5), 801–815 (1997)
7. Wu, C.H., Horng, S.J., Tsai, H.R.: Efficient parallel algorithms for hierarchical clustering on arrays with reconfigurable optical buses. J. Parallel Distrib. Comput. 60(9), 1137–1153 (2000)
8. Rajasekaran, S.: Efficient parallel hierarchical clustering algorithms. IEEE Trans. Parallel Distrib. Syst. 16(6), 497–502 (2005). http://doi.ieeecomputersociety.org/10.1109/TPDS.2005.72
9. Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) Advances in Knowledge Discovery and Data Mining, pp. 160–172. Springer, Heidelberg (2013)
10. Campello, R.J.G.B., Moulavi, D., Zimek, A., Sander, J.: Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 10(1), 5:1–5:51 (2015). https://doi.org/10.1145/2733381
11. Sibson, R.: SLINK: an optimally efficient algorithm for the single-link cluster method. Comput. J. 16, 30–34 (1973). https://doi.org/10.1093/comjnl/16.1.30
12. Gulati, R.: Efficient parallel DBSCAN algorithms for bigdata using MapReduce. Master's thesis, Computer Science and Engineering Department, Thapar University (2016)
13. He, Y., Tan, H., Luo, W., Mao, H., Ma, D., Feng, S., Fan, J.: MR-DBSCAN: an efficient parallel density based clustering algorithm using MapReduce. In: 2011 IEEE 17th International Conference on Parallel and Distributed Systems, pp. 473–480 (2011). https://doi.org/10.1109/ICPADS.2011.83
14. Kim, Y., Shim, K., Kim, M.S., Lee, J.S.: DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce. Inf. Syst. 42, 15–35 (2014). https://doi.org/10.1016/j.is.2013.11.002
15. Roh, Y., Heo, G., Whang, S.E.: A survey on data collection for machine learning: a big data - AI integration perspective. IEEE Trans. Knowl. Data Eng. https://doi.org/10.1109/TKDE.2019.2946162
16. Santos, J., Syed, T., Naldi, M.C., Campello, R.J.G.B., Sander, J.: Hierarchical density-based clustering using MapReduce. IEEE Trans. Big Data. https://doi.org/10.1109/TBDATA.2019.2907624
17. Jin, C., Liu, R., Chen, Z., Hendrix, W., Agrawal, A., Choudhary, A.: A scalable hierarchical clustering algorithm using spark. In: 2015 IEEE First International Conference on Big Data Computing Service and Applications, Redwood City, CA, pp. 418–426 (2015). https://doi.org/10.1109/BigDataService.2015.67
18. Rokach, L., Maimon, O.: Clustering methods. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook. Springer, Boston (2005). https://doi.org/10.1007/0-387-25465-X_15
19. Melnykov, V., Chen, W.C., Maitra, R.: MixSim: an R package for simulating data to study performance of clustering algorithms. J. Stat. Softw. 51(12), 1–25 (2012). http://www.jstatsoft.org/v51/i12/
Applying Deep Reinforcement Learning in Automated Stock Trading

Hieu Trung Nguyen and Ngoc Hoang Luong
Abstract The continuously changing nature of stock markets poses a non-trivial challenge to building automated trading agents that timely assist traders in their decision making. Fixed sets of trading rules and offline pre-trained models are inefficient at adapting to the real-time fluctuations of stock markets. Deep Reinforcement Learning (DRL) algorithms train autonomous agents that could potentially address such highly dynamic environments by integrating the generalization power of artificial neural networks with online learning through experiences gained in interactions with the environment. We investigate the capability of three DRL algorithms, namely Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC), for tackling the automated stock trading problem. We employ two datasets, from the US and Vietnam markets, with different characteristics and market trends, on which the DRL-trained agents are compared against a random trading agent. Experimental results indicate the potential of these DRL algorithms and also exhibit their pitfalls when applied to datasets where the majority of the stocks are not up-trending. We propose a simple, but effective, technique to assist the agents in minimizing their losses.
1 Introduction Statistical models and computer algorithms have been widely employed by companies and individuals to assist their decision making on stock trading markets. A large part of today's transactions is automatically performed by means of algorithmic trading, i.e., pre-programmed trading strategies of buying, selling, or holding shares of different stocks with respect to the current market state. Quantitative trading,
a type of trading strategy that relies on mathematical and statistical models to identify and execute opportunities, dominates the foreign exchange trading market, contributing a total of 80% [3]. Data-centric applications in quantitative trading can be categorized into three cultures [2, 4]: the data modeling culture, the machine learning culture, and the algorithmic decision-making culture. In the data modeling culture, it is assumed that the trading market is a black box comprising an uncomplicated model which generates observational data. In contrast, those following the machine learning culture treat the world of finance much as a living creature which can evolve over time. Therefore, complicated models are applied, not trying to describe the underlying nature, but to approximate emerging behaviours of the markets. In the algorithmic decision-making culture, the focus shifts from model building to training automated agents to adapt to the dynamics of the world, avoiding pitfalls through negative reward signals and maximizing accumulated returns. Reinforcement Learning (RL [12]) has long been an Artificial Intelligence paradigm to train autonomous agents to operate in complicated and dynamic environments. Various works have been conducted to investigate the potential of employing RL agents for automated stock trading. Ponomarev et al. [10] applied the Asynchronous Advantage Actor-Critic (A3C) learning method to address portfolio management. Xiong et al. [13] designed a stock trading strategy with the Deep Deterministic Policy Gradient (DDPG) algorithm. Yuan et al. [14] proposed a data augmentation framework which uses minute data to train and compare Soft Actor-Critic (SAC), Deep Q-Learning, and Proximal Policy Optimization. In this paper, we also experiment with three algorithms, namely SAC, DDPG, and Twin Delayed DDPG (TD3 [7]), for training a stock trading agent. The remainder of the paper is organized as follows. In Sect. 2, we describe our simulation of the stock trading market as a Markov Decision Process. In Sect. 3, we summarize the RL framework and the three RL algorithms. In Sect. 4, we describe our experimental setups and evaluate the obtained results. Finally, we conclude the paper in Sect. 5.
2 The Simulated World of Stock Markets 2.1 Stock Market as an MDP Framework The stock trading process can be (simplified and) formulated as a Markov Decision Process (MDP) [13]. The overall trading process is briefly described in Fig. 1.

Fig. 1 Stock trading process [13]

An MDP for the stock trading problem is characterized by the following features:
• State s = [p, h, b]: a collection where p (p ∈ Z_+^D) is the set of adjusted close prices, i.e., close prices that have been cleansed of the effects of dividends, stock splits, and new stock offerings. Choosing adjusted close prices for the simulation is obvious
because they truly reflect the market situation at that time, whereas the plain close price is just the cash number at the end of the day; h (h ∈ R_+^D) is the set of current holdings of the available stocks; b (b ∈ R) is the remaining balance. Note that D is the number of stocks in the simulation.
• Action a: a list of actions performed on all D stocks. The available actions for each stock include selling, buying, and holding, which result in decreasing, increasing, and no change of the holdings h, respectively:
– Buying k shares of a stock increases the holding of that stock: h_{t+1} = h_t + k.
– Selling k shares of a stock decreases the holding of that stock: h_{t+1} = h_t − k, where k cannot exceed the current holding of that stock.
– Holding (k = 0): no change in the holding h.
• Reward r(s, a, s′): the change in total asset value when taking action a in state s and arriving at state s′. The total asset value is defined as p^T h.
• Policy π(s): the trading strategy at state s. It is essentially a probability distribution over actions a at state s.
• Action-value function Q^π(s, a): the expected reward achieved by performing action a at state s and then following policy π.
2.2 Agent in the Simulated World In this stock world, the agent should learn to distinguish and avoid bad decisions as much as possible. There is a constraint governing the way our agent behaves: the combination of selling and buying actions at time t should not result in a negative remaining balance. The remaining balance is updated according to the equation b_{t+1} = b_t + p^T h. At the beginning, the agent is given an initial budget of b_0. All holdings and Q(s, a) values are initialized to 0, prices are set to those of the first trading day, and the policy
288
H. T. Nguyen and N. H. Luong
π(s) is uniformly distributed among all actions for any state. The agent's task is to keep interacting with the environment to update its Q(s, a) values and eventually learn a good policy that maximizes an optimization goal.
2.3 Optimization Goals in the Simulated World The objective of real-world traders could be formulated as maximizing the accumulated profit at the end of a trading period. Therefore, we define the accumulated change in asset values between trading days as our optimization goal, which can be described through the following expression: $\sum_{t=0}^{T} r(s_t, a_t, s_{t+1})$. Note that more elaborate reward function designs that potentially benefit the learning process exist but are outside the scope of this paper. Modeling a highly complicated and dynamic environment like the financial trading world is certainly non-trivial. In this paper, for the sake of tractability, our simulations are carried out at a macro level that omits broker transaction fees and real-time data. We argue that micro-second stock historical data are only meaningful to day traders and would incur larger computational burdens while yielding little change to our analysis.
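To make the state, action, and reward structure above concrete, a much-simplified OpenAI Gym environment is sketched below. All bounds, the synthetic price data, and the inclusion of the cash balance in the asset value are assumptions of this sketch, not the authors' implementation.

```python
# Simplified sketch of the trading MDP as a Gym environment (illustrative only).
import gym
import numpy as np
from gym import spaces

class SimpleTradingEnv(gym.Env):
    def __init__(self, prices, initial_balance=1_000_000, max_shares=100):
        super().__init__()
        self.prices = prices                      # shape (T, D): daily prices of D stocks
        self.initial_balance = initial_balance
        self.max_shares = max_shares
        D = prices.shape[1]
        # action[i] in [-1, 1] is scaled to sell/buy up to max_shares of stock i
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(D,), dtype=np.float32)
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(2 * D + 1,), dtype=np.float32)

    def _obs(self):
        return np.concatenate([self.prices[self.t], self.holdings, [self.balance]]).astype(np.float32)

    def _asset(self):
        # total asset value; including the cash balance here is an assumption of this sketch
        return self.balance + float(self.prices[self.t] @ self.holdings)

    def reset(self):
        self.t = 0
        self.balance = float(self.initial_balance)
        self.holdings = np.zeros(self.prices.shape[1])
        return self._obs()

    def step(self, action):
        before = self._asset()
        shares = np.floor(np.asarray(action) * self.max_shares)
        for i, k in enumerate(shares):
            price = self.prices[self.t, i]
            if k < 0:                             # sell, limited by current holdings
                k = max(k, -self.holdings[i])
            elif k > 0:                           # buy, limited by the remaining balance
                k = min(k, self.balance // price)
            self.holdings[i] += k
            self.balance -= k * price
        self.t += 1
        done = self.t >= len(self.prices) - 1
        reward = self._asset() - before           # change in total asset value
        return self._obs(), reward, done, {}

env = SimpleTradingEnv(prices=np.random.RandomState(0).uniform(10, 50, size=(250, 3)))
obs = env.reset()
```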
3 A Deep Reinforcement Learning Trading Method 3.1 A Brief Summary of Reinforcement Learning Reinforcement Learning (RL) studies how to computationally model "learning from trial and error". RL tasks are often formalized as a Markov Decision Process (MDP), or loosely speaking, an Agent-Environment Interface [12]. In the MDP framework, an agent (decision maker) repeatedly takes in its state and the reward from its last action, then chooses one action from the action space in response to the environment, changing its state and rewards. The process is described in Fig. 2.

Fig. 2 Agent-Environment
To further define an RL task, we consider three important elements:
• Policy: a function π mapping the agent's input states to the agent's actions. In some cases, a policy can return either a single action for a corresponding state (deterministic policy) or a probability distribution over the action space (stochastic policy).
• Reward signal: the goal of an RL task, defining how an agent behaves in the environment. The reward signal can be a real number that is low (or negative) if the action taken is bad and high (or positive) if the action taken is good.
• Value function: a function mapping a state to its corresponding value. The value of a state refers to the amount of expected reward (determining how good a given state is in the future) received over the agent's lifetime if the agent follows a policy π.
3.2 Motivation for Applying Deep Reinforcement Learning Deep Reinforcement Learning (DRL) techniques combine the learning-by-experience, trial-and-error methodology of RL with the generalization power of deep artificial neural networks. DRL has proven to be a promising optimization method in dynamic and constantly evolving environments, such as robotics, power management, and autonomous driving. Therefore, DRL techniques can probably be a good match for the rapidly evolving and turbulent world of financial markets with complex state spaces. Given the MDP framework defined in Sect. 2, DRL techniques for continuous action control appear to be appropriate due to the large dimensionality of the action space. Xiong et al. [13] addressed the stock trading problem with the Deep Deterministic Policy Gradient (DDPG) and showed that DDPG could outperform the traditional Min-Variance portfolio allocation and Dow Jones Industrial Average methods. Fujimoto et al. [7] showed that the Actor-Critic network used in DDPG generates function approximation errors which might lead to overestimation bias and suboptimal policies. The Twin Delayed DDPG (TD3 [7]) algorithm was then proposed to address these problems. In addition, Soft Actor-Critic (SAC) [8] is a concurrent, state-of-the-art work that is also proposed to deal with RL problems that have large action spaces. SAC aims to maximize an additional term, i.e., entropy, in the objective function as a way to enable wider exploration. This paper continues to explore the potential of these Deep RL algorithms for stock trading problems. Since all three algorithms have an Actor-Critic network in common, its notion will be briefly described in the next section (Fig. 3).

Fig. 3 An actor-critic network
3.3 Actor-Critic Network Deriving from the Policy Gradient Theorem [12], the Actor-Critic algorithm simultaneously learns a parameterized policy π(s) through the actor and calculates the Q-values for the corresponding state-action pairs. This algorithmic framework consists of two parts: Policy Gradient (the actor) and Q-Learning (the critic). The actor, which takes in
states and generates the policy, can be trained through the deterministic policy gradient algorithm [11]:

$$\nabla_\phi J(\phi) = \mathbb{E}_{s \sim \rho^\pi}\left[\nabla_a Q^\pi(s, a)\big|_{a=\pi(s)} \nabla_\phi \pi_\phi(s)\right]$$

The Q-value in the Q-Learning part of the critic can be learned using temporal difference learning, updated based on the Bellman Equation [12]:

$$Q^\pi(s, a) = r + \gamma\, \mathbb{E}_{s', a'}\left[Q^\pi(s', a')\right], \quad a' \sim \pi(s')$$

The Actor-Critic framework can be applied in both discrete and continuous action spaces. For a continuous action space, a Gaussian policy and a differentiable function approximator Q(s, a) are applied.
3.4 Twin Delayed Deep Deterministic Policy Gradient (TD3) The TD3 algorithm [7] maintains most of the DDPG framework and further enhances it by incorporating a double Q-learning network and a delayed policy update technique to counter the bias error accumulated through a faulty Q-learning process. The TD3 algorithm applied to our stock trading problem is described in Algorithm 1.
3.5 Soft Actor-Critic (SAC) Similar to TD3 and DDPG, Soft Actor-Critic [8] is an off-policy Deep Reinforcement Learning algorithm that makes use of the Actor-Critic architecture and is primarily built for large action space problems. TD3 and Soft Actor-Critic are two concurrent works that improve the popular DDPG algorithm. The core idea of Soft Actor-Critic, and its main difference from other continuous control Deep RL algorithms, is entropy regularization. This method is known for encouraging wider exploration while maintaining decent exploitation. SAC has been developed into different versions, and the implementation used in this paper is drawn from the Stable Baselines library [9]. The pseudo-code of SAC (modified from [1, 8, 13]) for automated stock trading is presented in Algorithm 2.
Algorithm 1 Twin Delayed DDPG (modified from [1, 7, 13])
Initialize actor $\mu$ with random weights $\theta$, critic networks $Q_1$, $Q_2$ with random weights $\phi_1$, $\phi_2$.
Initialize target networks $\mu'$, $Q'_1$, $Q'_2$ with weights $\theta' \leftarrow \theta$, $\phi'_1 \leftarrow \phi_1$, $\phi'_2 \leftarrow \phi_2$, respectively.
Initialize empty replay buffer $D$.
for each episode (an episode involves a complete trading period) do
  Obtain initial state $s_0$.
  for t = 1 to Last trading day of the current episode do
    Observe state $s$ and select action $a$: which stock and how many shares to buy/sell/hold. "clip" is used to ensure that the output action lies in the given action range $a_{low} \le a \le a_{high}$; $\varepsilon$ is the exploration noise:
      $a = \mathrm{clip}(\mu_\theta(s) + \varepsilon, a_{low}, a_{high})$, where $\varepsilon \sim N(0, \sigma)$
    Execute action $a$, observe next state $s'$ and reward $r$.
    Store transition $(s, a, r, s')$ in the replay buffer: $D \leftarrow D \cup \{(s, a, r, s')\}$.
    Randomly sample a batch of transitions $B = \{(s, a, r, s')\}$ from the replay buffer $D$.
    Compute the target action, where $c > 0$ is the limit for the absolute value of the noise:
      $a'(s') = \mathrm{clip}(\mu_{\theta'}(s') + \mathrm{clip}(\varepsilon, -c, c), a_{low}, a_{high})$, where $\varepsilon \sim N(0, \sigma)$
    Compute the target by taking the minimum between the two Q-value functions:
      $y(r, s') = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', a'(s'))$
    Update the critic networks by one step of gradient descent with respect to the target:
      $\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s,a,r,s') \in B} \left( Q_{\phi_i}(s, a) - y(r, s') \right)^2$, for $i = 1, 2$
    if t mod policy_delay = 0 then
      Update the actor by one step of gradient ascent:
        $\nabla_\theta \frac{1}{|B|} \sum_{s \in B} Q_{\phi_1}(s, \mu_\theta(s))$
      Update the parameters of the target networks:
        $\phi'_i \leftarrow \rho \phi'_i + (1 - \rho)\phi_i$, for $i = 1, 2$
        $\theta' \leftarrow \rho \theta' + (1 - \rho)\theta$
    end if
  end for
end for
Algorithm 2 Soft Actor-Critic (modified from [1, 8, 13])
Initialize actor $\pi$ with random weights $\theta$, critic networks $Q_1$, $Q_2$ with random weights $\phi_1$, $\phi_2$.
Initialize target networks $Q'_1$, $Q'_2$ with weights $\phi'_1 \leftarrow \phi_1$, $\phi'_2 \leftarrow \phi_2$, respectively.
Initialize empty replay buffer $D$.
for each episode (an episode involves a complete trading period) do
  Obtain initial state $s_0$.
  for t = 1 to Last trading day of the current episode do
    Observe state $s$ and select action $a$: which stock and how many shares to buy/sell/hold:
      $a \sim \pi_\theta(\cdot|s)$
    Execute action $a$, observe next state $s'$ and reward $r$.
    Store transition $(s, a, r, s')$ in the replay buffer: $D \leftarrow D \cup \{(s, a, r, s')\}$.
    Randomly sample a batch of transitions $B = \{(s, a, r, s')\}$ from the replay buffer $D$.
    Compute the target by taking the minimum between the two Q-value functions, incorporating SAC's entropy regularization:
      $y(r, s') = r + \gamma \left( \min_{i=1,2} Q_{\phi'_i}(s', a') - \alpha \log \pi_\theta(a'|s') \right)$, where $a' \sim \pi_\theta(\cdot|s')$
    Update the critic networks by one step of gradient descent with respect to the target:
      $\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s,a,r,s') \in B} \left( Q_{\phi_i}(s, a) - y(r, s') \right)^2$, for $i = 1, 2$
    Update the actor by one step of gradient ascent:
      $\nabla_\theta \frac{1}{|B|} \sum_{s \in B} \left( \min_{i=1,2} Q_{\phi_i}(s, \tilde{a}_\theta(s)) - \alpha \log \pi_\theta(\tilde{a}_\theta(s)|s) \right)$
    Update the parameters of the target networks:
      $\phi'_i \leftarrow \rho \phi'_i + (1 - \rho)\phi_i$, for $i = 1, 2$
  end for
end for
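To make the entropy-regularized target in Algorithm 2 concrete, here is a small helper sketch (hypothetical tensor names, PyTorch assumed) that computes $y(r, s')$ for a batch of transitions.

```python
import torch

def sac_target(r, q1_targ, q2_targ, logp_next, gamma=0.99, alpha=0.2):
    """Compute y(r, s') = r + gamma * (min(Q'_1, Q'_2) - alpha * log pi(a'|s')).

    q1_targ and q2_targ are the target critics evaluated at (s', a') with
    a' ~ pi_theta(.|s'); logp_next is log pi_theta(a'|s')."""
    return r + gamma * (torch.min(q1_targ, q2_targ) - alpha * logp_next)
```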
4 Experimental Setup and Evaluation Method
In this work, RL agents are trained and tested on two datasets, one from the U.S. market and one from the Vietnam market. Given an initial budget, the agents then compete against a random agent and against each other in balancing profit and loss. The agents are trained, validated, and tested in a walk-forward manner.
Fig. 4 Walk-forward validation [13]: a timeline over 2011-2020 divided into train, validate & re-train, and test stages
4.1 Data Preparation and Necessary Toolkits
For the U.S. equity market, a high-quality historical financial dataset is obtained from Kaggle,1 and 42 stocks are chosen to cover all kinds of market situations such as up-trend, down-trend, high volatility, and low volatility. For the Vietnam market, relatively reliable data of 19 companies from the VN30 and HNX30 components are collected and cleansed. Finally, 9 years (1/1/2011-31/12/2019) of daily data are extracted and ready to use. Stable Baselines [9] and OpenAI Gym [5] are employed in this experiment. Stable Baselines is a library containing state-of-the-art implementations of Deep RL algorithms, including DDPG, TD3 and SAC. OpenAI Gym is an interface for training and testing RL agents.
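Since the experiments rely on Stable Baselines and OpenAI Gym, a minimal training sketch could look as follows; `StockTradingEnv-v0` is a hypothetical Gym environment wrapping the price data (assumed to be registered beforehand), and the hyperparameters are library defaults rather than the authors' settings.

```python
import gym
from stable_baselines import DDPG, TD3, SAC

# Hypothetical custom trading environment, assumed to be registered with Gym.
env = gym.make("StockTradingEnv-v0")

for algo in (DDPG, TD3, SAC):
    model = algo("MlpPolicy", env, verbose=0)   # default MLP actor-critic
    model.learn(total_timesteps=10_000)         # budget used in Sect. 4.3
    model.save("{}_stock_trader".format(algo.__name__))
```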
4.2 Experimental Setup and Walk-Forward Cross-validation
Based on [13], we first build two stock market environments, for U.S. stocks and Vietnam stocks, by constructing two adjusted-close price vectors of daily data (from 2011-01-01 to 2019-12-31) drawn from the 42 U.S. stocks and the 19 Vietnam stocks. For training and evaluating our agents, we adopt the walk-forward validation method for time-series data such as stock prices. The technique has been studied in [6], which favors walk-forward over K-fold validation and the bootstrap method. The walk-forward technique involves dividing the data into chronologically sorted parts, each of which acts as test data while the previous parts are training data. More specifically, we divide our data into eight stages: the first stage spans a period of two years (i.e., from 2011-01-01 to 2012-12-31) and the next seven stages each span a period of one year. After the initial training, the agent sells all the stock at the end of the year and uses that money as the initial budget for the next validation stage. The last stage is the holdout and true testing stage, on which our model has never been trained before (Fig. 4).
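The staged splits can be expressed directly in code; the sketch below (assumed pandas date-indexed price frame, not the authors' implementation) yields the walk-forward stages that follow the initial two-year training stage.

```python
import pandas as pd

def walk_forward_stages(prices: pd.DataFrame):
    """Yield (train, test) pairs for the seven yearly stages that follow the
    initial two-year (2011-2012) training stage."""
    ends = ["{}-12-31".format(year) for year in range(2012, 2020)]
    for i in range(1, len(ends)):
        train = prices.loc[:ends[i - 1]]          # all data seen so far
        test = prices.loc[ends[i - 1]:ends[i]]    # the next one-year stage
        yield train, test
```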
4.3 Results
We train our models with 10,000 time steps each, for both the initial training and each re-training. Figure 5 shows our results for the U.S. market and the Vietnam market.
1 https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs.
Fig. 5 Asset returns for simulated U.S. Stock Trading
Table 1 US stock experiment results

                          DDPG     TD3      SAC      Random
Initial budget (USD)      10000    10000    10000    10000
Final asset (USD)         20317    20525    20102    17048
Annualized return (%)     8.15     8.31     7.97     6.13
The results are averaged over 30 runs with different random seeds and compared to a random trader with the same action space (for each day, randomly trade 0-5 shares).
US Stocks. Figure 5 and Table 1 show that the RL agents perform competently in the US stock environment. The three agents do not differ much in trading behaviour and all generate decent profits. These results are consistent with those collected by [13], where DDPG was compared with traditional trading strategies using the 30 component stocks of the Dow Jones Industrial Average. The stocks chosen for our experiment resemble the Dow 30 components in that they mostly consist of up-trend stocks. Since the agents only explore during the training phases and stick to the profitable strategies they have discovered during the testing phase, it is understandable that profits increase considerably in the experiments. The results obtained show that the RL agents clearly outperform the random trader.
Vietnam Stocks. The situation is trickier and more unstable for the Vietnamese dataset, in which the dominance of up-trend stocks no longer prevails. Because our state space only comprises stock prices and no other feature with predictive power, our RL agents trade quite naively: they keep the greedy strategy previously discovered in the training phase with little concern for the changing environment in the test phase. Thus, the RL agents experience huge losses over the trading period (Fig. 6 and Table 2).
Fig. 6 Asset returns for simulated VN Stock Trading

Table 2 VN stock experiment results

                                  DDPG     TD3      SAC      Random
Initial budget (thousand VND)     200000   200000   200000   200000
Final asset (thousand VND)        85795    83310    99167    87823
Annualized return (%)             −9.04    −8.94    −7.49    −8.31
There are many options to address this problem, including introducing a few technical indicators used by technical traders, or other exogenous factors such as market news and annual financial reports. We adopt here a simple but effective remedy that acts as an external aid for the RL agents: a stop-loss order. In our trading environment, we define a stop-loss as an order to sell all holdings of a specific stock, triggered when the price of that stock falls under a threshold; the threshold is reached when the current price falls more than 10% below the last buy price. Also, a stock that triggers the stop-loss order will not be bought again until the end of the year (a sketch of this rule is given after Table 3). This strategy potentially helps our RL agents escape economic crashes. Table 3 shows the results achieved by the agents with the stop-loss order. Although, on average, the RL agents still lose certain amounts of money over the years, they manage to minimize the losses despite the presence of many down-trend stocks during the period.
Table 3 VN stock experiment results (with stop-loss)

                                  DDPG     TD3      SAC      Random
Initial budget (thousand VND)     200000   200000   200000   200000
Final asset (thousand VND)        208725   200664   210641   87823
Annualized return (%)             −0.16    −0.88    −0.42    −8.31
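As referenced above, a minimal sketch of the stop-loss rule (our reading of the description; names are illustrative, not the authors' code) is:

```python
def stop_loss_triggered(current_price: float, last_buy_price: float,
                        threshold: float = 0.10) -> bool:
    """True once the price has fallen more than `threshold` (10%) below the
    last buy price; the environment would then sell all holdings of the stock
    and block further buys of it until the end of the year."""
    return current_price < (1.0 - threshold) * last_buy_price
```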
5 Conclusions
In this paper, we have experimented with training an autonomous stock trading bot using Reinforcement Learning (RL) on two markets simulated from US and Vietnam datasets. We compared three RL algorithms designed for continuous control problems, SAC, TD3, and DDPG, against a random agent. For the simulated US market, all three RL agents achieved positive annualized returns and significantly outperformed the random agent. For the simulated Vietnam market, due to the down-trend situation during the period in which the data were collected, the RL agents needed a small exogenous support (i.e., a stop-loss order) to minimize their losses. Extensive work is required before the RL agents investigated in this paper could be employed for real-world stock trading. For future research, more elaborate models of the stock markets, which include transaction fees and a wider range of trading options, should be constructed. Furthermore, information such as technical indicators and stock market news could be integrated to better inform the RL agents, potentially helping the agents earn more profits and be more robust against market crashes.
References 1. Achiam, J.: Spinning up in deep reinforcement learning (2018) 2. Bacoyannis, V., Glukhov, V.S., Jin, T., Kochems, J., Song, D.R.: Idiosyncrasies and challenges of data driven learning in electronic trading. arXiv: Trading and Market Microstructure (2018) 3. Bigiotti, A., Navarra, A.: Optimizing automated trading systems. In: Antipova, T., Rocha, A. (eds.) Digital Science, pp. 254–261. Springer International Publishing, Cham (2019) 4. Breiman, L.: Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat. Sci. 16(3), 199–231 (2001). https://doi.org/10.1214/ss/1009213726 5. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym (2016) 6. Falessi, D., Narayana, L., Thai, J.F., Turhan, B.: Preserving order of data when validating defect prediction models. CoRR abs/1809.01510 (2018). http://arxiv.org/abs/1809.01510 7. Fujimoto, S., van Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. arXiv:1802.09477 (2018) 8. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR abs/1801.01290 (2018). http://arxiv. org/abs/1801.01290
9. Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y.: Stable baselines (2018). https://github.com/hill-a/stable-baselines 10. Ponomarev, E., Oseledets, I.V., Cichocki, A.: Using reinforcement learning in the algorithmic trading problem. arXiv:2002.11523 (2020) 11. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: 31st International Conference on Machine Learning, ICML 2014, vol. 1 (2014) 12. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, second edn. The MIT Press (2018). http://incompleteideas.net/book/the-book-2nd.html 13. Xiong, Z., Liu, X., Zhong, S., Yang, H., Walid, A.: Practical deep reinforcement learning approach for stock trading. CoRR abs/1811.07522 (2018). http://arxiv.org/abs/1811.07522 14. Yuan, Y., Wen, W.Y.J.: Using data augmentation based reinforcement learning for daily stock trading. Electronics 2020 abs/1811.07522 (2020)
Telecommunications Services Revenue Forecast Using Neural Networks Quoc-Dinh Truong, Nam Van Nguyen, Thuy Thi Tran, and Hai Thanh Nguyen
Abstract Forecasting is a very important task not only for businesses but also in other fields. There are numerous quantities to forecast in business, but determining future revenue (revenue forecasting) is one of the most crucial tasks, so that leaders can propose appropriate policies and decisions to optimize production and business activities. Revenue forecasting is a complex problem and requires the use of many different methods and techniques to achieve the highest accuracy. This study analyzes approaches for selecting neural network models, highlights the necessary processing steps, and leverages advancements in deep learning for revenue forecasting on a set of monthly and quarterly revenue data generated from 2013 to June 2019 in 9 regions of Tra Vinh province. The considered telecommunication service groups, including Internet services, the MyTV service, the landline phone service and the postpaid mobile service, are taken into account for the revenue forecasting tasks with the deep learning techniques. The proposed method achieves promising results and is already deployed in practical use at VNPT Tra Vinh.
Q.-D. Truong (B) · H. T. Nguyen College of Information and Communication Technology, Can Tho University, Can Tho, Vietnam e-mail: [email protected] H. T. Nguyen e-mail: [email protected] T. Thi Tran Cuu Long University, Vinh Long, Vietnam N. Van Nguyen VNPT Tra Vinh, Tra Vinh, Vietnam © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_26
1 Introduction
In a market economy, the competition among enterprises is becoming fiercer, and the competitive advantage always belongs to those enterprises which fully and promptly capture information and exploit it effectively to keep up with technological advancements. Today, with the strong development of information technology, successful businesses have been constantly investing in data management and exploitation tools at different levels. At a simple level, tools are built on information from available management software to analyze the financial situation and produce financial reports. At a higher level, information technology achievements, including machine learning, are applied to effectively exploit data, thereby helping managers analyze, manage, and forecast business results in order to optimize profits and set strategic policies. VNPT Tra Vinh is a company of the Vietnam Posts and Telecommunications Group, one of the leading suppliers in the field of telecommunications and information technology in the province. Targeted values (BSC) are expressed as specific goals (KPOs) through measurement indicators (KPIs) and specific targets that have been assigned to each of its subsidiaries and employees. Besides, the unit has also deployed a number of software applications to support production and business management, such as a quick reporting system, a financial reporting system, an asset management system, a customer management system, a subscriber development system, and a billing and invoice issuance system. However, the management of business results is still fragmented, due to the use of many different systems. In particular, forecasting revenue for each locality is still done by manual estimation. Therefore, this work is carried out to support forecasting the revenue of each locality quickly, accurately and transparently. The research results are expected to positively support the forecasting of telecommunication services revenue and can help managers to make policies and adjust management practices appropriately to optimize production and business activities. In addition, this work helps to promote the application of information technology in management and business activities at VNPT Tra Vinh.
2 Related Work
Forecasting is a very broad topic, and numerous studies have proposed forecasting models for domestic and foreign applications. In 2014, the Master's thesis of Hoang Tuan Ninh [1], "Application of data classification and regression techniques to forecast production and business data for VNPT", built models to support the analysis, synthesis and processing of data warehouses. The proposed method can be integrated with other business management support applications and uses the YALE toolkit to train a
neural network for forecasting each service of each province, with an average accuracy of 90% to 96% depending on the type of service, against a target error of 10%. In 2014, the Master's thesis of Nguyen Duc Anh [2] introduced the study "Researching and exploiting data for forecasting customers likely to leave the VNPT network"; the work predicted whether customers are at risk of leaving the network and thereby affecting sales. "Sales forecasting using neural networks" by Frank M. Thiesing and Oliver Vornberger [4] presented a model based on neural networks to predict revenue. This research designed and implemented a forecasting system using a neural network, which increased management efficiency and forecasted the number of goods sold by a supermarket company in Germany. In this article, the authors also used naive prediction and statistical prediction to compare the results among the three methods; the results showed that the neural network was better than the naive and statistics-based predictions. This approach is currently being used in supermarkets. However, the scope of the study is forecasting for each product or product group, not forecasting revenue for each specific geographical area. "Forecasting of sales by using fusion of machine learning techniques" by Mohit Gurnani, Yogesh Korkey, Prachi Shahz, Sandeep Udmalex, Vijay Sambhe, and Sunil Bhirudk was presented at the 2017 International Conference on Data Management, Analytics and Innovation in India [3]. This paper reviewed and compared different machine learning models, including the Automatic Tuning Neural Network (ARNN), Extreme Gradient Boosting (XGBoost) and the Support Vector Machine (SVM), some hybrid models such as ARIMA-ARNN, ARIMA-XGBoost and ARIMA-SVM, and a hybrid STL analysis, to forecast the revenue of a pharmaceutical company named Rossmann. This work was geared towards evaluating and analyzing the results of the revenue forecasting problem between models and explaining the reasons for their accuracy. Some studies [5-10] also provided statistics-based and machine learning-based tools for forecasting sales. The above studies are predictive tools that forecast revenue for a product and compare and evaluate different forecasting models. Nguyen et al. [2] studied the risk of customers leaving the network, which is a factor affecting revenue, but they did not forecast revenue for each specific geographical area. Therefore, the above studies cannot be applied directly at VNPT Tra Vinh to support the decision-making process, forecast the goals, modify the strategy to improve revenue, and assign performance targets to each area.
3 The Proposed Method for Revenue Forecast
In this section, we propose a method based on deep learning models to predict the revenue of telecommunications services. First, we present the data analysis carried out to collect and prepare the data necessary for the revenue forecast. Model selection for the collected data is also introduced. The results are assessed with different metrics, including the Mean Absolute Percentage Error, the Mean Absolute Error and the Root Mean Square Error. In order to predict revenue, we prepare the data
Fig. 1 Data collection from various systems
for forecasting based on time series, which can be consecutive months or quarters. Then, the dataset is used to train different neural networks, and the best model for forecasting is selected using three measurements: the Mean Absolute Percentage Error (MAPE), the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE). We also show an illustration of the revenue forecast system web interface, which is currently deployed at VNPT Tra Vinh.
3.1 Input Determination
Successfully designing a machine learning model for a prediction task depends heavily on understanding the problem. In addition, we need to know which input data are important for the quantity to be predicted. This study aims to forecast revenue at different time frequencies, such as monthly and quarterly, so the input data source can be the revenue of each month or each quarter. The input data can be:
• the list of existing customers over the months;
• the monthly and quarterly charge amounts of each customer;
• the monthly debt collection ratio of each customer;
• customer satisfaction with care, installation, repair and relocation;
• the number of redeemed subscribers assigned to the locality;
• the response capacity of the network and the transmission line attenuation index.
3.2 Data Collection
The different types of data identified above can be collected directly from existing information systems (Fig. 1). The data should be checked for errors by examining their changes over time. The accuracy of the forecasting results depends on the data collected; the data can be gathered from the systems that the unit is already using, or from outside sources. For the revenue forecasting problem of telecommunication services, we collect data from the existing management software systems at the unit, such as the billing, debt management, customer management, line testing, and customer care systems.
The list of customers: collected from the subscriber development system and the customer management system, which are updated daily by the salesmen at the sales offices. Customer information is accompanied by service information.
The amount of money (in VND) customers paid each month in a considered area: this data source is collected from the charging system; the collected data are recalculated to obtain the revenue for each type of considered service in each of the 9 considered areas.
Debt collection ratio: the data come from the debt management system and the invoice issuance system; monthly invoice issuance data and monthly and quarterly debt data are collected to calculate the debt collection ratio in each locality.
Customers' satisfaction: data from the customer care system are collected through outbound calls after customers have signed new contract annexes, such as a new installation or service changes, or after the technical team has completed an installation or finished troubleshooting a reported problem. From the outbound results, customer satisfaction is calculated as the percentage of satisfied customers over the total number of customers cared for and surveyed.
The number of redeemed subscribers for each area in charge: collected from the system for contracting and managing the area, and used to evaluate the number of subscribers per area as well as the scope and density of subscribers, from which the service capability of employees in the assigned area is assessed; this assessment is conducted in both the technical and business divisions.
Response capacity of the network and transmission attenuation index: collected from the transmission line test system, with a weekly scanning frequency. These data are recalculated monthly and quarterly by taking the average over one month or quarter.
We investigated and collected data from 9 areas in Tra Vinh province (Chau Thanh, Tra Vinh city, Cang Long, Tieu Can, Cau Ke, Cau Ngang, Tra Cu, Duyen Hai district and Duyen Hai town). Each area includes 78 samples with the features described above. For the experiments, we considered 702 samples run with 72 neural network models to select the best model for each case.
Fig. 2 Training and test sets division based on time series
3.3 Determining the Forecast Period
Depending on the specific forecasting problem in each case, we determine the forecast period, such as a month, a quarter, or a year.
3.4 Data Preprocessing
Before starting any machine learning problem on revenue forecasting, preprocessing the data is crucial to improve the prediction performance. If the data set is not standard, we need to perform preprocessing steps such as sampling the data, handling missing values, adjusting noisy data, and transforming the data into an appropriate format so that patterns can be learned better.
3.5 Training and Test Sets Division for Model Learning and Evaluation
With the historical time-series data collected and preprocessed as above, we divide the data set into 2 separate sets: a training set [t1, t2] and a test set [t2, t3], as shown in Fig. 2. The time series can consist of months or quarters. With the data set over the defined time, we determine the number of neurons and the transfer function, normalize the input data to the lower and upper ranges of the transfer function, and train the forecasting model using a neural network learning algorithm, in order to obtain the forecasting model with the highest reliability within the application context and scope of this work. To evaluate model performance, we use 3 metrics: the Mean Absolute Percentage Error, the Mean Absolute Error and the Root Mean Square Error.
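A small sketch (NumPy, assumed shapes; not the authors' code) of turning a revenue series into supervised samples of n past values and splitting them chronologically into the [t1, t2] training and [t2, t3] test parts:

```python
import numpy as np

def make_windows(series: np.ndarray, n: int):
    """Each sample uses n consecutive past values to predict the next one."""
    X = np.array([series[i:i + n] for i in range(len(series) - n)])
    y = series[n:]
    return X, y

def chrono_split(X, y, test_ratio=0.2):
    """Keep the earliest samples for training, the latest for testing."""
    cut = int(len(X) * (1 - test_ratio))
    return (X[:cut], y[:cut]), (X[cut:], y[cut:])
```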
4 Experimental Results

4.1 Metrics for Evaluation
An important way to find out whether a forecast value is close to the actual value is the forecast error, which is the difference between the forecast value and the actual value. A forecasting model is considered good when the forecast error is small. The forecast error is determined by the formula:

$e_t = X_t - \hat{X}_t$   (1)

where:
• $e_t$ is the forecast error at time t;
• $X_t$ is the real value at time t;
• $\hat{X}_t$ is the forecast value at time t.

We measure the revenue forecast performance with the following errors, which quantify predictive accuracy.
• Mean Absolute Percentage Error:

$MAPE = 100 \cdot \frac{1}{n}\sum_{t=1}^{n} \frac{|e_t|}{X_t}$   (2)

The mean absolute percentage error (MAPE) is a statistical measure of the accuracy of a forecasting system. It measures the accuracy of the forecast as a percentage, an indicator of the magnitude of the error of the forecast value compared to the actual value. This method is particularly useful when $X_t$ is of great value.
• Mean Absolute Error:

$MAE = \frac{1}{n}\sum_{t=1}^{n} |e_t|$   (3)

When the predicted and actual values are in the same unit, MAE is a very useful measure for calculating forecast errors.
• Root Mean Square Error:

$RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n} e_t^2}$   (4)
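The three error measures of Eqs. (2)-(4) translate directly into code; `actual` and `forecast` are equal-length arrays of real and forecast revenue.

```python
import numpy as np

def mape(actual, forecast):
    return 100.0 * np.mean(np.abs(actual - forecast) / actual)   # Eq. (2)

def mae(actual, forecast):
    return np.mean(np.abs(actual - forecast))                    # Eq. (3)

def rmse(actual, forecast):
    return np.sqrt(np.mean((actual - forecast) ** 2))            # Eq. (4)
```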
4.2 Neural Networks Setting for Forecasting
This study implements a three-layer neural network including one input layer, one hidden layer and one output layer with 1 output neuron, which is the revenue prediction
for the specific forecasting problem. For forecasting the revenue of Internet services and MyTV, the number of hidden neurons starts at n × 2, where n is the number of preceding consecutive months (or quarters) used as input for the prediction. For the other services, the number of neurons starts at n × 3. We increase from 3 to 12 input units, corresponding to using 3 to 12 months of past revenue to forecast the revenue of the next month. Likewise, an increase from 2 to 8 input units corresponds to using the revenue of 2 to 8 past quarters to forecast the next quarter.
4.3 Revenue Forecasting
This section presents the forecast results based on the selected optimal neural network models for 4 typical services of VNPT Tra Vinh: the Internet group of services (FTTH + MegaVNN + private channel), the MyTV service, the landline phone service and the postpaid mobile service.
4.3.1 Model Selection
We increase the number of neurons until the stopping condition is met. This work stops the neural network training when either of two conditions is reached: the loss (fault tolerance) falls below 5 × 10−5, or the number of epochs reaches 2000.
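The paper does not name the library used; as one possible realization, the sketch below (tf.keras and a sigmoid transfer function are assumptions) trains a single candidate network with h hidden neurons under the two stopping conditions stated above.

```python
import tensorflow as tf

class StopAtLoss(tf.keras.callbacks.Callback):
    """Stop as soon as the training loss falls below 5e-5."""
    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("loss", 1.0) < 5e-5:
            self.model.stop_training = True

def train_candidate(X_train, y_train, h):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(h, activation="sigmoid",
                              input_shape=(X_train.shape[1],)),   # hidden layer
        tf.keras.layers.Dense(1),                                 # revenue output
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X_train, y_train, epochs=2000,                      # hard epoch cap
              callbacks=[StopAtLoss()], verbose=0)
    return model
```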
Table 1 The best performance among all considered months and quarters of forecasting the revenue of Internet services. (1) reflects the number of preceding months (for monthly revenue forecasts) or quarters (for quarterly revenue forecasts) used as the training data for the forecast, while (2) reveals the number of neurons that reached the best performance. (1) and (2) are the same for the other tables

Regions           Months                            Quarters
                  (1)  (2)  MAPE  MAE    RMSE       (1)  (2)  MAPE  MAE    RMSE
Chau Thanh        7    19   0.98  32     40.15      7    19   0.98  32     40.15
Tra Vinh city     2    4    0.97  32.1   38.78      2    4    0.97  32.1   38.78
Cang Long         3    18   0.9   24.73  28.59      3    18   0.9   24.73  28.59
Tieu Can          8    25   0.77  20.52  24.75      8    25   0.77  20.52  24.75
Cau Ke            5    21   2.2   42.66  52.09      5    21   2.2   42.66  52.09
Cau Ngang         7    10   0.79  21.49  22.9       7    10   0.79  21.49  22.9
Tra Cu            7    23   0.88  25.25  27.4       7    23   0.88  25.25  27.4
Duyen Hai dist.   5    24   0.97  14.3   16.57      5    24   0.97  14.3   16.57
Duyen Hai town    3    8    0.96  17.01  19.37      3    8    0.96  17.01  19.37
Table 2 The best performance among all considered months and quarters of forecasting the revenue of MyTV service

Regions           Months                            Quarters
                  (1)  (2)  MAPE  MAE    RMSE       (1)  (2)  MAPE  MAE    RMSE
Chau Thanh        7    23   0.91  0.61   0.9        5    7    2.51  5.06   5.22
Tra Vinh city     10   22   0.48  0.21   0.27       3    12   0.86  1.17   1.25
Cang Long         9    19   0.39  0.21   0.37       4    16   0.61  0.96   1.6
Tieu Can          11   15   0.36  0.28   0.32       3    13   1.28  2.98   3.01
Cau Ke            5    10   0.27  0.14   0.19       4    10   1.37  2.15   2.23
Cau Ngang         5    12   0.46  0.47   0.53       3    4    0.93  2.83   3.3
Tra Cu            12   36   0.37  0.48   0.69       4    13   1.09  4.22   6.03
Duyen Hai dist.   12   31   0.69  0.3    0.34       6    14   1.45  1.94   2.07
Duyen Hai town    6    35   0.22  0.15   0.23       3    12   1.25  2.46   3.61
Table 3 The best performance among all considered months and quarters of forecasting the revenue of Landline phone service

Regions           Months                            Quarters
                  (1)  (2)  MAPE  MAE    RMSE       (1)  (2)  MAPE  MAE    RMSE
Chau Thanh        10   21   1.83  2.3    3.46       5    24   0.4   1.5    1.81
Tra Vinh city     12   50   1.23  7.38   8.81       3    23   0.79  14.6   18.43
Cang Long         3    33   1.17  1.43   1.78       4    17   1.09  4.19   4.68
Tieu Can          10   21   0.76  0.76   0.94       5    20   1.92  5.92   6.73
Cau Ke            7    15   2.89  3.15   4.47       5    23   9.49  30.72  37.08
Cau Ngang         7    19   2.98  2.86   3.36       5    24   2.22  6.59   8.01
Tra Cu            10   40   1.2   2.1    2.91       6    17   2.14  11.33  17.61
Duyen Hai dist.   7    22   1.62  2.85   3.51       3    23   1.66  9.29   14.17
Duyen Hai town    8    20   1.95  3.35   3.95       5    9    2.23  12.11  14.28
Table 4 The best performance among all considered months and quarters of forecasting the revenue of Postpaid mobile service

Regions           Months                            Quarters
                  (1)  (2)  MAPE  MAE    RMSE       (1)  (2)  MAPE  MAE    RMSE
Chau Thanh        11   28   2.15  1.4    1.92       4    18   1.73  3.58   3.78
Tra Vinh city     12   39   1.82  7.29   9.51       5    27   0.49  6.78   10.74
Cang Long         3    33   2.4   1.47   2.11       4    23   0.62  1.33   1.87
Tieu Can          10   51   2.3   1.63   2.3        6    27   1.19  2.65   3.81
Cau Ke            11   24   2.48  0.97   1.24       6    17   1.94  2.4    2.96
Cau Ngang         11   43   2.12  1.4    1.66       4    19   0.85  1.84   2.78
Tra Cu            12   26   2.95  2.35   2.54       6    26   1.59  4.02   4.58
Duyen Hai dist.   4    40   1.71  0.41   0.51       3    24   6.25  4.57   4.63
Duyen Hai town    8    45   3.13  1.87   2.48       4    15   3.22  5.74   8.08
Fig. 3 Forecast results and real revenue comparison at Cang Long and Cau Ke
4.3.2 Model Performance on Various Regions
Because the characteristics of each locality are not the same, the input parameters of the optimal models are not equal. We present the parameters selected for the optimal models of the revenue forecast problem in Tables 1, 2, 3 and 4. We found that, for the 72 optimal models selected from the trained models, the MAPE between the forecast values and the real revenue is 1.27% on average for the monthly forecasts and 1.68% for the quarterly forecasts. The results can be applied to the assignment of BSC targets, contracting, and the management of the localities at VNPT Tra Vinh. As observed in Figs. 3, 4, 5 and 6, we show comparisons between the forecast revenue and the real revenue for the 9 investigated areas. The deviation between forecast and real revenue is 2.34% for the landline phone service, 1.77% for the postpaid mobile service, 0.55% for the Internet service, and 0.46% for the MyTV service. On the charts, the landline phone, postpaid mobile and MyTV services have forecast lines closest to the real values, while for the Internet service the forecast line and the real values appear far apart; this is because Internet services account for a very high share of revenue, so even a small relative difference (with revenue measured in millions of VND) corresponds to a large absolute gap on the chart.
Fig. 4 Forecast results and real revenue comparison at Cau Ngang and Chau Thanh
Fig. 5 Forecast results and real revenue comparison at Duyen Hai town and Duyen Hai district
Fig. 6 Forecast results and real revenue comparison at Tieu Can, Tra Vinh, and Tra Cu
4.3.3 Evaluation of Experts on Revenue Forecast Models
In order to evaluate the system, we discussed the experimental results with experts at VNPT. They concluded that the results of this work can be used at the company. The experimental results are promising and show potential for real applications. The forecast system is also currently deployed at VNPT Tra Vinh at the website address http://113.163.202.9:8080 (shown in Fig. 7).
Fig. 7 An illustration of forecast system web interface
5 Conclusion
We presented a general model for the revenue forecasting problem at VNPT Tra Vinh. The empirical results and the consultation with experts show that the forecasting of telecommunication services revenue at VNPT Tra Vinh is highly usable in the process of operating the business. As future development directions, further studies should improve the data collection method and add more factors affecting revenue (for example, seasonal factors, trend factors, etc.). Future research can also consider more sophisticated learning architectures to enhance forecast performance.
References 1. Hoang,T.N.: Application of data classification and regression techniques to forecast production and business data for VNPT. Master thesis, Ha Noi National University (2014) 2. Nguyen, D.A.: Researching and exploiting data of forecasting customers likely to leave VNPT network. Master thesis, Institute of Military Technology (2014) 3. Gurnani, M., Korkey, Y., Shahz, P., Udmalex, S., Sambhe, V., Bhirudk, S.: Forecasting of sales by using fusion of machine learning techniques. In: 2017 International Conference on Data Management, Analytics and Innovation (ICDMAI) Zeal Education Society, pp. 93–101 (2017). https://doi.org/10.1109/ICDMAI.2017.8073492 4. Thiesing, F.M., Vornberger, O.: Sales forecasting using neural networks. In:1997 Proceedings of International Conference on Neural Networks (ICNN 1997), vol. 4, pp. 2125–2128 (1997). https://doi.org/10.1109/ICNN.1997.614234 5. Jagielska, I., Jacob, A.: A neural network model for sales forecasting. In: Proceedings 1993 The First New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, pp. 284–287(1993). https://doi.org/10.1109/ANNES.1993.323024 6. Morioka, Y., Sakurai, K., Yokoyama, A., Sekine, Y.: Next day peak load forecasting using a multilayer neural network with an additional learning. In: Proceedings of the Second International Forum on Applications of Neural Networks to Power Systems, pp. 60–65 (1993). https:// doi.org/10.1109/ANN.1993.264349
7. Onoda, T.: Next day peak load forecasting using an artificial neural network with modified backpropagation learning algorithm. In: Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN 1994), vol. 6, pp. 3766–3769 (1993). https://doi.org/10.1109/ ICNN.1994.374809 8. Webb, T., Schwartz, Z., Xiang, Z., Singal, M.: Revenue management forecasting: the resiliency of advanced booking methods given dynamic booking windows. Int. J. Hosp. Manag. 89 (2020). https://doi.org/10.1016/j.ijhm.2020.102590. ISSN 0278-4319 9. Fiori, A.M., Foroni I.: Prediction accuracy for reservation-based forecasting methods applied in revenue management. Int. J. Hosp. Manag. 84(2020). https://doi.org/10.1016/j.ijhm.2019. 102332. ISSN 0278-4319 10. Gonçalves, C., Pinson, P., Bessa, R.J.: Towards data markets in renewable energy forecasting. IEEE Trans. Sustain. Energy 11 (2020). https://doi.org/10.1109/TSTE.2020.3009615. ISSN 1949-3037
Product Recommendation System Using Opinion Mining on Vietnamese Reviews Quoc-Dinh Truong, Trinh Diem Thi Bui, and Hai Thanh Nguyen
Abstract Opinion mining (also known as sentiment analysis) of customers' reviews and feedback, which can identify user opinions about different product features, has received a lot of attention in numerous studies. The majority of recommender systems recommend products based only on an overall evaluation, and primarily on experts' evaluations. In this work, we propose a method to explore Vietnamese reviews extracted from e-commerce websites in Vietnam in order to provide suggestions for product selection based on product features/functions. The proposed approach introduces a topic-based model to identify the product features mentioned in customer comments/reviews. The proposed system integrates VietSentiWordnet to calculate importance scores for the features of each product. We also construct a product recommendation database which can store customer preferences and purchase history. The work is evaluated on more than 2,000 Vietnamese comments/reviews about laptop products and is expected to be feasible to apply in practical cases.
Q.-D. Truong (B) · H. T. Nguyen College of Information and Communication Technology, Can Tho University, Can Tho, Vietnam e-mail: [email protected] H. T. Nguyen e-mail: [email protected] T. D. Thi Bui Nam Can Tho University, Can Tho, Vietnam © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 N. H. Phuong and V. Kreinovich (eds.), Soft Computing: Biomedical and Related Applications, Studies in Computational Intelligence 981, https://doi.org/10.1007/978-3-030-76620-7_27

1 Introduction
E-commerce has been developing strongly in recent years, bringing many benefits to the global economy. It is not easy to help users choose a product they like on e-commerce websites, because there are too many products in the same price range that are displayed and promoted in a similar way. Numerous organizations and companies have successfully applied recommendation systems to their e-commerce pages, such as amazon.com, movielens.org, youtube.com, etc. In Vietnam, there are also
websites which provide suggestions to customers, including mp3.zing.vn, vatgia.com, etc. Many e-commerce websites have now developed rating systems to support their customers in making purchase decisions. However, such an overall assessment is not reliable enough to reflect the actual quality of a product, especially for groups of products with many different features (for example, a laptop may be considered good in general, but it does not necessarily meet a requirement of large capacity or light weight). Providing suggestions for a suitable product according to one or more features of interest to users based on a predicted-rating approach is not feasible until e-commerce sites support user ratings of product features/functions. This is a problem that most current suggestion systems have not adequately solved. Most e-commerce websites today allow users to provide comments/reviews on products, so that users can express their opinions and feelings about a product's features in a clearer and more detailed way. Comments/reviews have hence become a powerful tool for customers to reflect their opinions about products and services, and customers have become increasingly influential for producers and shops. A survey from PowerReviews1 found that 56% of shoppers not only view online reviews for more information, but also like to give reviews themselves. Users of e-commerce sites also believe that reviews reflect a part of the experience of using the product, so they are more involved in the process of making reviews about the product. Thus, evaluating and exploring customers' opinions through comments/reviews can give us information about customers' likes or dislikes for a certain feature of a product and can provide suggestions that match the customers' preferences. In this study, we propose a method for exploring and mining Vietnamese reviews and extracting information on the features/functions of a product to give appropriate suggestions based on users' preferences and the product features they may be interested in. Also, we propose a database of extracted product features/functions aiming to support the recommender system in providing useful purchase suggestions based on the features/functions of the products users may be interested in.
2 Related Work
Recommender systems (RS) are information filtering systems that aim to predict the preferences or ratings that a user would give to items they have not considered in the past (items can be songs, movies, video clips, books, articles, etc.). In our proposed system, we are only interested in a few main data types: the user, i.e., the person who rates a certain item (the item can be a product, a movie, a song, and so on), and the rating scores (comments) given by the user on that item. Numerous techniques have been proposed for building recommendation systems. The studies can be divided into main groups of approaches: Collaborative Filtering, Content-based Filtering, hybrid approaches, and non-personalized techniques.
1 https://www.powerreviews.com/blog/survey-confirms-the-value-of-reviews/.
Many studies are based on previous user behaviors, such as transaction history, to find rules of interaction between users and items. Suggestion systems based on this approach are not interested in the attributes of the items and are therefore able to exploit information outside the scope of item attributes. The models can be built from the behavior of one user or, more effectively, from many different users with the same characteristics. When the models consider other users' behavior, collaborative filtering uses group knowledge to generate recommendations based on similar users: more clearly, it provides suggestions based on users with the same interests, or those with similar behaviors, who clicked like or gave points to the same items. Numerous works have also explored neighbor-based methods, introducing methods based on the past data of "similar" users or of "similar" items. The author in [1] proposed a recommender system for online sales based on explicit feedback from users through product reviews; the author built an online sales system and deployed the recommendation system to advise customers on products they might like. The author in [2] introduced a solution to build a recommender system for online sales using potential (implicit) feedback from users: methods to collect potential feedback were proposed, a model was then used to learn the appropriate suggestion methods, and predictive models were combined to increase accuracy. The author in [3] introduced an improved collaborative filtering method using clustering and iterative optimization with the PSO algorithm; the accuracy is significantly improved compared to the traditional collaborative filtering method, and the method also addresses the sparse data problem that collaborative filtering methods often encounter. The work in [4] used latent feedback from users (such as the ratio of the length of time a user has listened to a song over its total length) with a ranking algorithm to build a song suggestion system. Trieu Vinh Viem et al. [5] also proposed a film suggestion system; this model can achieve accuracy on par with the latent factor model, maintains the advantages of the neighborhood model, and immediately handles new users who rate for the first time without having to retrain. The authors of [6, 7] applied collaborative filtering in the context of an e-learning system to predict the likelihood that a student may complete a certain learning task. The work in [8] proposed a solution for building a context-aware suggestion system, applied to tourism, to suggest the most suitable destinations for tourists; this system combines methods such as integrating the input context into the suggestion with matrix decomposition techniques and processing the output context to increase the system's accuracy. Some similar studies can be found in [9, 10, 12, 13].
Fig. 1 The proposed architecture for the recommender system
3 Recommendation System Based on Functions/Features of the Product
This work aims to build a system which can give product recommendations based on the features and functions of a product. For example, when a customer wants to buy a laptop and is interested in RAM, the system can suggest, for instance, the Vivobook_S15_S510UQ_BQ475T, which has a large, robust RAM. Or, when the user is looking for a laptop with a good battery that can be used for a long time without recharging, the system may suggest a model such as the FX503VD_E4082T, which has a powerful battery. In order to provide efficient suggestions as illustrated above, the system needs to rely on a recommendation database. This database includes many data entries; each entry refers to a feature of a product and has the structure ⟨product, feature, score⟩. The proposed system consists of two main modules: one module for building the suggestion database, with the architecture shown in Fig. 1, and another module for suggesting products by feature, with the architecture shown in Fig. 2 (4 functions/features are used for providing product recommendations in the experiments). When the system is deployed, the user is asked to choose, in descending order of interest, the features/functions of the product they are interested in. Based on the user's selections, the system accesses the database to provide a list of products with scores calculated in the order of the features the user has selected. If the user does not select any features, the system provides the products with the highest overall rating (calculated over all features). Within the scope of this work, we build a recommendation system for notebook products; product features are the hardware properties of the notebook product group. The interest score of each product is calculated based on the
Fig. 2 The proposed architecture for product recommendation system based on features/functions of the products
analysis of customer opinions in the comments/reviews related to the characteristics of each particular product. In order to make product recommendations according to the features users are interested in, the system needs to build a suggestion database, in which the most important information is the rating score of each feature of each product in the product group. Within the scope of this study, a product's interest score is aggregated by determining the users' views on certain features of the product, as expressed in their comments. Therefore, the system needs to collect customers' comments/reviews related to the products in the product group for which suggestions are needed. The sources of comments/reviews are e-commerce websites that sell laptop products and allow customers to comment on a particular product. The users' views about a certain product feature are extracted and synthesized by the system to build the suggestion database. By using this recommendation database of products and their features, the method is expected to provide a good match between the suggested products and the user's needs.
3.1 Data Collection
The collected data include comments/reviews related to the considered product group (notebooks) from e-commerce websites that provide rating pages, such as fptshop.com, www.thegioididong.com and so on. The data collected for a product group are divided into 2 datasets. The first dataset contains comments on all products and has been labeled with features; it is used to build the topic-based model (the set of keywords often used to refer to a feature) for the product features. The second dataset, which consists of unlabeled comments, is used to calculate the feature scores for each product and to build the suggestion database.
Fig. 3 The proposed architecture for product recommendation system based on functions/features of the product
3.2 Module Building Themes for Feature Sets
In order to build a suggestion database with the described structure, for each collected review the system needs to determine which product features the user is commenting on. However, in practice, users usually use many different words to refer to the same product feature, so we need to build a topic-based model for feature extraction to obtain a set of keywords that represents each feature or function of the product. In the context of this system, the topic-based model built for a feature is simply understood as a list of keywords that represent the characteristics of that feature, i.e., when a keyword is used in a comment, we can know which product feature (e.g., price, design) is being mentioned. For example, battery ("pin") and price are features of a notebook product. Some Vietnamese keywords in reviews, glossed as hours, duration of use, usage, robust/powerful, laptop, activity, and time, describe the battery feature, while keywords glossed as range, affordable, money, and expensive relate to the price feature. Figure 3 presents an algorithm for building a topic-based model for extracting keywords which describe the features/functions of the product. The process of creating the keywords representing a feature can be generalized through the following main steps:
• Browse the comments labelled with product feature topics. For each comment, use the vnTagger tool to perform word segmentation and label each word with its part-of-speech category.
Fig. 4 The steps for building a database to support recommendations
Keywords labelled as common nouns are retained for consideration as keywords representing product features.
• Synthesize the list of returned keywords, where each keyword carries additional information: the number of its occurrences for each feature.
• Select as representative keywords of a feature those keywords whose number of occurrences is higher than a threshold K and is the largest among all features. For instance, if the keyword "money" appears 10 times with the "price" feature and 9 times with the "design" feature, then with K = 5 the keyword "money" is selected as a representative of "price".
Figure 4 shows the steps for building the database that supports feature-based product recommendations. Exploring the set of reviews/comments collected from social network sites and e-commerce websites, for each comment the system performs word segmentation and retains nouns, adjectives, and verbs. The feature topic model is used to determine which feature a comment is referring to. The result of this analysis step is a triple associating the comment with a product and the feature it mentions. The emotional (sentiment) dictionary is then used to determine the rating for the commented feature, and this rating is accumulated into the total score of the corresponding feature of the corresponding product in the suggestion database. The system makes suggestions to the user in two ways: with or without a list of selected features. In case the user does not select any particular feature, the system calculates the score of a product as the total score over all its features. The overall rating of a product is calculated by formula (1).
$Score_{P_i} = \sum_{j=1}^{\text{number of features}} w_{ij}$   (1)

where $w_{ij}$ is the score of feature $j$ of product $p_i$. In the case where the user defines a list of features (in descending order of priority), the recommendation score for each product is calculated by formula (2):

$Score_{P_i} = \sum_{j=1}^{\text{number of selected features}} \alpha_j w_{ij}$   (2)
where $\alpha_j$ is the weight of the corresponding feature $j$; this weight is a parameter of the system and is set by the system administrator according to the principle $\alpha_1 > \alpha_2 > ... > \alpha_m$. In the experiments, we choose $\alpha = \{0.5, 0.3, 0.15, 0.05\}$ [11].
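As an illustration of Eqs. (1) and (2), the following sketch (illustrative names, not the authors' code) ranks products from a mapping of product to per-feature scores w_ij:

```python
ALPHA = [0.5, 0.3, 0.15, 0.05]   # feature weights used in the experiments [11]

def overall_score(weights):                      # Eq. (1)
    return sum(weights.values())

def weighted_score(weights, selected_features):  # Eq. (2)
    return sum(a * weights.get(f, 0.0)
               for a, f in zip(ALPHA, selected_features))

def recommend(feature_scores, selected_features=None, top_k=5):
    """feature_scores: {product: {feature: w_ij}}; returns the top-k products."""
    if selected_features:
        key = lambda item: weighted_score(item[1], selected_features)
    else:
        key = lambda item: overall_score(item[1])
    return sorted(feature_scores.items(), key=key, reverse=True)[:top_k]
```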
4 Experimental Results

4.1 Data Collection and Database for Recommender System
We use the Internet Marketing Ninjas2 tool to collect data from two URLs.3,4 The originally collected data are the web page contents as HTML source code. We parse the HTML tags to retain the necessary information such as product names, comments, and reviews. Comments are paragraphs of text in the class "comment-ask". When filtering the comments, some redundant HTML tags remain, which we deleted manually to obtain the comments in plain text. We collected more than 2,000 comments from the websites fptshop.com.vn and thegioididong.com.vn and split the comments, which can relate to reviews of more than one product feature/function, into more than 4,000 sentences. 3,306 sentences contain information on product features, while the other 764 sentences do not mention any feature. A sentence that does not mention any feature can be a question that the customer asks (for example, "Does this machine have good graphics?"). Some comments about a product are very general (such as "I like it very much"), and some reviews written as slang or abbreviations were eliminated. Some products do not have scores for some particular features; the reason can be that the data collection gathered few comments on the product, or that there are few or no comments about that feature of the product. The set of collected reviews includes a total of 3,251 comments that mention at least 1 feature and contain at least 1 emotional word.
2 https://www.internetmarketingninjas.com/seo-tools/google-sitemap-generator/.
3 https://fptshop.com.vn/may-tinh-xach-tay.
4 https://www.thegioididong.com/laptop/.
Table 1 An illustration on the products with the highest and lowest scores for recommendations

No  Feature or function    Product name
                           The highest score                  The lowest score
1   Pin (battery)          FX503VD_E4082T                     X510UQ_i5_8250U_BR632T
                           15_bs572TU_i3_6006U_2JQ69PA        15_bs555TU_Core_i3_6006U
                           Inspiron_5570_i5_8250U             ThinkPad_E570
2   Màn hình (Monitor)     A315_51_364W                       Inspiron_5570_i5_8250U
                           Vivobook_A510UF_EJ182T             Inspiron_5567_i5_7200U
                           Vivobook_X407MA_BV043T             Inspiron_5567_i5_7200U_M5I5384W
Feature points for the features of each product are accumulated according to formulas (1) and (2). For each product, there are several features that users have rated, and the system calculates the score for each such feature, as illustrated in Table 1.
4.2 Scenarios
In this section, we present the product recommendation results in 2 scenarios. In the first scenario, features/functions of the product (at most 4 features) are selected before generating recommendations; in the second scenario, no feature is selected in advance. The database is used by the system to calculate the suggestion score of each product for the two cases, with or without a list of features of interest. When the user does not choose any feature, the system suggests the products with the largest total feature score in the database (Fig. 5). The following table shows the detailed configurations of the 3 products with the highest total score. Figure 6 shows the list of recommended products when the user selects a list of features of interest in order of priority: battery, RAM, price and quality of the integrated webcam. When the user selects a row in the results, the score of each feature of the respective product is displayed, as in Fig. 7.
4.3 Discussion

The collected reviews include more than 2,000 comments with more than 4,000 sentences; 3,306 sentences are related to product features and 764 sentences do not mention any product feature. Sentences that do not comment on product features can be questions that users asked on the sites (to get more information), sentences that contain special characters, sentences written in unsigned (non-diacritic) Vietnamese such as "man hinh, gia, thiet ke" (monitor, price, design),
Fig. 5 A list of suggested products in the case the user does not select any feature, and the detailed feature scores of the selected product
Fig. 6 A list of suggested products in the case the user selects a list of features of interest
Fig. 7 The detailed feature scores of a product selected from the recommendation list
or abbreviations in informal youth slang (such as "okay, right/fine"), or sentences that are meaningful but not related to the product (for example, "Thank you, your staff member Ngan gave us enthusiastic support"). Because reviews are subjective and depend on the needs of the user, reviews of the same product differ. For example, for the same battery life of 3 hours, ordinary users who only use the laptop for surfing the Internet or running normal office software give the battery a good review, whereas users who regularly run heavy software such as Photoshop or video processing may give it a bad review. This matches the reality that users choose the products that suit their own needs and, when they write reviews, they only consider the features/functions they are interested in. Some sentences contain product features but no emotional words, and such sentences are removed. For example, "should add the ssd drive" refers to the hard-drive feature but contains no emotional word; this is one of the limitations that the system has not solved yet. Some products that are rated well by experts (e.g., Apple products) were not recommended by the system because there are too few comments regarding the Apple product group; the main reason is their price level, which does not suit the majority of ordinary users. The product recommendation results based on customer comment analysis are also consistent with user ratings on the sales pages. For example, the products that the system suggests as the best when the user does not choose any feature have quite high ratings on well-known websites such as thegioididong.com and fptshop.com: 3.6, 4.8, and 4.5, respectively.
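To illustrate the filtering step discussed above, here is a minimal Python sketch that keeps only sentences mentioning at least one feature keyword and at least one emotional word. The word lists are hypothetical placeholders for the topic-based keyword list and the VietSentiWordNet dictionary used in the paper, and the tokenization is deliberately simplistic.

```python
# A minimal sketch (not the authors' implementation) of the sentence filter:
# a sentence is kept only if it mentions at least one product feature and
# contains at least one emotional word.  Both word lists below are
# hypothetical placeholders for the real keyword list and VietSentiWordNet.
FEATURE_KEYWORDS = {"battery", "monitor", "ram", "price", "webcam", "ssd"}
EMOTIONAL_WORDS = {"good", "bad", "great", "weak", "beautiful", "slow"}

def keep_sentence(sentence):
    words = set(sentence.lower().split())
    has_feature = bool(words & FEATURE_KEYWORDS)
    has_emotion = bool(words & EMOTIONAL_WORDS)
    return has_feature and has_emotion

# "should add the ssd drive" mentions a feature but has no emotional word,
# so it is filtered out, as described in the discussion above.
print(keep_sentence("should add the ssd drive"))   # False
print(keep_sentence("the battery is good"))        # True
```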
5 Conclusion

In this study, we propose a feature-based product recommendation system that mines customer comments. The system automatically collects a dataset of customers' reviews and comments about laptop products from the online sales systems of the FPT and Mobile World enterprises. We deployed a topic-based model for building a
list of keywords that represent the features of laptop product groups. The proposed method also handles comments that mention several features and contain several emotional words. We introduced a solution for calculating feature scores for each product based on the VietSentiWordNet emotional dictionary and built a database of product recommendations by feature. The experiments were conducted and tested in two scenarios: the first uses the list of features that the user is interested in, while the other uses the overall evaluation of the product. Further research should collect and investigate a larger and more diverse evaluation dataset, in terms of both the number of comments and the product groups, to allow a more comprehensive and objective assessment of the proposed model.
References

1. Nguyen, H.D., Nguyen, N.T.: (Product recommender system in online sale using collaborative filtering technique). J. Can Tho Univ. 31a, 36–51 (2014). ISSN 1859-2333
2. Luu, T.A.N., Nguyen, N.T.: (Methods of building product suggestion systems using potential feedbacks). In: National Conference FAIR 2015, pp. 600–611. https://doi.org/10.15625/vap.2015.000199
3. Pham, M.C., et al.: (Recommendation systems using swarm optimization). In: The 8th National Scientific Conference - Basic Research and Application of Information Technology 2013, pp. 153–159. https://doi.org/10.15625/FAIRVII.2014-0336
4. Nguyen, N.T., Nguyen, P.T.: (One solution for building song suggestions). In: The XVII National Conference on Information Technology, pp. 149–154 (2014). ISBN 978-604-67-0426-3
5. Trieu, V.V., Trieu, Y.Y., Nguyen, N.T.: (Building a movie recommendation system based on a neighborhood factor model). J. Can Tho Univ. 30a, 170–179 (2013). ISSN 1859-2333
6. Thai-Nghe, N., Drumond, L., Horvath, T., Krohn-Grimberghe, A., Nanopoulos, A., Schmidt-Thieme, L.: Factorization techniques for predicting student performance. In: Educational Recommender Systems and Technologies: Practices and Challenges, pp. 129–153 (2012). ISBN 978-1-61350-489-5. https://doi.org/10.4018/978-1-61350-489-5.ch006
7. Thai-Nghe, N., Horváth, T., Schmidt-Thieme, L.: Context-aware factorization for personalized student's task recommendation. In: Proceedings of the UMAP 2011 International Workshop on Personalization Approaches in Learning Environments, vol. 732, pp. 13–18 (2011). ISSN 1613-0073
8. Thai-Nghe, N., Mai, T.N., Nguyen, H.H.: (An approach to building a recommendation system according to context). In: Proceedings of the 8th National Conference "Basic Research and Application of Information Technology" - FAIR 2015, pp. 495–501 (2015). ISBN 978-604-913-472-2
9. Cavalcanti, D.C., Prudêncio, R.B.C., Pradhan, S.S., Shah, J.Y., Pietrobon, R.S.: Good to be bad? Distinguishing between positive and negative citations in scientific impact. In: 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence, pp. 156–162 (2011). https://doi.org/10.1109/ICTAI.2011.32
10. Vu, X.S., Song, H.J., Park, S.B.: Building a Vietnamese SentiWordNet using Vietnamese electronic dictionary and string kernel. In: Knowledge Management and Acquisition for Smart Systems and Services 2014. Lecture Notes in Computer Science, vol. 8863, pp. 223–235. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-13332-4_18
11. Ho, T.T.: (Social media analytics based on topic models and applications). Ph.D. thesis, Ho Chi Minh Information Technology University (2015)
12. Tagliabue, J., Yu, B., Bianchi, F.: The embeddings that came in from the cold: improving vectors for new and rare products with content-based inference. In: RecSys 2020: Fourteenth ACM Conference on Recommender Systems, pp. 577–578 (2020). https://doi.org/10.1145/3383313.3411477
13. Dadouchi, C., Agard, B.: Recommender systems as an agility enabler in supply chain management. J. Intell. Manuf. 31 (2020). https://doi.org/10.1007/s10845-020-01619-5