Studies in Fuzziness and Soft Computing Volume 416
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Fuzziness and Soft Computing” contains publications on various topics in the area of soft computing, which include fuzzy sets, rough sets, neural networks, evolutionary computation, probabilistic and evidential reasoning, multi-valued logic, and related fields. The publications within “Studies in Fuzziness and Soft Computing” are primarily monographs and edited volumes. They cover significant recent developments in the field, both of a foundational and applicable character. An important feature of the series is its short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at https://link.springer.com/bookseries/2941
Mahdi Eftekhari · Adel Mehrpooya · Farid Saberi-Movahed · Vicenç Torra
How Fuzzy Concepts Contribute to Machine Learning
Mahdi Eftekhari Department of Computer Engineering Shahid Bahonar University of Kerman Kerman, Iran
Adel Mehrpooya Department of Computer Engineering Shahid Bahonar University of Kerman Kerman, Iran
Farid Saberi-Movahed Department of Applied Mathematics Graduate University of Advanced Technology Kerman, Iran
Vicenç Torra Department of Computing Sciences Umeå University Umeå, Sweden
ISSN 1434-9922  ISSN 1860-0808 (electronic)
Studies in Fuzziness and Soft Computing
ISBN 978-3-030-94065-2  ISBN 978-3-030-94066-9 (eBook)
https://doi.org/10.1007/978-3-030-94066-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
To the soul of my father who taught like a loving teacher until the last moments of his life and dedicated to my mother a living statue of love. Mahdi Eftekhari To my parents with love, who gave me “95% help and support” and “95% mental health issues” to enable me to handle this formidable task. No, it did not make up 190%, they multitasked. Adel Mehrpooya To Parvin Rezaei who opened up new horizons in my life. Farid Saberi-Movahed To the researchers that have been studying HFS. Vicenç Torra
Preface
The aim of this book is to introduce some new trends of research regarding the use of fuzzy sets, and particularly hesitant fuzzy sets, in machine learning. In data-driven systems, we often need to deal with uncertainty. This uncertainty is caused by the available data and also by stochastic elements inherent to the problems being modeled. Machine learning algorithms face these problems when building their models. Fuzzy set theory provides some tools for this purpose. In this book, we review some of the tools based on hesitant fuzzy sets that can be used in machine learning problems. This book is addressed to readers who are interested in both machine learning and fuzzy set extensions. The contents are organized into three parts comprising ten chapters, in addition to a chapter on preliminaries that introduces the various concepts used in the next three parts. That is, Chap. 1 presents the preliminaries. Part I focuses on the application of both fuzzy set and hesitant fuzzy set concepts in clustering algorithms and on other unsupervised learning approaches. This part begins with Chap. 2, which describes the application of hesitant fuzzy concepts to fuse the results of different clustering algorithms. This chapter also presents the concept of hesitant fuzzy partitions. Then, Chap. 3 introduces an unsupervised feature selection method based on the concepts of sensitivity and correlation. The definition of sensitivity given in this chapter is based on the gradient of the density function in subtractive clustering with respect to a given feature. Part II discusses supervised learning problems and explains cases for which fuzzy and hesitant fuzzy concepts can be used to boost the performance of supervised tasks. It is composed of Chaps. 4–7, whose description is briefly provided here. Chapters 4 and 5 study two extensions of fuzzy decision trees developed in recent years. Chapter 6 focuses on using hesitant fuzzy sets in decision trees when the data to learn the trees are imbalanced, that is, the use of fuzzy sets for imbalanced classification problems, a particular type of supervised learning task. More precisely, the chapter describes the use of various information gain measures to combine information, as well as the use of concepts related to hesitant fuzzy sets to combine the results of fuzzy decision trees. Then, Chap. 7 discusses the application of hesitant fuzzy sets to ensemble learning algorithms. In ensemble learning, and more particularly in dynamic ensemble selection problems, situations arise in which multiple-criteria
decision-making notions can be used. Chapter 7 concludes this part. It considers the problem of treating different machine learning algorithms as a set of experts and of processing the associated information using hesitant fuzzy elements. Part III provides a brief survey of recent uses of hesitant fuzzy set and rough set concepts in supervised dimension reduction problems. In Chap. 8, some similarity measures as well as feature evaluation metrics are defined in the form of hesitant fuzzy sets. These sets are then applied to combine different measures and criteria of feature selection. The main idea of this chapter is to introduce some methods for combining different feature ranking algorithms and metrics via hesitant fuzzy sets. A distributed version of hesitant fuzzy-based algorithms is then provided in Chap. 9. This distributed version is appropriate for handling big data problems. Then, Chap. 10 reviews an approach to combine different rough set-based feature selection metrics through hesitant fuzzy sets. This part is concluded with Chap. 11. In this last chapter, we explain how hesitant fuzzy correlation can be used to tackle supervised feature selection issues. All the approaches presented in this book have a common motif. They use hesitant fuzzy set concepts in machine learning problems, and they consider these problems from a multi-criteria decision-making perspective. More particularly, these problems are framed by considering different machine learning algorithms as experts and then taking the results of these algorithms as the experts' opinions. Then, hesitant fuzzy sets are used to combine these opinions. The algorithms presented in this book have been described and tested in a set of papers. These papers report the good performance of the approaches. In data-driven machine learning, model building and selection is based on a set of criteria. Naturally, efficiency and accuracy are among them. This usually implies that more complex models with more parameters perform better. This contrasts with model simplicity (and Occam's razor, as we mention in Sect. 5.1). Explainable AI and the need to build transparent models and make transparent decisions add further constraints to how machine learning models are identified and how these models are used. Privacy regulations also add constraints to model building and selection. The need to take all these aspects into account makes machine learning problems challenging. In this book, we have described some of the first contributions of hesitant fuzzy sets to this complex problem. Further solutions can be developed to tackle these other competing requirements. The authors appreciate and value the work of the students who have contributed to the papers used as the major references of this book over the course of 6 years since 2014. The authors gratefully acknowledge Ms. S. Barchinejad, Ms. L. Aliahmadipour, Ms. M. Mokhtia, Ms. S. Sardari, Mr. M. Zeinalkhani, Mr. M. Mohtashami, Mr. M. K. Ebrahimpour, and Mr. J. Elmi.

Kerman, Iran
Kerman, Iran
Kerman, Iran
Umeå, Sweden
December 2021
Mahdi Eftekhari Adel Mehrpooya Farid Saberi-Movahed Vicenç Torra
Contents

1 Preliminaries
  1.1 Fuzzy Notions
    1.1.1 Fuzzy Sets
    1.1.2 Membership Functions
    1.1.3 Hesitant Fuzzy Sets
  1.2 Decision Tree
    1.2.1 Fuzzy Decision Tree
  1.3 Dimensionality Reduction and Related Subjects
    1.3.1 Pearson's Correlation Coefficient Measure
    1.3.2 Correlation Coefficient of Hesitant Fuzzy Sets
    1.3.3 Correlation-Based Merit
    1.3.4 Similarity Measures
    1.3.5 Rough Set and Fuzzy-Rough Set Basic Concepts
    1.3.6 Weighted Rough Set Basic Concepts
    1.3.7 Fuzzy Rough Set Basic Concepts
  1.4 Fuzzy Clustering
    1.4.1 Fuzzy c-Means Clustering
    1.4.2 Subtractive Clustering
    1.4.3 Fuzzy Partitions
    1.4.4 I-Fuzzy Partitions
  1.5 Linear Regression
    1.5.1 Ridge Regression
    1.5.2 LASSO Regression
    1.5.3 Elastic Net Regression
  References

Part I Unsupervised Learning

2 A Definition for Hesitant Fuzzy Partitions
  2.1 Introduction
  2.2 H-Fuzzy Partition
    2.2.1 Construction of H-Fuzzy Partitions
  2.3 Discussion
  2.4 Computer Programming Exercises for Future Works
  References

3 Unsupervised Feature Selection Method Based on Sensitivity and Correlation Concepts for Multiclass Problems
  3.1 Introduction
    3.1.1 GA for Feature Selection
  3.2 Proposed Unsupervised Feature Selection Method
    3.2.1 Feature Relevance Evaluation via Sensitivity Analysis Based on Subtractive Clustering
    3.2.2 A General Scheme for Sensitivity and Correlation Based Feature Selection (SCFS)
  3.3 Discussion
  3.4 Computer Programming Exercise for Future Works
  References

Part II Supervised Learning Classification and Regression

4 Fuzzy Partitioning of Continuous Attributes Through Crisp Discretization
  4.1 Introduction
  4.2 Discretization Methods
  4.3 Employing Discretization Methods for Fuzzy Partitioning
    4.3.1 Defining MFs Over Crisp Partitions
    4.3.2 Fuzzy Entropy Based Fuzzy Partitioning
  4.4 Discussion
  4.5 Computer Programming Exercises for Future Works
  References

5 Comparing Different Stopping Criteria for Fuzzy Decision Tree Induction Through IDFID3
  5.1 Introduction
  5.2 Stopping Criteria
  5.3 Iterative Deepening FID3
    5.3.1 The Stopping Criterion of IDFID3
    5.3.2 Comparison Method for Various Stopping Criteria
  5.4 Discussion
  5.5 Computer Programming Exercise for Future Works
  References

6 Hesitant Fuzzy Decision Tree Approach for Highly Imbalanced Data Classification
  6.1 Introduction
  6.2 Hesitant Fuzzy Decision Tree Approach
    6.2.1 Data Balancing
    6.2.2 Generating the Membership Functions
    6.2.3 Construction of Fuzzy Decision Trees
    6.2.4 The Aggregation of FDTs
    6.2.5 Notations for Different FDT Classifiers
  6.3 Discussion
  6.4 Computer Programming Exercise for Future Works
  References

7 Dynamic Ensemble Selection Based on Hesitant Fuzzy Multiple Criteria Decision-Making
  7.1 Introduction
  7.2 Multiple Classifier Systems and Dynamic Selection Techniques
    7.2.1 Neighborhood Selection Techniques
    7.2.2 Competence Level Calculation
    7.2.3 Classifier Selection
  7.3 Dynamic Ensemble Selection Based on Hesitant Fuzzy Concepts
  7.4 Discussion
  7.5 Computer Programming Exercises for Future Works
  References

Part III Dimension Reduction

8 Ensemble of Feature Selection Methods: A Hesitant Fuzzy Set Based Approach
  8.1 Introduction
  8.2 The MRMR-HFS Method
  8.3 Discussion
  8.4 Computer Programming Exercises for Future Works
  References

9 Distributed Feature Selection: A Hesitant Fuzzy Correlation Concept for High-Dimensional Microarray Datasets
  9.1 Introduction
  9.2 Distributed HCPF Feature Selection Algorithm
    9.2.1 The Partitioning Process in the Distributed Version of HCPF
    9.2.2 HCPF: A Feature Selection Algorithm Based on HFS
    9.2.3 The Merging Process in the Distributed Version of HCPF
  9.3 Discussion
  9.4 Computer Programming Exercises for Future Works
  References

10 A Hybrid Filter-Based Feature Selection Method via Hesitant Fuzzy and Rough Sets Concepts
  10.1 Introduction
  10.2 Proposed Hybrid Filter-Based Feature Selection Method
    10.2.1 Sample Weighting Approach
    10.2.2 Primary Feature Subset Selection
    10.2.3 Fuzzy Rough-Based Elimination
  10.3 Discussion
  10.4 Computer Programming Exercises for Future Works
  References

11 Feature Selection Based on Regularization of Sparsity Based Regression Models by Hesitant Fuzzy Correlation
  11.1 Introduction
  11.2 Feature Selection Method Based on Regularization
    11.2.1 Regularized Ridge Regression
    11.2.2 Regularized LASSO and Elastic Net Regression
  11.3 Alternative Sparsity Based Regression Methods
  11.4 Discussions
  11.5 Computer Programming Exercises for Future Works
  References
Chapter 1
Preliminaries
In this chapter, the reader is provided with some background information that is used in the next chapters. More particularly, hesitant fuzzy sets and their relevant definitions are given in Sect. 1.1. Next, decision trees and their fuzzy version are quickly reviewed in Sect. 1.2. Then, in Sect. 1.3, a brief description of dimensionality reduction techniques and of similarity measures is provided. After that, some well-known clustering methods are considered in Sect. 1.4. Finally, Sect. 1.5 presents three significant linear regression algorithms.
1.1 Fuzzy Notions

This section is composed of three subsections: Sect. 1.1.1, which introduces fuzzy sets, Sect. 1.1.2, which focuses on membership functions, and Sect. 1.1.3, which presents some basic concepts and notations of fuzzy set and hesitant fuzzy set theories.
1.1.1 Fuzzy Sets

Fuzzy set theory was introduced by Zadeh in 1965 [1]. In Boolean logic only two truth values are considered: true and false, often represented by one and zero. Similarly, in classical set theory, elements can either belong to a set or not belong to it. We can define a set A in terms of a characteristic function χ_A on a reference set, or universe of discourse, X. Then χ_A is a function that, given an element x ∈ X, assigns zero if the element is not in A and one if it is in A. That is, χ_A : X → {0, 1}.
Zadeh proposed to use the whole interval [0, 1] instead of {0, 1}. In this way, a range of logical values can be considered in which zero and one are also included. Fuzzy sets thus permit the representation of partial membership. Formally, we define a fuzzy set A in terms of a membership function μ_A that assigns a value in the unit interval to each element of the reference set X. That is, μ_A : X → [0, 1]. Naturally, when the range of μ_A is just {0, 1}, the fuzzy set is just a standard set. When we need to distinguish between fuzzy sets and standard sets, we use the term crisp set for the latter.
Several functions have been defined to operate with fuzzy sets. For example, the union, intersection and complement of fuzzy sets have been defined. Given two fuzzy sets A and B represented by their membership functions μ_A and μ_B, we can compute their union and intersection by means of the following two expressions:
• μ_{A∪B}(x) = max(μ_A(x), μ_B(x)),
• μ_{A∩B}(x) = min(μ_A(x), μ_B(x)),
and we can compute the complement of a fuzzy set A, denoted A^c, as:
• μ_{A^c}(x) = 1 − μ_A(x).
These definitions are generalizations of the standard operators for classical sets. That is, when μ_A and μ_B are crisp sets, μ_{A∪B} and μ_{A∩B} will be the standard union and the standard intersection. Other definitions have been proposed for the union, intersection, and complement of fuzzy sets instead of the maximum, the minimum, and 1 − μ_A(x). Functions that generalize the classical union are known as t-conorms, the ones that generalize the classical intersection are known as t-norms, and the ones that generalize the classical complement are known as negations. Examples of t-norms, besides the minimum, include T(a, b) = ab (the product) and T(a, b) = max{0, a + b − 1} (Łukasiewicz's t-norm). Examples of t-conorms, besides the maximum, include S(a, b) = a + b − ab (the probabilistic sum) and S(a, b) = min(a + b, 1) (the bounded sum or Łukasiewicz's t-conorm).
Other operations have been defined as well on fuzzy sets. We will use the sigma count of a fuzzy set, which generalizes the concept of cardinality. Given a fuzzy set A represented by its membership function μ_A on a reference set X, we define its sigma count as:

|A| = \sum_{x \in X} \mu_A(x).   (1.1)

Observe that for crisp sets, this is just the cardinality, as μ_A(x) will be either zero or one.
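To make these operations concrete, the following Python sketch (our illustration, not code from the book) represents a finite fuzzy set as a dictionary from elements of X to membership degrees and implements the standard union, intersection, complement and the sigma count of Eq. (1.1).

```python
# Minimal sketch: a finite fuzzy set as a dict {element: membership in [0, 1]}.

def f_union(a, b):
    """Standard fuzzy union: pointwise maximum of memberships."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in sorted(set(a) | set(b))}

def f_intersection(a, b):
    """Standard fuzzy intersection: pointwise minimum of memberships."""
    return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in sorted(set(a) | set(b))}

def f_complement(a):
    """Standard fuzzy complement: 1 minus the membership degree."""
    return {x: 1.0 - mu for x, mu in a.items()}

def sigma_count(a):
    """Sigma count |A| of Eq. (1.1): the sum of the membership degrees."""
    return sum(a.values())

A = {"x1": 0.2, "x2": 0.7, "x3": 1.0}
B = {"x1": 0.5, "x2": 0.4, "x3": 0.0}
print(f_union(A, B))         # memberships 0.5, 0.7, 1.0 for x1, x2, x3
print(f_intersection(A, B))  # memberships 0.2, 0.4, 0.0 for x1, x2, x3
print(sigma_count(A))        # 1.9
```

For a crisp set (all memberships 0 or 1), sigma_count reduces to the ordinary cardinality, as noted above.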
1.1.2 Membership Functions

As stated above, Membership Functions (MFs) describe the degree of membership of each element of the reference set to the corresponding fuzzy set. Triangular, trapezoidal and Gaussian membership functions are types of functions that are commonly used in pattern recognition applications. Triangular MFs can be characterized by three parameters a, b and c. They are defined by Eq. (1.2). Figure 1.1a shows an example of a triangular MF.

Triangle(x; a, b, c) = \begin{cases} 0, & \text{if } x \le a, \\ \frac{x-a}{b-a}, & \text{if } a \le x \le b, \\ \frac{c-x}{c-b}, & \text{if } b \le x \le c, \\ 0, & \text{if } c \le x. \end{cases}   (1.2)

Trapezoidal MFs, given by Eq. (1.3), require one more parameter, d, compared to the triangular MFs. Figure 1.1b illustrates a typical example of a trapezoidal MF.

Trapezoidal(x; a, b, c, d) = \begin{cases} 0, & \text{if } x \le a, \\ \frac{x-a}{b-a}, & \text{if } a \le x \le b, \\ 1, & \text{if } b \le x \le c, \\ \frac{d-x}{d-c}, & \text{if } c \le x \le d, \\ 0, & \text{if } d \le x. \end{cases}   (1.3)

Gaussian MFs require two parameters, c and σ, and are formulated by Eq. (1.4). Figure 1.1c portrays a Gaussian MF in which c = 5 and σ = 2.

gaussian(x; c, σ) = \exp\left( -\frac{1}{2} \left( \frac{x-c}{\sigma} \right)^2 \right).   (1.4)
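The following minimal Python sketch (ours, not from the book) evaluates the three membership functions of Eqs. (1.2)–(1.4) at a point; the parameter values in the usage lines are arbitrary examples.

```python
import math

def triangle(x, a, b, c):
    """Triangular MF of Eq. (1.2)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def trapezoid(x, a, b, c, d):
    """Trapezoidal MF of Eq. (1.3)."""
    if x <= a or x >= d:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

def gaussian(x, c, sigma):
    """Gaussian MF of Eq. (1.4)."""
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)

print(triangle(4.0, a=2.0, b=5.0, c=8.0))   # 0.666...
print(trapezoid(6.0, 2.0, 4.0, 7.0, 9.0))   # 1.0 (on the plateau)
print(gaussian(5.0, c=5.0, sigma=2.0))      # 1.0, the peak of the MF in Fig. 1.1c
```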
Fig. 1.1 Three types of MFs: a Triangular MF. b Trapezoidal MF. c Gaussian MF

A 2-sided Gaussian MF is similar to the previous function, but it has an interval whose elements have membership value equal to one, and the shape of the function differs to the right and to the left of this interval. We use c_left and c_right to denote the interval with membership equal to one, and σ_left and σ_right to denote the parameters that control the width of the function on each side. The formal definition follows.
2SidedGaussian(x; σ_left, c_left, c_right, σ_right) = \begin{cases} \exp\left( -\frac{1}{2} \left( \frac{x - c_{\mathrm{left}}}{\sigma_{\mathrm{left}}} \right)^2 \right), & \text{if } x \le c_{\mathrm{left}}, \\ 1, & \text{if } c_{\mathrm{left}} \le x \le c_{\mathrm{right}}, \\ \exp\left( -\frac{1}{2} \left( \frac{x - c_{\mathrm{right}}}{\sigma_{\mathrm{right}}} \right)^2 \right), & \text{if } c_{\mathrm{right}} \le x. \end{cases}   (1.5)

There are different methods in the literature to generate membership functions. Medasani et al. [2] and Chi et al. [3] provide an overview of these techniques. Heuristic methods [4] use predefined shapes for MFs. These techniques have been successfully used in rule-based pattern recognition applications. Histogram-based methods [2] employ histograms of features, which provide information regarding the distribution of the input feature values. Probability-based methods define MFs by converting probability distributions to possibility distributions [5]. A possibility distribution on a reference set X is a function π : X → [0, 1]. While probability distributions are required to add to one (i.e., \sum_{x \in X} p(x) = 1 if X is finite), this is not the case for possibility distributions. A distribution π can be understood as a fuzzy set. Finally, there are methods that define MFs applying the notion of fuzzy nearest neighbor, introduced by Keller et al. [6]. In neural network based techniques, feed-forward multi-layer neural networks are selected as generators of MFs [2], and in clustering-based methods, clustering techniques such as fuzzy c-means determine the MFs. Two other approaches are of interest. Yuan and Shaw [7] introduced an iterative procedure for MF construction, and Pedrycz [8] proposed fuzzy equalization to produce membership functions. A brief description of these two last methods is provided in the following paragraphs, since they have been the center of attention in recent years and have been compared with some methods presented in this book. Note that these three ways of defining fuzzy sets have been combined. E.g., [9, 10] use an approach to clustering where the shape of the membership functions is given. This approach, called membership based clustering, combines the first and the third approaches above.
Yuan and Shaw's method to generate membership functions takes an input parameter k to construct k triangular MFs through an iterative procedure so that adjacent MFs cross at a membership value equal to 0.5. The iterative procedure starts with evenly distributed triangular MFs on the range of each attribute and adjusts the centers of the MFs in order to reduce the total distance of the examples to the centers of their nearest MFs. In each iteration, it randomly selects a sample and adjusts the nearest MF's center using a learning rate η that is a monotonically decreasing function. The iterations continue until the total distance of the examples to the centers of their nearest MFs converges.
The fuzzy equalization based MF construction method takes an input parameter c to define c triangular MFs with a 1/2 overlap between every two successive MFs. It constructs the triangular MFs from left to right and uses the two last parameters of each triangular MF (i.e., b and c) as the two first parameters of the next triangular MF (i.e., a and b). Then, it determines the last parameter of each triangular MF so that the probability of the fuzzy event associated with the generated MF takes the value 1/c. Further details about the fuzzy equalization method can be found in [8].
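Neither of the two methods is reproduced here in full. The short sketch below (ours) only builds the common starting point that both share: k evenly distributed triangular MFs over an attribute range whose adjacent MFs cross at membership 0.5. The iterative center adjustment of Yuan and Shaw and the probability-based equalization of Pedrycz are not implemented.

```python
# Sketch (ours): k evenly spaced triangular MFs on [lo, hi], adjacent MFs
# crossing at membership 0.5. Does NOT perform Yuan-Shaw center adjustment
# or Pedrycz's fuzzy equalization; it is only their shared initialization.

def uniform_triangular_partition(lo, hi, k):
    """Return k (a, b, c) parameter triples for Eq. (1.2); requires k >= 2."""
    step = (hi - lo) / (k - 1)
    centers = [lo + i * step for i in range(k)]
    return [(b - step, b, b + step) for b in centers]

# Five MFs on the range [0, 10]; centers at 0, 2.5, 5, 7.5 and 10.
for a, b, c in uniform_triangular_partition(0.0, 10.0, 5):
    print(round(a, 2), round(b, 2), round(c, 2))
```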
1.1.3 Hesitant Fuzzy Sets

Since Zadeh introduced fuzzy set theory, a number of extensions have been proposed. They include intuitionistic fuzzy sets [11], type-2 fuzzy sets [12], type-n fuzzy sets, fuzzy multisets and hesitant fuzzy sets [13, 14]. The theory of intuitionistic fuzzy sets was constructed on three fundamental concepts: membership functions, non-membership functions and hesitancy functions. Type-2 fuzzy sets were established according to the principle that the membership of a given element is permitted to be a fuzzy set. Type-n fuzzy sets are in fact an extension of type-2 fuzzy sets for which the membership can be a type-(n − 1) fuzzy set. Fuzzy multiset theory introduces an extension of fuzzy sets in which each element is allowed to be repeated two or more times, each time with a different membership degree. In hesitant fuzzy sets (HFSs), the membership is permitted to be a set of possible values. To be more specific, HFSs model the uncertainty provoked by the hesitation that may appear when we need to assign the membership degree of an element to a fuzzy set [15]. Later, some connections between HFSs and fuzzy multisets, and between HFSs and intuitionistic and type-n fuzzy sets, were established. In particular, it was proven that the envelope of an HFS is an intuitionistic fuzzy set. Many researchers have studied this concept and developed further extensions, for example, dual hesitant fuzzy sets [16] and generalized hesitant fuzzy sets [17]. HFSs and their variants have been applied to several types of problems, for example, decision-making and information fusion; see for example the results in [15, 18]. Other applications are in the field of machine learning, where we find hesitant fuzzy decision trees [19], HFS-based methods for feature selection [20–22], and hesitant fuzzy clustering [23]. See e.g. [24] for an overview of results related to hesitant fuzzy sets.
1.1.3.1 Basic Definitions

In what follows, the definition of an HFS and the concept of the correlation coefficient between two HFSs are described.

Definition 1.1.1 ([14]). Let X be a universe of discourse. A Hesitant Fuzzy Set (HFS) A on X is defined in terms of a Hesitant Fuzzy (HF) membership function h_A : X → P_f([0, 1]), where P_f([0, 1]) is the set of all non-empty finite subsets of [0, 1]. Formally, this is defined by Eq. (1.6):

A = \{ \langle x, h_A(x) \rangle \mid x \in X \}.   (1.6)

In this expression, h_A returns a non-empty finite subset of [0, 1] for every x ∈ X. Hence, h_A(x) is the set of all the possible membership values of x in the interval [0, 1]. Following Z. Xu and others (see e.g. [25]), we call h_A(x) for a given x ∈ X a Hesitant Fuzzy Element (HFE) of A.

Example 1.1.1 ([25]). Let X = {x_1, x_2, x_3} be a universe of discourse. Assume that h_A(x_1) = {0.2, 0.4, 0.5}, h_A(x_2) = {0.3, 0.4} and h_A(x_3) = {0.3, 0.2, 0.5, 0.6} are the HFEs of an HFS A corresponding to x_1, x_2 and x_3, respectively. Then, the hesitant fuzzy set A is

A = \{ \langle x_1, h_A(x_1) \rangle, \langle x_2, h_A(x_2) \rangle, \langle x_3, h_A(x_3) \rangle \}.

Definition 1.1.2 ([14]). Given an HF membership function h, we define the Intuitionistic Fuzzy Value (IFV) A_env(h) as the envelope of h, in which A_env(h) has the representation (h^-, 1 − h^+), where h^- = inf{γ | γ ∈ h} and h^+ = sup{γ | γ ∈ h}.

Considering the relationship between HFEs and IFVs, Xu and Xia [26] defined some operations on a collection of HFs. Let h, h_1 and h_2 be HFs and suppose that λ is a real number. Then,
• h^λ = ∪_{γ ∈ h} {γ^λ},
• λh = ∪_{γ ∈ h} {1 − (1 − γ)^λ},
• h_1 ⊕ h_2 = ∪_{γ_1 ∈ h_1, γ_2 ∈ h_2} {γ_1 + γ_2 − γ_1 γ_2},
• h_1 ⊗ h_2 = ∪_{γ_1 ∈ h_1, γ_2 ∈ h_2} {γ_1 γ_2},
• h^c = ∪_{γ ∈ h} {1 − γ},
• h_1 ∪ h_2 = ∪_{γ_1 ∈ h_1, γ_2 ∈ h_2} max{γ_1, γ_2},
• h_1 ∩ h_2 = ∩_{γ_1 ∈ h_1, γ_2 ∈ h_2} min{γ_1, γ_2}.

Here, we recall some aggregation operators involved in hesitant fuzzy sets which will be used in the next chapters. For a given HFS A on the reference set X, with membership degrees expressed in terms of the function h_A, we will denote by h_{Aσ(j)}(x) the j-th largest value in the set h_A(x). That is, Aσ(j) is the permutation of the membership values in h_A(x) such that h_{Aσ(j)}(x) ≥ h_{Aσ(j+1)}(x). Note that the permutation depends on x and, naturally, on h. We will use l_i to denote the cardinality of the set associated to x_i ∈ X, or l_x to denote the cardinality of the set associated to x ∈ X. That is, l_i = |h_A(x_i)|. In some applications the values in h_A(x_i) may be provided by experts. Because of that, we can understand l_i as the number of experts.

Definition 1.1.3 ([25]). For an HFS A = \{ \langle x_i, h_A(x_i) \rangle \mid x_i \in X, i = 1, 2, \ldots, n \}, the information energy of A is defined by Eq. (1.7):

E_{HFS}(A) = \sum_{i=1}^{n} \left( \frac{1}{l_i} \sum_{j=1}^{l_i} h^2_{A\sigma(j)}(x_i) \right),   (1.7)

where n is the cardinality of the universe of discourse, h_{Aσ(j)}(x_i) is the j-th largest membership value of the i-th element of the universe of discourse, and l_i is the number of values in h_A(x_i). When a single element x_i ∈ X is considered, we use the following expression:

E_{HFS}(A)(i) = \frac{1}{l_i} \sum_{j=1}^{l_i} h^2_{A\sigma(j)}(x_i), \quad i = 1, 2, \ldots, n.   (1.8)

Definition 1.1.4 ([18]). Let A and B be two typical HFSs. The correlation between A and B is given by Eq. (1.9):

C_{HFS}(A, B) = \sum_{i=1}^{n} \left( \frac{1}{l_i} \sum_{j=1}^{l_i} h_{A\sigma(j)}(x_i) \, h_{B\sigma(j)}(x_i) \right),   (1.9)

where Aσ(j) and Bσ(j) are defined as in Eq. (1.7) (i.e., Aσ is the ordering associated to A and Bσ is the one associated to B).
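As a small illustration (ours, not from the book), the information energy of Eq. (1.7) and the correlation of Eq. (1.9) can be computed directly from a dictionary representation of an HFS. The sketch assumes, as Eq. (1.9) does, that the corresponding HFEs of A and B have the same length l_i.

```python
# Sketch: an HFS as a dict mapping each x_i to its HFE (a list of values).

def _sorted_desc(hfe):
    return sorted(hfe, reverse=True)

def information_energy(A):
    """E_HFS(A) of Eq. (1.7)."""
    total = 0.0
    for hfe in A.values():
        vals = _sorted_desc(hfe)
        total += sum(v * v for v in vals) / len(vals)
    return total

def hfs_correlation(A, B):
    """C_HFS(A, B) of Eq. (1.9); assumes a shared universe and equal HFE lengths."""
    total = 0.0
    for x in A:
        a, b = _sorted_desc(A[x]), _sorted_desc(B[x])
        total += sum(ai * bi for ai, bi in zip(a, b)) / len(a)
    return total

# The HFS of Example 1.1.1.
A = {"x1": [0.2, 0.4, 0.5], "x2": [0.3, 0.4], "x3": [0.3, 0.2, 0.5, 0.6]}
print(round(information_energy(A), 4))
print(round(hfs_correlation(A, A), 4))  # equals the information energy of A
```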
1.1.3.2 Aggregating and Score Operators

We start this section reviewing some aggregation functions for hesitant fuzzy sets. They are functions that, given n hesitant fuzzy sets, build another hesitant fuzzy set.

Definition 1.1.5 ([27]). Let h_i be an HF for i = 1, ..., n. A Hesitant Fuzzy Weighted Averaging (HFWA) operator is a mapping H^n → H defined as

HFWA(h_1, \ldots, h_n) = \bigoplus_{i=1}^{n} (w_i h_i) = \bigcup_{\gamma_1 \in h_1, \ldots, \gamma_n \in h_n} \left\{ 1 - \prod_{i=1}^{n} (1 - \gamma_i)^{w_i} \right\},   (1.10)

where w = (w_1, ..., w_n)^T is a weighting vector for the h_j, j = 1, 2, ..., n. That is, w_i ∈ [0, 1] and \sum_{i=1}^{n} w_i = 1. In particular, if w = (1/n, ..., 1/n)^T, then the HFWA operator is reduced to the Hesitant Fuzzy Averaging (HFA) operator given by Eq. (1.11):

HFA(h_1, \ldots, h_n) = \bigoplus_{i=1}^{n} \left( \frac{1}{n} h_i \right) = \bigcup_{\gamma_1 \in h_1, \ldots, \gamma_n \in h_n} \left\{ 1 - \prod_{i=1}^{n} (1 - \gamma_i)^{1/n} \right\}.   (1.11)
Example 1.1.2 Suppose that h_1 = {0.2, 0.6} and h_2 = {0.35, 0.5, 0.77} are two HFEs. Then HFA(h_1, h_2) can be calculated as

HFA(h_1, h_2) = \bigoplus_{i=1}^{2} \left( \frac{1}{2} h_i \right) = \bigcup_{\gamma_1 \in h_1, \gamma_2 \in h_2} \left\{ 1 - \prod_{i=1}^{2} (1 - \gamma_i)^{1/2} \right\}
= \{ 1 − [(1 − 0.2)^{1/2} (1 − 0.35)^{1/2}], 1 − [(1 − 0.2)^{1/2} (1 − 0.5)^{1/2}], 1 − [(1 − 0.2)^{1/2} (1 − 0.77)^{1/2}], 1 − [(1 − 0.6)^{1/2} (1 − 0.35)^{1/2}], 1 − [(1 − 0.6)^{1/2} (1 − 0.5)^{1/2}], 1 − [(1 − 0.6)^{1/2} (1 − 0.77)^{1/2}] \}
= \{0.27, 0.36, 0.57, 0.49, 0.45, 0.62\}.

Definition 1.1.6 ([27]). Let h_i be an HF for i = 1, ..., n, and suppose that HFWG : H^n → H is given by Eq. (1.12):

HFWG(h_1, \ldots, h_n) = \bigotimes_{i=1}^{n} h_i^{w_i} = \bigcup_{\gamma_1 \in h_1, \ldots, \gamma_n \in h_n} \left\{ \prod_{i=1}^{n} \gamma_i^{w_i} \right\}.   (1.12)
Then HFWG is called a Hesitant Fuzzy Weighted Geometric (HFWG) operator, where w = (w_1, ..., w_n)^T is a weighting vector for the h_j, j = 1, 2, ..., n. That is, w_i ∈ [0, 1] and \sum_{i=1}^{n} w_i = 1. In case w = (1/n, ..., 1/n)^T, the HFWG operator is reduced to the Hesitant Fuzzy Geometric (HFG) operator in Eq. (1.13):

HFG(h_1, \ldots, h_n) = \bigotimes_{i=1}^{n} h_i^{1/n} = \bigcup_{\gamma_1 \in h_1, \ldots, \gamma_n \in h_n} \left\{ \prod_{i=1}^{n} \gamma_i^{1/n} \right\}.   (1.13)
Definition 1.1.7 Let h_i be an HF for i = 1, ..., n, and suppose that w = (w_1, ..., w_n)^T is a weighting vector associated to the h_i, that is, w_i ∈ [0, 1] for i = 1, ..., n and \sum_{i=1}^{n} w_i = 1. A Generalized Hesitant Fuzzy Weighted Averaging (GHFWA) operator is a mapping GHFWA_λ : H^n → H defined by Eq. (1.14):

GHFWA_λ(h_1, \ldots, h_n) = \left( \bigoplus_{i=1}^{n} (w_i h_i^{\lambda}) \right)^{1/\lambda} = \bigcup_{\gamma_1 \in h_1, \gamma_2 \in h_2, \ldots, \gamma_n \in h_n} \left\{ \left( 1 - \prod_{i=1}^{n} \left( 1 - \gamma_i^{\lambda} \right)^{w_i} \right)^{1/\lambda} \right\}.   (1.14)
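The sketch below (ours, not from the book) enumerates the value combinations of Eqs. (1.10) and (1.12) with itertools.product. With equal weights it gives HFA and HFG; applied to the HFEs of Example 1.1.2 it computes HFA(h_1, h_2), although the two-decimal rounding here may not match the figures printed in the example exactly.

```python
from itertools import product

def hfwa(hfes, weights):
    """Hesitant Fuzzy Weighted Averaging of Eq. (1.10)."""
    out = set()
    for combo in product(*hfes):
        p = 1.0
        for gamma, w in zip(combo, weights):
            p *= (1.0 - gamma) ** w
        out.add(round(1.0 - p, 2))
    return sorted(out)

def hfwg(hfes, weights):
    """Hesitant Fuzzy Weighted Geometric of Eq. (1.12)."""
    out = set()
    for combo in product(*hfes):
        p = 1.0
        for gamma, w in zip(combo, weights):
            p *= gamma ** w
        out.add(round(p, 2))
    return sorted(out)

h1, h2 = [0.2, 0.6], [0.35, 0.5, 0.77]
print(hfwa([h1, h2], [0.5, 0.5]))  # HFA(h1, h2): one value per (gamma_1, gamma_2) pair
print(hfwg([h1, h2], [0.5, 0.5]))  # HFG(h1, h2)
```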
We now review some scoring functions. A scoring function assigns to each hesitant fuzzy element a single number in [0, 1]. That is, scoring functions take l membership values in [0, 1] and build another one. Therefore, in general, any aggregation function can be used for this purpose.

Definition 1.1.8 ([28]). Let h = {h_1, h_2, ..., h_l} be a hesitant fuzzy element, and let σ be a permutation of {1, ..., l} such that h_σ(i) ≥ h_σ(i+1) for 1 ≤ i < l (i.e., h_σ(i) is the i-th largest element of h). The functions in Eqs. (1.15), (1.16) and (1.17) are score functions for HFEs.

• The arithmetic mean score function:

S_{GM}(h) = \frac{1}{l} \sum_{i=1}^{l} h_i.   (1.15)

• The minimum score function:

S_{Min}(h) = \min_{i=1}^{l} h_i = h_{\sigma(l)}.   (1.16)

• The maximum score function:

S_{Max}(h) = \max_{i=1}^{l} h_i = h_{\sigma(1)}.   (1.17)

• The OWA-based score function. Given a fuzzy quantifier Q (i.e., a monotonic non-decreasing function Q such that Q(0) = 0 and Q(1) = 1):

S_Q(h) = \sum_{i=1}^{l} \left( Q(i/l) - Q((i-1)/l) \right) h_{\sigma(i)}.

Score functions permit us to summarize the values of a hesitant fuzzy set, and also to build a fuzzy set from a hesitant fuzzy set. Given a score function S and a hesitant fuzzy set h on the reference set X, the following definition results in a standard fuzzy set: μ(x) = S(h(x)).
Score functions S define a total order on the hesitant fuzzy elements. Given S, for any hesitant fuzzy elements h_1 and h_2, we define [27]:
• if S(h_1) > S(h_2), then h_1 >_S h_2;
• if S(h_1) = S(h_2), then h_1 =_S h_2.
When S is clear from the context, we simply use h_1 > h_2 and h_1 = h_2.
Given two hesitant fuzzy elements, we may require them to have the same number of values in order to operate on them. In this case, we may use the following procedure.

Remark 1.1.1 ([27]). Let h_1 and h_2 be two hesitant fuzzy elements, and let l_{h_j} stand for the number of values in h_j. Then Steps 1 and 2 below result in two hesitant fuzzy elements of the same length.
1. All the elements in each h_j are arranged in decreasing order, and h_σ(i) is the i-th largest value in h_j.
2. In case l_{h_1} ≠ l_{h_2}, one sets l = max{l_{h_1}, l_{h_2}}. If the number of elements of h_1 is less than that of h_2, the extension of h_1 should be considered optimistically by repeating its maximum element until it has the same length as h_2. Proceed similarly if the number of elements of h_2 is less than that of h_1.
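A direct Python transcription (ours) of the four score functions follows; the quantifier q passed to the OWA-based score is any non-decreasing function with q(0) = 0 and q(1) = 1.

```python
def s_mean(h):
    """Arithmetic mean score, Eq. (1.15)."""
    return sum(h) / len(h)

def s_min(h):
    """Minimum score, Eq. (1.16)."""
    return min(h)

def s_max(h):
    """Maximum score, Eq. (1.17)."""
    return max(h)

def s_owa(h, q):
    """OWA-based score with fuzzy quantifier q."""
    l = len(h)
    desc = sorted(h, reverse=True)  # h_sigma(1) >= ... >= h_sigma(l)
    return sum((q(i / l) - q((i - 1) / l)) * desc[i - 1] for i in range(1, l + 1))

h = [0.3, 0.5, 0.8]
print(s_mean(h), s_min(h), s_max(h))
print(s_owa(h, q=lambda t: t))  # the identity quantifier recovers the arithmetic mean
```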
1.1.3.3 Examples of Hesitant Fuzzy Sets and Their Related Operators

In this section, a number of examples of HFSs and their related operators that are utilized in this chapter are given to provide a deeper understanding of these notions. It is supposed that in Examples 1.1.3, 1.1.4, 1.1.5 and 1.1.6, the universe of discourse is U = {s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8}, where, for i = 1, 2, ..., 8, s_i indicates a student.

Example 1.1.3 A fuzzy set, A, that represents the concept of a "good student" can be defined by Eq. (1.18):

A = \{(s_1, 0), (s_2, 0.1), (s_3, 0.8), (s_4, 1), (s_5, 1), (s_6, 0.8), (s_7, 0.15), (s_8, 0)\}.   (1.18)

Considering the degrees of membership, the students s_1 and s_8 are not "good students", the cases s_2, s_3, s_6 and s_7 partially belong to the concept of a "good student", and the cases s_4 and s_5 are "good students."

Example 1.1.4 A hesitant fuzzy set, B, that represents the concept of a "good student" can be given by Eq. (1.19):

B = \{(s_1, \{0, 0.1\}), (s_2, \{0.1, 0.15, 0.2\}), (s_3, \{0.7, 0.8, 0.9\}), (s_4, \{0.99, 1\}), (s_5, \{0.98, 1\}), (s_6, \{0.75, 0.8\}), (s_7, \{0.15, 0.2\}), (s_8, \{0, 0.001\})\}.   (1.19)

It should be mentioned that each hesitant fuzzy element, h(s_i), for i = 1, 2, ..., 8, is described in Table 1.1.

Example 1.1.5 Suppose that the Criteria and Experts sets are defined by Eqs. (1.20) and (1.21), respectively:

Criteria = \{cr_1, cr_2\},   (1.20)

and

Experts = \{expert_1, expert_2, expert_3\}.   (1.21)
Table 1.1 The hesitant fuzzy elements (HFEs) corresponding to Example 1.1.4

U    Expert 1   Expert 2   Expert 3   HFEs
s1   0          0          0.1        h(s1) = {0, 0.1}
s2   0.1        0.15       0.2        h(s2) = {0.1, 0.15, 0.2}
s3   0.7        0.8        0.9        h(s3) = {0.7, 0.8, 0.9}
s4   0.99       0.99       1          h(s4) = {0.99, 1}
s5   0.98       1          0.98       h(s5) = {0.98, 1}
s6   0.75       0.8        0.8        h(s6) = {0.75, 0.8}
s7   0.15       0.15       0.2        h(s7) = {0.15, 0.2}
s8   0          0.001      0          h(s8) = {0, 0.001}
Table 1.2 The required calculations to prepare the HFEs corresponding to Example 1.1.5

U    Criteria   Expert 1   Expert 2   Expert 3   HFEs
s1   cr1        0.1        0.2        0.1        h1(s1) = {0.1, 0.2}
     cr2        0          0          0.1        h2(s1) = {0, 0.1}
s2   cr1        0.1        0.15       0.2        h1(s2) = {0.1, 0.15, 0.2}
     cr2        0.7        0.8        0.9        h2(s2) = {0.7, 0.8, 0.9}
s3   cr1        0.99       0.99       1          h1(s3) = {0.99, 1}
     cr2        0.98       1          0.98       h2(s3) = {0.98, 1}
s4   cr1        0.75       0.8        0.8        h1(s4) = {0.75, 0.8}
     cr2        0.15       0.15       2          h2(s4) = {0.15, 2}
s5   cr1        0          0.2        0.1        h1(s5) = {0, 0.1, 0.2}
     cr2        0.001      0.001      0.02       h2(s5) = {0.001, 0.02}
s6   cr1        0          0.1        0.2        h1(s6) = {0, 0.1, 0.2}
     cr2        0.5        0.6        0.8        h2(s6) = {0.5, 0.6, 0.8}
s7   cr1        0.5        0.5        0.7        h1(s7) = {0.5, 0.7}
     cr2        0.2        0.8        0.8        h2(s7) = {0.2, 0.8}
s8   cr1        0.5        0.5        0.7        h1(s8) = {0.5, 0.7}
     cr2        0.2        0.3        0.7        h2(s8) = {0.2, 0.3, 0.7}
Table 1.2 shows the calculations required for preparing the HFEs. According to the HFEs, the hesitant fuzzy decision matrix (HFDM), D, holding the concept of a "good student" is computed as

D = \begin{pmatrix} h_1(s_1) & h_2(s_1) \\ h_1(s_2) & h_2(s_2) \\ h_1(s_3) & h_2(s_3) \\ h_1(s_4) & h_2(s_4) \\ h_1(s_5) & h_2(s_5) \\ h_1(s_6) & h_2(s_6) \\ h_1(s_7) & h_2(s_7) \\ h_1(s_8) & h_2(s_8) \end{pmatrix},

where the first column corresponds to cr_1 and the second column to cr_2.

Example 1.1.6 Let the hesitant fuzzy decision matrix, D, be defined as the one given in Example 1.1.5. Then, the three best students can be selected according to the steps described in the following.
• Considering Definition 1.1.6, HFG is employed to generate the hesitant fuzzy set, A_1, from the decision matrix, D.
A_1 = \{(s_1, HFG(\{0.1, 0.2\}, \{0, 0.1\})), (s_2, HFG(\{0.1, 0.15, 0.2\}, \{0.7, 0.8, 0.9\})), (s_3, HFG(\{0.99, 1\}, \{0.98, 1\})), \ldots, (s_8, HFG(\{0.5, 0.7\}, \{0.2, 0.3, 0.7\}))\}
= \{(s_1, \{0.1^{0.5} \times 0^{0.5}, 0.1^{0.5} \times 0.1^{0.5}, 0.2^{0.5} \times 0^{0.5}, 0.2^{0.5} \times 0.1^{0.5}\}),
(s_2, \{0.1^{0.5} \times 0.7^{0.5}, 0.1^{0.5} \times 0.8^{0.5}, 0.1^{0.5} \times 0.9^{0.5}, 0.15^{0.5} \times 0.7^{0.5}, 0.15^{0.5} \times 0.8^{0.5}, 0.15^{0.5} \times 0.9^{0.5}, 0.2^{0.5} \times 0.7^{0.5}, 0.2^{0.5} \times 0.8^{0.5}, 0.2^{0.5} \times 0.9^{0.5}\}), \ldots,
(s_8, \{0.5^{0.5} \times 0.2^{0.5}, 0.5^{0.5} \times 0.3^{0.5}, 0.5^{0.5} \times 0.7^{0.5}, 0.7^{0.5} \times 0.2^{0.5}, 0.7^{0.5} \times 0.3^{0.5}, 0.7^{0.5} \times 0.7^{0.5}\})\}
= \{(s_1, \{0, 0.1, 0.14\}), (s_2, \{0.26, 0.28, 0.3, 0.32, 0.34, 0.36, 0.37, 0.4, 0.42\}), (s_3, \{0.98, 0.99, 1\}), (s_4, \{0.33, 0.34, 0.38, 0.4\}), (s_5, \{0, 0.01, 0.04, 0.06\}), (s_6, \{0, 0.22, 0.24, 0.28, 0.31, 0.34, 0.4\}), (s_7, \{0.31, 0.37, 0.63, 0.74\}), (s_8, \{0.31, 0.37, 0.38, 0.45, 0.59, 0.7\})\}.
• According to Definition 1.1.8, the score function S_Max is utilized to generate the fuzzy set A_2 from the hesitant fuzzy set A_1:

A_2 = \{(s_1, 0.14), (s_2, 0.42), (s_3, 1), (s_4, 0.4), (s_5, 0.06), (s_6, 0.4), (s_7, 0.74), (s_8, 0.7)\}.

• The students are arranged according to their degrees of membership, as presented in Eq. (1.22):

s_3 > s_7 > s_8 > s_2 > s_4 = s_6 > s_1 > s_5.   (1.22)

Finally, s_3, s_7 and s_8 are selected as the three best students.
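The three steps of Example 1.1.6 can be scripted as follows (our sketch, not code from the book). One assumption is flagged: the entry 2 printed for s_4 under cr_2 in Table 1.2 is treated as 0.2 here, which is the value consistent with the set A_1 reported for s_4.

```python
from itertools import product
from math import sqrt

# Hesitant fuzzy decision matrix of Example 1.1.5: (h1 = cr1, h2 = cr2).
# Assumption: 0.2 is used for s4 under cr2 instead of the printed 2.
D = {
    "s1": ([0.1, 0.2], [0.0, 0.1]),
    "s2": ([0.1, 0.15, 0.2], [0.7, 0.8, 0.9]),
    "s3": ([0.99, 1.0], [0.98, 1.0]),
    "s4": ([0.75, 0.8], [0.15, 0.2]),
    "s5": ([0.0, 0.1, 0.2], [0.001, 0.02]),
    "s6": ([0.0, 0.1, 0.2], [0.5, 0.6, 0.8]),
    "s7": ([0.5, 0.7], [0.2, 0.8]),
    "s8": ([0.5, 0.7], [0.2, 0.3, 0.7]),
}

def hfg(h1, h2):
    """HFG of two HFEs (Eq. (1.13) with n = 2)."""
    return {sqrt(g1 * g2) for g1, g2 in product(h1, h2)}

a1 = {s: hfg(h1, h2) for s, (h1, h2) in D.items()}  # hesitant fuzzy set A1
a2 = {s: max(hfe) for s, hfe in a1.items()}          # fuzzy set A2 via S_Max
ranking = sorted(a2, key=a2.get, reverse=True)
print(ranking[:3])  # ['s3', 's7', 's8'], the three best students of Eq. (1.22)
```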
1.2 Decision Tree

In this section, the notion of Fuzzy Decision Trees (FDTs) is reviewed in brief. In this regard, we first provide a general introduction to the basics of Decision Trees (DTs) [29]. A crisp decision tree classifies data instances by sorting them down the tree from the root to some leaf nodes which provide a classification. Each internal node in the tree specifies a test of a single attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute. Each leaf is assigned to one class representing the most appropriate target value. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node.
ID3 is a popular algorithm for decision tree induction from categorical data. It constructs the trees in a top-down manner, beginning with the question of which attribute should be tested at the root node of the tree. To answer this question, each of the attributes is evaluated using a splitting criterion to determine how good it is, when considered alone, at classifying the training examples. The best attribute is selected and used as the test at the root node of the tree. A descendant of the root node is then created for each possible value of this categorical attribute, and the training examples are passed down the branch corresponding to the example's value for this attribute. The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree, until a stopping criterion is satisfied.
Information gain is a commonly used splitting criterion for decision tree induction which measures how well a given attribute separates the training examples according to their target classification. This criterion applies the entropy index, which characterizes the (im)purity of an arbitrary collection of examples. If the target attribute can take on m possible values, then the entropy of a collection S relative to this m-wise classification is defined by Eq. (1.23):

Entropy(S) = \sum_{i=1}^{m} -\frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} = \sum_{i=1}^{m} -p_i \log_2 p_i,   (1.23)

where |S_i| is the number of instances of S belonging to the class i and p_i indicates the probability corresponding to the class i. The information gain of an attribute A relative to a collection of examples S is calculated by Eq. (1.24):

InformationGain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v),   (1.24)

where Values(A) denotes the set of all possible values for the attribute A, and S_v is the subset of S for which the attribute A has the value v, i.e., S_v = \{s \in S \mid A(s) = v\}.
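A compact Python sketch (ours) of Eqs. (1.23) and (1.24) for crisp, categorical data follows; the toy dataset at the end is an arbitrary illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a collection of class labels, Eq. (1.23)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Information gain of `attribute` relative to `examples`, Eq. (1.24).
    Each example is a pair (attribute-value dict, class label)."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for v in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

data = [({"outlook": "sunny"}, "no"), ({"outlook": "sunny"}, "no"),
        ({"outlook": "overcast"}, "yes"), ({"outlook": "rain"}, "yes")]
print(round(information_gain(data, "outlook"), 3))  # 1.0 for this clean toy split
```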
1.2.1 Fuzzy Decision Tree

Fuzzy decision tree classifiers [7] combine decision trees with the approximate reasoning offered by fuzzy knowledge representation. This is to deal with language and measurement uncertainties. Fuzzy decision trees employ fuzzy linguistic terms to specify branching conditions in nodes and allow examples to follow down multiple branches simultaneously, with different satisfaction degrees ranging over [0, 1]. Before the fuzzy decision tree construction and inference procedures are defined, some notations are required [30].
• Let X be a data set. As usual, the columns of X correspond to the set of attributes. We denote this set by {A_1, A_2, ..., A_k, Y}, where A_i is the i-th attribute, k is the number of input attributes, A = {A_1, A_2, ..., A_k} denotes the set of input (original) attributes, and Y indicates the target feature.
• The set of values of the target variable (class labels) is Y ∈ {c_1, c_2, ..., c_m}, where m is the number of classes.
• In the fuzzy dataset S, the examples are denoted by

S = \{(\mathbf{X}_1, \mu_S(\mathbf{X}_1)), (\mathbf{X}_2, \mu_S(\mathbf{X}_2)), \ldots, (\mathbf{X}_n, \mu_S(\mathbf{X}_n))\},

where X_i is the i-th example, μ_S(X_i) indicates the membership degree of X_i in S, and n denotes the number of examples. X_i has been written in boldface since it is a vector containing the input attributes and the target attribute.
• The i-th example is symbolized by X_i = [x_i^(1), x_i^(2), ..., x_i^(k), y_i]^T, where x_i^(j) is the value of the j-th attribute, and y_i is the value of the target feature.
• Fuzzy terms defined on the i-th attribute, A_i, are denoted by F_i^(1), F_i^(2), ..., F_i^(r_i), where F_i^(j) is the j-th fuzzy term and r_i is the number of fuzzy terms defined on the attribute A_i.
• The membership function corresponding to the fuzzy term F_i^(j) is represented by μ_{F_i^(j)}.
• The number of examples in the fuzzy dataset S is denoted by |S| and is defined in terms of the sigma count. That is,

|S| = \sum_{i=1}^{n} \mu_S(\mathbf{X}_i).

• S_{y=c_i} indicates the set of examples of the fuzzy dataset S which belong to the class i. Note that S_{y=c_i} can be a fuzzy set.
• When S symbolizes the fuzzy dataset associated to a parent node, the fuzzy dataset associated to a child node corresponding to the fuzzy term F_i^(j) is denoted by S[F_i^(j)]. For example, consider the child nodes in Fig. 1.2. In this figure, the branching attribute is A_i and each node is labeled with the name of the fuzzy dataset. Throughout this book, the j-th child means the child node corresponding to the fuzzy term F_i^(j).
1.2.1.1 Fuzzy Decision Tree Construction
FDTs allow instances to follow down multiple branches simultaneously, with different satisfaction degrees ranged over [0, 1]. To implement these characteristics, FDTs utilize fuzzy linguistic terms to specify the branching condition of nodes. In FDTs, an instance may fall into many leaves with different satisfaction degrees (each degree taking a value in [0, 1]). The reason is that, for any parent node, the instance can satisfy the membership associated to several child nodes with non-null membership degree [30]. This fact is most advantageous as it provides more graceful behavior, especially if we need to deal with noise or incomplete information. Nevertheless, from a computational point of view, the process of inducing an FDT is slower compared with the one of inducing a crisp decision tree. This is the price paid for this strategy to induce a more accurate but still interpretable classifier.

Fig. 1.2 A typical parent node with child nodes in the fuzzy decision tree

In general, fuzzy decision tree induction has two major components: a procedure for the fuzzy decision tree construction and an inference procedure for decision-making, i.e., the class assignment for new instances. One of the FDT building procedures is a fuzzy extension to the well-known ID3 algorithm [29] called Fuzzy ID3 (FID3). Fuzzy ID3 employs predefined fuzzy linguistic terms by which the attribute values of the training data are fuzzified [31]. This method extends the information gain measure to determine the branching attribute of each node expansion. Moreover, FID3 uses fuzzy datasets in which a degree of membership is added to each example together with the crisp value of all features (including both the input attributes and the target feature). The fuzzy dataset of the child nodes contains all the examples of the parent nodes in which the branching attribute has been eliminated. Another difference occurs in the membership degrees of examples. Suppose that S is the fuzzy dataset of a parent node, A_i is the branching attribute with fuzzy terms F_i^(1), F_i^(2), ..., F_i^(r_i), and the fuzzy dataset of the child node corresponding to the fuzzy term F_i^(j) is S[F_i^(j)]. The membership degree of the h-th example, X_h = [x_h^(1), x_h^(2), ..., x_h^(k), y_h]^T, in S[F_i^(j)] is given by Eq. (1.25) [30]:

\mu_{S[F_i^{(j)}]}(\mathbf{X}_h) = \mu_S(\mathbf{X}_h) \times \mu_{F_i^{(j)}}(x_h^{(i)}),   (1.25)
where μ_S(X_h) is the membership degree of X_h in S, and μ_{F_i^(j)}(x_h^(i)) is the membership degree of x_h^(i) in the MF corresponding to the fuzzy term F_i^(j), namely μ_{F_i^(j)}. In the generalized case, the multiplication operator can be replaced with a t-norm operator.
The attribute whose Fuzzy Information Gain (FIG) is maximum (see Eq. (1.26) below) is selected by FID3 as a branching attribute. The fuzzy information gain is defined in terms of the fuzzy entropy (FE). Let us review these expressions. The FIG of the attribute A_i relative to a fuzzy dataset S is given by Eq. (1.26):

FIG(S, A_i) = FE(S) - \sum_{j=1}^{r_i} w_j \times FE(S[F_i^{(j)}]),   (1.26)

where FE(S) is the fuzzy entropy of the fuzzy dataset S, FE(S[F_i^{(j)}]) is the fuzzy entropy of the j-th child node, and w_j is the fraction of examples which belong to the j-th child node. The Fuzzy Entropy (FE) is defined by Eq. (1.27) [30]:

FE(S) = \sum_{i=1}^{m} -\frac{|S_{y=c_i}|}{|S|} \log_2 \frac{|S_{y=c_i}|}{|S|}.   (1.27)

Then, the fraction of examples w_j is calculated by Eq. (1.28), which uses the sigma count of fuzzy sets (recall Eq. (1.1)):

w_j = \frac{|S[F_i^{(j)}]|}{\sum_{k=1}^{r_i} |S[F_i^{(k)}]|}.   (1.28)
The first term in Eq. (1.26) is the fuzzy entropy of S, and the second term is the expected value of the fuzzy entropy after S is partitioned using the attribute Ai . FIG(S, Ai ) is therefore the expected reduction in the fuzzy entropy caused by the fact that the value of the attribute Ai has already been known. There are some other methods to select the branching attribute [7, 32]. Algorithm 1.1 summarizes the FDT construction procedure. As it is mentioned in lines 3 and 5 of Algorithm 1.1, a stopping criterion is utilized as an early stopping technique to construct the FDT. According to the conclusions of [30], Normalized Maximum FIG multiplied by Number of Instances (NMGNI) is applied as a stopping criterion.
Algorithm 1.1: The fuzzy decision tree induction algorithm (adapted from [30]).
1  Input: The crisp training data; the predefined membership functions on each attribute; the splitting criterion; the stopping criterion; the threshold value of the stopping criterion;
2  Output: A Fuzzy Decision Tree (FDT);
3  Generate the root node with a fuzzy dataset containing all the crisp training data in which all the membership degrees have been assigned to one;
4  for each new node N, with the fuzzy dataset S do
5      if the stopping criterion is satisfied then
6          Make N as a leaf and assign the fraction of the examples of N belonging to each class as a label of that class;
7      end
8      if the stopping criterion is not satisfied then
9          Calculate FIG for each attribute, and select the attribute A_max that maximizes FIG as a branching attribute;
10         Generate new child nodes {child_1, child_2, ..., child_{r_max}}, where child_j corresponds to the fuzzy term F_max^(j) and contains the fuzzy dataset S[F_max^(j)] with all the attributes of S except A_max. The membership degree of each example in S[F_max^(j)] is calculated using Eq. (1.25);
11     end
12 end
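The splitting criterion used in line 9 can be sketched as follows (our illustration, not the book's code). Membership degrees play the role of counts through the sigma count: the parent node is described by the degrees μ_S(X_h), and each child by the products of Eq. (1.25).

```python
from math import log2

def fuzzy_entropy(memberships, labels, classes):
    """FE of Eq. (1.27): class 'counts' are sigma counts of membership degrees."""
    total = sum(memberships)
    fe = 0.0
    for c in classes:
        sc = sum(m for m, y in zip(memberships, labels) if y == c)
        if sc > 0:
            fe -= (sc / total) * log2(sc / total)
    return fe

def fuzzy_information_gain(parent_memberships, child_memberships, labels, classes):
    """FIG of Eq. (1.26); child_memberships[j][h] holds the Eq. (1.25) products."""
    sizes = [sum(cm) for cm in child_memberships]
    total = sum(sizes)
    expected = sum((s / total) * fuzzy_entropy(cm, labels, classes)
                   for s, cm in zip(sizes, child_memberships))
    return fuzzy_entropy(parent_memberships, labels, classes) - expected

# Toy check: two examples, one per class, split almost cleanly by two fuzzy terms.
labels = ["c1", "c2"]
parent = [1.0, 1.0]
children = [[0.9, 0.1], [0.1, 0.9]]  # Eq. (1.25) products for F^(1) and F^(2)
print(round(fuzzy_information_gain(parent, children, labels, ["c1", "c2"]), 3))
```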
1.2.1.2 Inference of FDT
A classical decision tree can be converted to a set of rules [33]. One can think of each leaf as one rule: the conditions that lead to the leaf generate the conjunctive antecedent, and the classification of the examples in the leaf produces the consequence. In this case, only when the examples of every leaf have a unique classification is a consistent set of rules generated. This will take place only when the set of attributes used includes a sufficient number of features and the training data is consistent. Since in the fuzzy representation a value can have a non-zero membership in more than one fuzzy set, the inconsistency problem increases dramatically. To solve this problem, approximate reasoning methods have been developed to make appropriate inferences for a decision assignment [34].
Similar to classical decision trees, FDTs can also be converted to a set of fuzzy if-then rules. Figure 1.3 depicts a typical fuzzy decision tree. The corresponding fuzzy if-then rules for each leaf in this FDT are
• Leaf 1: if [A_1 is F_1^(1)] and [A_2 is F_2^(1)] then C_1 : 0.2 and C_2 : 0.8;
• Leaf 2: if [A_1 is F_1^(1)] and [A_2 is F_2^(2)] then C_1 : 0.6 and C_2 : 0.4;
• Leaf 3: if [A_1 is F_1^(2)] then C_1 : 0.3 and C_2 : 0.7.
In this example, C_1 and C_2 represent class one and class two, respectively. An approximate reasoning method can be applied to derive conclusions from these sets of fuzzy if-then rules and from the known facts. Succinctly, the approximate reasoning method can be divided into four steps [35]. We now discuss these steps briefly; a more detailed description of how to perform Steps 2, 3 and 4 is provided later.
18
1 Preliminaries
Fig. 1.3 A typical fuzzy decision tree for binary classification
1. Degrees of compatibility: Compare the known facts with the antecedents of fuzzy rules to find the degrees of compatibility with respect to each antecedent (1) MF. This is to apply the MF to the actual values of the attributes. E.g., μ(1) F1 (x ) in Fig. 1.4, obtaining 0.65. 2. Firing strength: Combine degrees of compatibility with respect to MFs of the antecedent in a rule to form a firing strength that indicates the degree to which the antecedent part of the rule is satisfied. In fuzzy decision trees, MFs of the antecedents connect together by means of “AND” operator. Therefore, one tnorm operator can be used for this step in the FDT. We adopt the multiplication operator among many alternatives. 3. Certainty degree of each class in each rule: The firing strength of each rule is weighted in the combination by the certainty degree of the classes attached to the leaf node. A t-norm operator (minimum, product, etc.) can be employed for this step. In this book, the multiplication operator is applied for FDT based methods. Therefore, for the above given example, each rule has two certainty degrees relating to the classes C1 and C2 . 4. Overall output: Aggregate all the certainty degrees from all fuzzy if-then rules relating to the same class. One s-norm operator should be used for aggregation. We adopt the sum from several alternatives. It is probable that the total values of certainty degrees for different classes exceed the unity, thereby they are normalized.
1.2 Decision Tree
19
Fig. 1.4 Fuzzy reasoning in fuzzy decision tree
Figure 1.4 shows the compatibility degree values of example X = (x (1) , x (2) ) for each antecedent MF of FDT given in Fig. 1.3. The values of the compatibility degrees are illustrated on the edges in Fig. 1.4. It should be highlighted that the total value of the certainty degree for each class is calculated by aggregating certainty degree values for all the rules. The aggregation is described by Eqs. (1.29) and (1.30) below. • Total Certainty Degree Value for C1 : C1 = 0.65 × 0.45 × 0.2 + 0.65 × 0.55 × 0.6 + 0.25 × 0.3 = 0.348 = μ(C1 ). (1.29) • Total Certainty Degree Value for C2 : C2 = 0.65 × 0.45 × 0.8 + 0.65 × 0.55 × 0.4 + 0.25 × 0.7 = 0.552 = μ(C2 ). (1.30) These results are then normalized so that the final values of certainty degrees add to one. In this way, as the values for C1 and C2 add to 0.9 (i.e., μ(C1 ) + μ(C2 ) = 0.9). we have that, that X belongs to class C1 with a total certainty degree of 0.348/0.9 = 0.39, and that it belongs to class C2 with the total certainty degree 0.552/0.9 = 0.61. Therefore, the inference procedure labels the example X as C2 .
20
1 Preliminaries
1.3 Dimensionality Reduction and Related Subjects Dimensionality reduction methods can be generally classified into feature extraction (FE) and feature selection (FS) approaches. Feature extraction methods map the original feature space to a lower-dimensional subspace and create new features. Feature selection methods conduct a search process to select a subset of features to build robust learning models [36]. The major function of these methods is to handle the problem of selecting a small subset of features which are necessary and sufficient to describe the target concept. Some irrelevant and/or redundant features generally exist in the learning data that make the learning process more complicated and reduce the general performance of the models learned. An irrelevant feature is the one that has no correlation with the target variable and has no effect on the target concept in any way, and a redundant feature gives no additional information about the target concept. Hence, the main idea of FS is to select a subset of features by eliminating irrelevant features and also by omitting redundant features that are strongly correlated. There are many potential advantages of feature selection such as to make data visualization and data understanding smooth, to reduce the processing and the storage requirements, to decrease the time required for both the training and the utilization processes and to improve the prediction performance. Some applications of feature selection are speech recognition, text categorization, gene selection from microarray data and face recognition [36–38]. Feature selection techniques can be classified into three categories of filter, wrapper and embedded techniques [39, 40]. The filter approach evaluates and selects feature subsets based on the general characteristics of data. The most prevalent examples of filter methods include ordinary auxiliary criteria such as correlation, mutual information, information gain, consistency, distance, and dependency. Some other simple statistical measures that employ no learning model also belong to this category. One advantage of filter methods is that they are normally fast since they use no learning algorithm. For this reason, such methods are well-suited for large datasets [41]. In contrast to filter techniques, wrapper methods involve a learning model so that the performance of such a model is used as the evaluation criterion. Wrapper algorithms search the space of all possible feature subsets to provide the best evaluation of a specific subset of features. Variety of search methods are frequently employed to obtain the optimal subset. Albeit wrapper techniques are computationally more expensive than the filter frameworks, the general performance of wrapper techniques is far better compared to that of filter methods [42]. Most of the studies combine filter and wrapper techniques. In these approaches, filters are used either to rank the features or to reduce the number of the candidate features. In particular, these hybrid methods are based on a sequential (e.g., two-step) approach. Their first step consists of applying the filter methods to reduce the number of the features considered in the second step. Using this reduced set, a wrapper method is then employed to select the desired number of features. This scheme enjoys the major advantage of filter methods that is its mechanism is model independent, as
1.3 Dimensionality Reduction and Related Subjects
21
well as having the advantages of the wrapper approach [43]. The performance of the algorithms constructed in this fashion is usually better than that of the ones developed based on filters but it is less than the performance of wrapper-based techniques. Feature selection algorithms whose mechanism is founded on the filter and embedded models may return either a subset of the selected features or the weights of all the features. According to these different types of output, such methods are divided into two categories, including feature weighting and subset selection algorithms. The algorithms that use the wrapper model usually return a feature subset [41]. A large number of filter and embedded methods have been composed of two major steps. First, they estimate the weighting score of the features, and then, they remove the features whose scores are unacceptable. These methods can be classified into univariate and multivariate methods. The univariate techniques only apply feature relevance (feature-target correlation) to compute weighting scores and they ignore feature redundancy (feature-feature correlation). For this reason, they usually result in poor classification performance. There are numerous well-known univariate filter methods, such as Kruskal-Wallis, Gini index, information gain, Bayesian logistic regression (Blogreg), Fisher score, Relief-F and Chi2 score [36]. Multivariate methods take both the feature relevance and the feature redundancy criteria into account. That is why such techniques act much better than the other frameworks. In the multivariate category, Rough Sets (RSs) [44] have been used to compute the dependency between the features (the conditional attributes) and the class labels (the decision attribute). The dependency degree calculated in this fashion can be employed as a measure of relevance to rank the features that are more relevant to (dependent on) the targets. The most popular multivariate methods are Minimal-Redundancy-MaximalRelevance criterion (MRMR), Correlation-based Feature Selection (CFS), Sparse Multinomial Logistic Regression via Bayesian Regularization (SBMLR), and Fast Correlation-Based Filter (FCBF) [41]. In the following, several important concepts such as correlation measures, similarity indices and rough sets are reviewed. These notions play a substantial role in dimensionality reduction algorithms, particularly in feature selection approaches. To be more specific, Sects. 1.3.1, 1.3.2 and 1.3.3 review the “Pearson’s correlation coefficient measure”, “correlation coefficient of hesitant fuzzy sets” and “correlationbased merit”, respectively. Next, similarity measures are considered in Sect. 1.3.4. Then, some basic concepts regarding rough sets and fuzzy-rough sets are briefly studied in Sect. 1.3.5. After that in Sect. 1.3.6, fundamental properties of weighted rough sets are discussed. Finally, the characteristics of fuzzy rough sets are presented in Sect. 1.3.7.
1.3.1 Pearson’s Correlation Coefficient Measure In statistics, the Pearson’s correlation coefficient [45] is a measure of the correlation between two variables X and Y . Its output is a value between -1 and 1, inclusive. Correlations equal to -1 or 1 correspond to data points lying exactly on a line. A value
22
1 Preliminaries
of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y rises as X increases. A value of -1 indicates that all data points lie on a line for which Y decreases as X increases. A value of 0 states that there is no linear correlation between the variables. This criterion is widely used in science to measure the strength of linear dependence between two variables. Formally, the Pearson’s correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations. That is, given by Eq. (1.31). p X,Y =
cov(X, Y ) E [(X − μ X ) (Y − μY )] = , σ X σY σ X σY
(1.31)
where cov is the covariance, σ X is the standard deviation of X , and μ X is the mean of X . It is evident that the Pearson’s correlation coefficient is symmetric which means that “correlation (X, Y ) = correlation (Y, X )”.
1.3.2 Correlation Coefficient of Hesitant Fuzzy Sets While the Pearson’s correlation coefficient is probably the most used tool to measure the relationship between two variables [46], it is not the only one that exists for this purpose. In particular, when variables incorporate considerable amount of uncertainty caused by unreliable information, a number of other measures can be adopted. Recently, Chen et al. [25] have introduced the concept of correlation coefficient for hesitant fuzzy sets. This definition has been used in tasks related to clustering for hesitant fuzzy sets. Definition 1.3.1 ([18]). Given two typical HFSs A and B, the correlation between A and B, ρHFS (A, B), is defined by Eq. (1.32). CHFS (A, B) (1.32) ρHFS (A, B) = √ √ CHFS (A, A) CHFS (B, B) " # n 1 li j=1 h Aσ( j) (x i )h Bσ( j) (x i ) i=1 li =$ " #$ " #, n n li 1 li 1 2 2 h (x ) h (x ) j=1 Aσ( j) i j=1 Bσ( j) i i=1 li i=1 li where h Aσ( j) (x) and h Bσ( j) (x) are defined as above. That is, h Aσ( j) (x) denotes the jth largest value in the set h A (x). In other words, Aσ( j) is the permutation of the membership values in h A (x) such that h Aσ( j) (x) ≥ h Aσ( j+1) (x). We can see that ρHFS satisfies the following conditions. 1. ρHFS (A, B) = ρHFS (B, A);
1.3 Dimensionality Reduction and Related Subjects
23
2. 0 ≤ ρHFS (A, B) ≤ 1; 3. ρHFS (A, B) = 1 if A = B. The first condition means that the correlation coefficient is symmetrical. The second condition implies that correlation coefficients lie in the interval [0, 1]. Finally, the third condition mean that when the two HFSs are exactly the same, the correlation coefficients reaches its maximum value. When a set of hesitant fuzzy sets are considered, A1 , A2 , . . . , Am we can define a correlation matrix of dimensions m × m considering all pairs of correlations ρ H F S (Ai , A j ) for i, j in {1, 2, . . . , m}. For example, we can build one of such matrices when we consider hesitant fuzzy clustering. Note that hesitant fuzzy clustering may produce a hesitant fuzzy set Ai for each cluster. Thus, we can compute the correlation matrix considering the correlations between all pairs of hesitant fuzzy sets. Given a correlation matrix, we can compose it with itself as the following definition shows. Definition 1.3.2 ([47]). Let C = (ρi j )m×m be a correlation matrix of size m. Then, C 2 is called a composition matrix of C. It is defined by C 2 = C ◦ C = (ρi j )m×m . Here, ρi j = max{min{ρik , ρk j }}, i, j = 1, 2, . . . , m, (1.33) k
where ρik and ρk j are the correlation coefficients between “i and k” and “k and j”, respectively. Definition 1.3.3 ([47]). Let C = (ρi j )m×m be a correlation matrix. The λ-cutting matrix is defined by Eq. (1.34). λρi j =
0 if ρi j < λ , 1 if ρi j ≥ λ
i, j = 1, 2, . . . , m,
(1.34)
where ρik and ρk j are the correlation coefficients between “i and k” and “k and j”, respectively.
1.3.3 Correlation-Based Merit A well-known definition of merit for correlation-based feature selection algorithms was introduced by Hall et al. [48]. They proposed a formula, given by Eq. (1.35), to take into account both (i) the relevance between the features and their class labels, and (ii) the notion of redundancy among the features. k × RCF . merit = % k + k(k − 1)RFF
(1.35)
24
1 Preliminaries
In this definition k indicates the number of features, and RCF corresponds to the relevance between the features and their class labels. For instance, the information gain, discussed in Sect. 1.2, can be utilized as the RCF index. Then, RFF denotes the redundancy among the features. For example, the Pearson’s correlation coefficient can be used for RFF. It can be seen in Eq. (1.35) that the merit reaches a maximum value when, at the same time, the relevance of feature-class is maximized and the redundancy among the features is minimized. Thus, this expression for merit is defined to underline those cases with both the maximum relevance and minimum redundancy.
1.3.4 Similarity Measures In this subsection, we review three similarity measures used in the next chapters. They are measures for vectors of features. Let us assume that X and Y are the two vectors of samples located in the pth and qth columns of a dataset. In data mining, these columns are usually called the pth and qth features. The first measure [49] that evaluates the similarities between the pth and qth features is called Inverse of Euclidean Distance (IED), and is defined in Eq. (1.36). IED(S) p,q =
1 1 = . Euclidean Distance(X, Y ) X − Y 2
(1.36)
Here, X and Y are two feature vectors of the same length. The absolute value of the Pearson’s correlation coefficient [50] is another similarity measure of the pth and qth features and is given by Eq. (1.37).
PC p,q
n (x − X )(y − Y ) i i=1 i , & = & n n 2 2 (x − X ) (y − Y ) i=1 i i=1 i
(1.37)
where X and Y are two feature vectors of the same size, xi and yi are the ith element of X and that of Y , respectively, and X and Y are respectively the arithmetic mean of elements of the vector X and that of the vector Y . Finally, we define the Cosine Similarity (CS) [51] that is formulated as follows: n CS p,q = cosine(X, Y ) = & n
xi yi & n 2
i=1
i=1 x i
2 i=1 yi
,
(1.38)
where X and Y are two feature vectors of the same size, and xi and yi are the i th element of X and Y , respectively.
1.3 Dimensionality Reduction and Related Subjects
25
1.3.5 Rough Set and Fuzzy-Rough Set Basic Concepts The rough set structure is defined in terms of the notion of an information system. In particular, let IS = U, A, V, f be an information system where U = {x1 , x2 , . . . , xn } is a set of instances called the universe of discourse in which xi is the i th instance, A is the set of all attributes that has two parts, A = C ∪ D, where C is the set of the conditional attributes that are simply called features and D is the decision attribute which is the collection of all classes. It should be noted that in the majority of cases, datasets have only one decision attribute. In addition, V is the domain of attributes, and Va is a subset of V consisting of all the elements of the attribute a. Furthermore, f : U × A → V is an information function which indicates the value of an instance for a specific feature. In the following, some key concepts from rough set theory including indiscernibility, lower and upper approximations, positive, negative and boundary regions and attribute dependency are presented. Definition 1.3.4 ([52]) Indiscernibility relation is a function whose duty is to find the equivalence class of an instance with respect to a specific subset of attributes, and it is formulated as IND(P) = {(x, y) ∈ U 2 |∀a ∈ P, f (x, a) = f (y, a)},
(1.39)
where P ⊆ A is a subset of the conditional attributes, and x and y are instances in the universe of discourse. Moreover, a partition of U generated by IND(P) is denoted by U/IND(P) and is defined as U/(IND(P)) = ⊗{a ∈ P : U/IND({a})},
(1.40)
A ⊗ B = {X ∩ Y : ∀X ∈ A, Y ∈ B, X ∩ Y = ∅},
(1.41)
in which
and where IND({a}) is the partition considering a single attribute a. Definition 1.3.5 ([52]) Let IS = U, A, V, f be an information system so that X ⊆ U and P ⊆ A. The indiscernible instances of the universe of discourse which exactly belong to X respecting IND(P) are called the lower approximation of X with respect to IND(P). The lower approximation is denoted by P X and is given by Eq. (1.42). In addition, the upper approximation of X respecting IND(P), symbolized by P X , is the set of indiscernible instances which probably belong to X with respect to IND(P) and is presented by Eq. (1.43). P X = {x ∈ U | [x] P ⊆ X },
(1.42)
26
1 Preliminaries
and P X = {x ∈ U | [x] P ∩ X = ∅},
(1.43)
where [x] P is the set of instances which are indiscernible from x with respect to IND(P). The ordered pair P X, P X is called the rough set of X . The accuracy of every approximation can be calculated by the lower and the upper approximations. The accuracy of an approximation is denoted by α P (X ) and is defined as α P (X ) =
|P X | |P X |
.
(1.44)
Definition 1.3.6 ([52]) Let IS = U, A, V, f be an information system and A = C ∪ D in which C is the set of conditional attributes (features) and D is the decision attribute (classes). Then, the positive, negative and boundary regions are defined, respectively, by Eqs. (1.45), (1.46) and (1.47). POS P (D) =
P X,
(1.45)
X ∈U/D
NEG P (D) = U −
P X,
(1.46)
X ∈U/D
and BND P (D) =
X ∈U/D
PX −
P X,
(1.47)
X ∈U/D
where U/D is the partition of the universe of discourse constructed by D. In fact, the positive region is a partition of the universe of discourse that is exactly classified by the equivalence classes of U/D. The boundary region is a part of U that is probably (not definitely) classified by the classes of U/D. The negative region is the set of instances that are not classified correctly with respect to P. The most important question in the rough set theory is how to find the dependency between two sets of attributes that is usually raised regarding the dependency between the decision attributes and the conditional attributes. This is the motivation behind Definition 1.3.7. Definition 1.3.7 ([52]) Let P be a subset of the conditional attributes and suppose that D is the decision attribute. The dependency between D and P, γ P (D), is calculated by Eq. (1.48). |POS P (D)| γ P (D) = . (1.48) |U |
1.3 Dimensionality Reduction and Related Subjects
27
1.3.6 Weighted Rough Set Basic Concepts In a classification problem, classes are imbalanced when there are some classes with a large number of instances and others with only a few. Classifiers usually do not perform well for this type of problems. Because of that, methods such as resampling, filtering inconsistent samples, and weighted rough sets [53, 54] have been developed to improve classification results. A weighted rough set is an extension of the rough set to deal with the class imbalance data. It applies the prior knowledge of instances which means that each sample has a weight. To be more specific, a weighed rough set is a weighted information system WIS = U, W, A, V, f , where U is a non-empty set of samples, A is the feature set and W consists of the weight of samples. Most definitions in the weighted rough set theory are the same as that definitions in the theory of classical rough sets, such as the definition of the lower and upper approximations, and that of the positive, negative and boundary regions. Nevertheless, the accuracy of approximation and the dependency degree are defined by different formulas. Definition 1.3.8 ([53]) The weighted accuracy of an approximation is αW P (X ) =
|P X |W |P X |W
,
(1.49)
where |P X |W = xi ∈P X (wi ) and |P X |W = xi ∈P X (wi ) are the weighted cardinality of the lower and the upper approximations, respectively. Definition 1.3.9 ([53]) The weighted dependency degree is γ PW (D) =
|POS P (D)|W , |U |W
(1.50)
where P is a subset of the conditional attributes, D is the decision attribute, |POS P (D)|W = xi ∈POS P (D) (wi ) and |U |W = xi ∈U (wi ).
1.3.7 Fuzzy Rough Set Basic Concepts A fuzzy rough set [52] is a powerful extension of a rough set that offers enormous potential to tackle the vagueness and uncertainty in continuous datasets. This extension uses fuzzy relations to explain both the indiscernibility and discernibility concepts. Assume that U is a non-empty universe of discourse and R is a fuzzy relation on U such that R satisfies the following properties: 1. Reflexivity: R(x, x) = 1 for all x ∈ U . 2. Symmetry: R(x, y) = R(y, x) for all x, y ∈ U . 3. Transitivity: R(x, z) ≥ min y (R(x, y), R(y, z)).
28
1 Preliminaries
The fuzzy similarity function utilized to calculate the equivalence relation is defined by Eq. (1.51). |a(x) − a(y)| , (1.51) R =1− |amax − amin | where a is an attribute, x and y are two instances and amax and amin are the maximum and minimum values in the attribute a, respectively. Definition 1.3.10 ([52]) Let U be a universe of discourse, R be a fuzzy relation on U and F(U ) be a fuzzy power set of U in which F is a fuzzy concept that is going to be approximated. Then, the fuzzy lower and upper approximations are given by Eqs. (1.52) and (1.53). R F(x) = inf max{1 − R(x, y), F(y)},
(1.52)
R F(x) = sup min{R(x, y), F(y)},
(1.53)
y∈U
and y∈U
where F(y) is the membership function of y belonging to the fuzzy set F. The ordered pair R F(x), R F(x) is called a fuzzy rough set. The fuzzy lower approximation ensures that a sample belongs to one of the classes and the fuzzy upper approximation presents the possibility of such an event. In order to detect relevant features by means of a fuzzy rough set, a function should be defined in such a way that for a given sample x, it shows that x exactly belongs to one of the classes. This leads to Def. 1.3.11. Definition 1.3.11 ([52]) Let U be a universe of discourse and suppose that F(U ) is a fuzzy power set of U . The fuzzy positive region and the fuzzy dependency degree with respect to the feature set P are formulated, respectively, by Eqs. (1.54) and (1.55). (1.54) μ P O S P (D) (x) = sup (P F(x)), F∈U/D
and ' γ P (D) =
x∈U
μPOS P (D) (x) . |U |
(1.55)
The fuzzy dependency, denoted by ' γ P (D), indicates the relevance of a subset of features. Considering that B is a subset of P, a measure of the significance of B is defined as ' γ P−B (D) γ P−B (D) ' γ P (D) − ' =1− . (1.56) ' σ P,D (B) = ' γ P (D) ' γ P (D) Note that the case ' σ P,D (B) = 0 reveals that the subset B is not important and it can be deleted from the set P.
1.4 Fuzzy Clustering
29
1.4 Fuzzy Clustering The idea of clustering is one the most significant issues in unsupervised learning. It measures the similarity shared among data and puts the similar data into the same groups. A clustering algorithm partitions a dataset into several groups or clusters so that the similarity within a group is maximized while this factor is minimized among the groups. In addition to the classical clustering in which each sample belongs to only one cluster, fuzzy clustering is another noteworthy category of clustering algorithms that allows each data point to belong to several clusters with a membership degree. In fact, the main difference between classical clustering and its fuzzy version lies in the fact that a sample may belong to more than one cluster. The collection of fuzzy clustering algorithms are one of the most important categories of methods used in unsupervised pattern recognition. Two well-known fuzzy clustering techniques are Fuzzy c-means clustering and subtractive clustering [35] which are discussed in Sects. 1.4.1 and 1.4.2, respectively. Furthermore, in Sects. 1.4.3 and 1.4.4, fuzzy partitions and I-fuzzy partitions that are two essential notions in fuzzy clustering are studied, respectively.
1.4.1 Fuzzy c-Means Clustering In the field of cluster analysis, the k-means (KM) method is a well-known clustering algorithm. It is inherently an iterative method that partitions a given dataset {x1 , . . . , xn } into K clusters, where K denotes the number of clusters and it is either known from the first or determined prior to initiation of the cluster analysis. The great merits of KM are its fast processing time, its robustness and its simplicity in implementation. Moreover, many studies indicate that its results are acceptable in case the clusters are well-separated in the datasets. A major weakness of KM is that it does not perform well when there is overlapping on the clusters. Another drawback of KM is that it is not invariant to non-linear transformations of the data. A number of solutions have been proposed to tackle the drawbacks of KM. For example, the fuzzy c-means (FCM) algorithm provides an extension of KM based on an idea suggested by Dunn [55] so that an object is a member of several clusters with various degree of membership. Partial membership degrees between 0 and 1 are assigned to objects included in the boundaries of the clusters, rather than including them by force to one of the clusters. The FCM algorithm is described by the following four steps [35]. 1. Initialize the membership matrix U = [u i j ] with random values between 0 and 1 such that the constraint in Eq. (1.57) is satisfied. K i=1
u i j = 1,
j = 1, . . . , n.
(1.57)
30
1 Preliminaries
2. Calculate K fuzzy cluster centers ci for i = 1, . . . , K according to Eq. (1.58). n j=1
ci = n
u imj x j
j=1
u imj
,
(1.58)
where m ∈ [1, ∞) denotes a weighting exponent. 3. Compute the cost function using Eq. (1.59). J (C) =
n K
x j − ci 22 u imj .
(1.59)
i=1 j=1
Stop if either the result of the cost function calculation is below a certain tolerance value or its improvement over previous iteration is below a certain threshold. 4. Compute a new U = [u i j ] according to Eq. (1.60). Then, go to step 2. ui j = K
1
di j 2/(m−1) k=1 ( dk j )
.
(1.60)
1.4.2 Subtractive Clustering The subtractive clustering approach, proposed by Chiu [56], is better than other fuzzy clustering methods in terms of stability. A clustering technique is stable when different executions of the algorithm lead to the same clusters. Most clustering algorithms are defined in terms of a stopping criteria and by a predefined set of parameters. In subtractive clustering, the number of clusters is not determined by the user, instead, it depends on the radii parameters. If one uses a fixed set of radii parameters and runs the algorithm several times, the clusters we obtain each time are always the same. The subtractive clustering method is an extension of the mountain clustering method. This latter method considers that each data point is a potential cluster center. Then, based on the density of the surrounding data points, it calculates a measure of the likelihood of this data point to be a cluster center. Using this information, the algorithm selects the data point with the highest potential to be the first cluster center and then, it eliminates all data points in the vicinity of the first cluster center (as determined by radii) in order to determine the next data cluster and its center location. This subtraction is like removing the effect of the cluster. This process is repeated until all the data points are covered by a cluster center. This clustering approach is suitable for feature selection problems (see Sect. 3.2.1). Formally, let X = {X 1 , . . . , X k , . . . , X m } be a dataset for which X k ∈ Rn and X k = {X k1 , . . . , X k j , . . . , X kn }. Here, m is the number of samples and n is the number of features in the dataset X . Each data point X i is a candidate for a cluster center, and the density measure for each data point X i is defined as
1.4 Fuzzy Clustering
31 m
−X i − X j 22 Di = exp (ra /2)2 j=1 or, equivalently,
,
(1.61)
n −1 Di = exp (xir − x jr )2 , 2 /2) (r a r =1 j=1 m
(1.62)
in which · 2 denotes the Euclidean distance and ra is a positive constant. Hence, a data point has a high density value if it has many neighboring data points, while the points whose distance to it are greater than ra have little influence on its density. Once the density of every data point, the data point with the highest density is selected as the first cluster center. Let X c1 be the first cluster center and Dc1 be its density value. Then, the density measure of each data point X i is revised by Eq. (1.63).
−X i − X c1 22 Di = Di − Dc1 exp (rb /2)2
,
(1.63)
where rb is a positive constant. Therefore, the data points near to the first cluster center can reduce significantly the density measures, thereby making the points unlikely to be selected as the next cluster center. To have some distance between cluster centers, rb is considered larger than ra . More particularly, a good choice is rb = 1.5ra . After revising the density function, the data point with the greatest density value is selected as the next cluster center. The process of revising the density function and finding the next cluster center continues until a sufficient number of clusters is attained or a condition on density changes is met.
1.4.3 Fuzzy Partitions A partition of a set defines a way to group all the elements of this set. Formally the groups, usually called parts, are disjoint and all the elements of the set are in one of the parts. This definition has been extended to fuzzy sets. Nevertheless, there are different types of generalizations, all called fuzzy partitions. The most common definition of fuzzy partition is as follows. Most fuzzy clustering methods and, in particular, fuzzy c-means [57] (FCM), and the related algorithms, construct a fuzzy partition of a given dataset of this form. Recall that fuzzy c-means is the most widely used fuzzy clustering technique. Definition 1.4.1 ([57, 58]) Let X be a reference set. A set of membership functions n μi (x) = 1. M = {μ1 , . . . , μn } on X is a fuzzy partition of X if for all x ∈ X , i=1
32
1 Preliminaries
Nevertheless, it is important to underline that not all fuzzy clustering algorithms lead to membership functions of this form. For instance, possibilistic clustering imposes no constraints on memberships to add to one.
1.4.4 I-Fuzzy Partitions Intuitionistic fuzzy sets were introduced by Atanassov in 1983 [11]. It takes into account the membership degree as well as the non-membership degree. In an ordinary fuzzy set, the non-membership degree is the complement of the membership degree, while in an intuitionistic fuzzy set, the non-membership degree is less than or equal to the complement of the membership degree due to the hesitation degree. Definition 1.4.2 ([59]) An Atanassov intuitionistic fuzzy set (AIFS) A in X is defined by A = {x, μ A (x), ν A (x) | x ∈ X }, where μ A : X → [0, 1] and ν A : X → [0, 1] satisfy Eq. (1.64). (1.64) 0 ≤ μ A (x) + ν A (x) ≤ 1. For each x ∈ X , μ A (x) and ν A (x) represent its degrees of membership and nonmembership with respect to AIFS A, respectively. Definition 1.4.3 ([58]) For each IFS A = {x, μ A (x), ν A (x) | x ∈ X }, the I-fuzzy index for x ∈ X is defined by π A (x) = 1 − μ A (x) − ν A (x). It should be noted that there is an alternative definition of I-fuzzy partitions which is provided by broadening its classic definition using AIFSs. This generalization is given in Definition 1.4.4. Definition 1.4.4 Let X be a reference set. Then, a set of AIFSs A = {A1 , . . . , Am } in which Ai = μi , πi is an I-fuzzy partition if Conditions i and ii are fulfilled. m (i) i=1 μi (x) = 1 for all x ∈ X . (ii) For all x ∈ X , there is at most one i such that νi (x) = 0. In other words, there is at most one IFS so that μ A (x) + π A (x) = 1 for all x. Condition i in Def. 1.4.4 states that μi s are required to define a standard fuzzy partition, which means that the memberships μi add to one for all x. In addition, every Ai is required to be an IFS, and as a result, πi stands for the I-fuzzy index for each element x (and for each partition element Ai ). Therefore, μi (x) + πi (x) ≤ 1.
(1.65)
Condition ii constraints Ai so that Equation (1.65) is satisfied for at most one Ai with equality. That is, for each x, either there is one Ai or there is none such that μi (x) + πi (x) = 1. Proposition 1.4.1 ([58]) I-fuzzy partitions generalize fuzzy partitions.
1.5 Linear Regression
33
1.5 Linear Regression Linear regression is one of the most popular modeling concepts in statistics and provides a starting point to develop supervised learning methods. In particular, we can consider different alternative ways of fitting the data, formalized introducing different loss functions. Consider (xi ,yi ) in which xi ∈ R D is the D-dimensional input part and yi ∈ R is the target output, where i = 1, . . . , n. For example, the most used loss function for implementing linear regression is the squared loss presented by Eq. (1.66). (1.66) L( yˆ , y) = y − yˆ 22 , where y − yˆ 2 is the 2-norm of the vector y − yˆ in which y = [y1 , . . . , yn ]T denotes the output variable and yˆ = [ yˆ1 , . . . , yˆn ]T is the regression function that is defined as D yˆi = b + wd xid . (1.67) d=1
In Eq. (1.67), b is called the bias term, w = (w1 , . . . , w D )T is a vector valued regression weight and xid denotes the d th dimension of the i th data vector xi . Now, we can revisit Eq. (1.67), and express it as yˆi = b + xiT w. In matrix form, the expression results in Eq. (1.68). (1.68) yˆ = X w + b1n , where the matrix X corresponds to the input variables and 1n is a column vector whose entries are all equal to 1. Note that we can express this formula in an alternative way. Instead of using the bias term b in an explicit way, we can consider b as an extra component in the weight vector w which means that we have the vector w = (w1 , . . . , w D , b)T . Accordingly, a 1 can be inserted in the data vectors xi as a new component as follows, xi = (xi1 , . . . , xi D , 1)T . Now, Eq. (1.67) can be rewritten as yˆi = xiT w. Hence, yˆ = X w. As a result, the squared loss function can be rewritten as L(w) = (X w − y)T (X w − y) = X w − y22 .
(1.69)
In addition, if X T X is of full-rank, then the least square solution is calculated as w = (X T X )−1 X T y. In Sects. 1.5.1, 1.5.2 and 1.5.3, regression models with sparsity are described. They are regression models that are useful when the number of instances or examples are much less than the number of parameters in the model. See e.g. [60] for details. In particular, we discuss Ridge, LASSO and Elastic Net regressions.
34
1 Preliminaries
1.5.1 Ridge Regression Least squares and the Ridge regression are classic statistical algorithms that are known for their widespread usage in analyzing multiple regression data. The Ridge regression makes a minor modification on the least squares method by considering the objective function presented in Eq. (1.70). argmin X w − y22 + αw22 , w
where α is a fixed positive constant and w22 =
n i=1
(1.70)
wi2 .
1.5.2 LASSO Regression The LASSO regression is a common approach to regression problems when ordinary least squares is applied to parameter estimation. The popularity of the LASSO regression is due to the fact that the estimation of the parameters and the procedure of variable selection can be performed simultaneously. In particular, when the vector of regression coefficients shrinks toward zero there is the possibility of setting some coefficients identically equal to zero. By employing the l1 -norm for both the fitting and penalization of the coefficients, the LASSO regression can be considered as a form of the penalized least squares. It is formalized in Eq. (1.71). argmin X w − y22 + βw1 , w
where w1 =
n i=1
(1.71)
|wi | and β ≥ 0 determines the amount of shrinkage.
1.5.3 Elastic Net Regression The Elastic Net regression is a regularized linear regression approach that connects the Ridge and LASSO regressions. The main objective of the Elastic Net method is to further enhance predictions based on the ordinary least squares. In this respect, it identifies important variables by shrinking the estimates of some parameters toward zero. The objective function of the Elastic Net regression is formulated as argmin X w − y22 + λ + λ αw22 + (1 − α)w1 , w
where λ ≥ 0 is a regularization parameter.
(1.72)
References
35
References 1. Zadeh, Lotfi A. 1965. Fuzzy sets. Information and Control 8 (3): 338–353. 2. Medasani, S., J. Kim, and R. Krishnapuram. 1998. An overview of membership function generation techniques for pattern recognition. International Journal of approximate reasoning 19 (3–4): 391–417. 3. Chi, Z., H. Yan, and T. Pham. 1996. Fuzzy algorithms: With applications to image processing and pattern recognition, vol. 10. Singapore: World Scientific. 4. Ishibuchi, Hisao, Ken Nozaki, and Hideo Tanaka. 1993. Efficient fuzzy partition of pattern space for classification problems. Fuzzy Sets and Systems 59 (3): 295–304. 5. Dubois, Didier, and Henri Prade. 1983. Unfair coins and necessity measures: Towards a possibilistic interpretation of histograms. Fuzzy sets and systems 10 (1–3): 15–20. 6. Keller, J.M., M.R. Gray, and J.A. Givens. 1985. A fuzzy K-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics 15 (4): 580–585. 7. Yuan, Y., and M.J. Shaw. 1995. Induction of fuzzy decision trees. Fuzzy Sets and Systems 69 (2): 125–139. 8. Pedrycz, Witold. 2001. Fuzzy equalization in the construction of fuzzy sets. Fuzzy Sets and Systems 119 (2): 329–335. 9. Vicenç Torra. On fuzzy c -means and membership based clustering. In Ignacio Rojas, Gonzalo Joya Caparrós, and Andreu Català, editors, Advances in Computational Intelligence - 13th International Work-Conference on Artificial Neural Networks, IWANN 2015, Palma de Mallorca, Spain, June 10-12, 2015. Proceedings, Part I, volume 9094 of Lecture Notes in Computer Science, pages 597–607. Springer, 2015. 10. Runkler, Thomas A., and James C. Bezdek. 1999. Alternating cluster estimation: a new tool for clustering and function approximation. IEEE Trans. Fuzzy Syst. 7 (4): 377–393. 11. Atanassov, Krassimir T. 1986. Intuitionistic fuzzy sets. Fuzzy Sets and Systems 20 (1): 87–96. 12. Mendel, J.M. 2017. Type-2 fuzzy sets. In Uncertain rule-based fuzzy systems, 259–306. Berlin: Springer. 13. Humberto Bustince, Francisco Herrera, and Javier Montero. Fuzzy sets and their extensions: Representation, Aggregation and Models: Intelligent systems from decision-making to data mining, Web Intelligence and Computer Vision, volume 220. Springer, 2007. 14. Torra, Vicenç. 2010. Hesitant fuzzy sets. International Journal of Intelligent Systems 25 (6): 529–539. 15. Rosa M Rodríguez, B Bedregal, Humberto Bustince, YC Dong, Bahram Farhadinia, Cengiz Kahraman, L Martínez, Vicenç Torra, YJ Xu, ZS Xu, et al. A position and perspective analysis of hesitant fuzzy sets on information fusion in decision-making. towards high quality progress. Information Fusion, 29:89–97, 2016. 16. Zhu, Bin, Xu. Zeshui, and Meimei Xia. 2012. Dual hesitant fuzzy sets. Journal of Applied Mathematics 1–13: 2012. 17. Qian, Gang, Hai Wang, and Xiangqian Feng. 2013. Generalized hesitant fuzzy sets and their application in decision support system. Knowledge-Based Systems 37: 357–365. 18. Rosa M Rodríguez, Luis Martínez, Vicenç Torra, ZS Xu, and Francisco Herrera. Hesitant fuzzy sets: state of the art and future directions. International Journal of Intelligent Systems, 29(6):495–524, 2014. 19. Sardari, Sahar, Mahdi Eftekhari, and Fatemeh Afsari. 2017. Hesitant fuzzy decision tree approach for highly imbalanced data classification. Applied Soft Computing 61: 727–741. 20. Ebrahimpour, M.K., and M. Eftekhari. 2017. Ensemble of feature selection methods: A hesitant fuzzy sets approach. Applied Soft Computing 50: 300–312. 21. Mohammad Kazem Ebrahimpour and Mahdi Eftekhari. 2018. 
Distributed feature selection: A hesitant fuzzy correlation concept for microarray high-dimensional datasets. Chemometrics and Intelligent Laboratory Systems 173: 51–64. 22. Mohtashami, Mohammad, and Mahdi Eftekhari. 2019. A hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts. Iranian Journal of Fuzzy Systems 16 (2): 165–182.
36
1 Preliminaries
23. Aliahmadipour, Laya, Vicenç Torra, Esfandiar Eslami, and Mahdi Eftekhari. 2016. A definition for hesitant fuzzy partitions. International Journal of Computational Intelligence Systems 9 (3): 497–505. 24. Xu, Z. 2014. Hesitant fuzzy sets theory, vol. 314. Berlin: Springer. 25. Chen, Na., Xu. Zeshui, and Meimei Xia. 2013. Correlation coefficients of hesitant fuzzy sets and their applications to clustering analysis. Applied Mathematical Modelling 37 (4): 2197–2211. 26. Zeshui, Xu., and Meimei Xia. 2011. Distance and similarity measures for hesitant fuzzy sets. Information Sciences 181 (11): 2128–2138. 27. Xia, Meimei, and Xu. Zeshui. 2011. Hesitant fuzzy information aggregation in decisionmaking. International Journal of Approximate Reasoning 52 (3): 395–407. 28. Farhadinia, B., and Z. Xu. 2019. Information measures for hesitant fuzzy sets and their extensions (Uncertainty and operations research). Singapore: Springer. 29. Mitchell, T.M. 1997. Machine learning. New York: McGraw-Hill. 30. Zeinalkhani, Mohsen, and Mahdi Eftekhari. 2014. Comparing different stopping criteria for fuzzy decision tree induction through idfid3. Iranian Journal of Fuzzy Systems 11 (1): 27–48. 31. M Umanol, Hirotaka Okamoto, Itsuo Hatono, HIROYUKI Tamura, Fumio Kawachi, Sukehisa Umedzu, and Junichi Kinoshita. Fuzzy decision trees by fuzzy id3 algorithm and its application to diagnosis systems. In The IEEE 3rd International Fuzzy Systems Conference, pages 2113– 2118. IEEE, 1994. 32. Xiaomeng Wang and Christian Borgelt. Information measures in fuzzy decision trees. In IEEE International Conference on Fuzzy Systems (IEEE Cat. No. 04CH37542), volume 1, pages 85–90. IEEE, 2004. 33. J Ross Quinlan. C4. 5: Programs for machine learning. Elsevier, 2014. 34. Cezary Z Janikow. Fuzzy decision trees: issues and methods. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 28(1):1–14, 1998. 35. Jyh-Shing Roger Jang, Chuen-Tsai Sun, and Eiji Mizutani. Neuro-fuzzy and soft computing-a computational approach to learning and machine intelligence. IEEE Transactions on automatic control, 42(10):1482–1484, 1997. 36. Pradip Dhal and Chandrashekhar Azad. A comprehensive survey on feature selection in the various fields of machine learning. Applied Intelligence, pages 1–39, 2021. 37. Saberi-Movahed, Farid, Mahdi Eftekhari, and Mohammad Mohtashami. 2020. Supervised feature selection by constituting a basis for the original space of features and matrix factorization. International Journal of Machine Learning and Cybernetics 11 (7): 1405–1421. 38. Mahdi Eftekhari, Farid Saberi-Movahed, and Adel Mehrpooya. Supervised feature selection via information gain, maximum projection and minimum redundancy. In SLAA10 Seminar Linear Algebra and Its Application (pp. 29-35), 2020. 39. Adel Mehrpooya, Farid Saberi-Movahed, Najmeh Azizizadeh, Mohammad Rezaei-Ravari, Farshad Saberi-Movahed, Mahdi Eftekhari, and Iman Tavassoly. High dimensionality reduction by matrix factorization for systems pharmacology. bioRxiv, 2021. 40. Farshad Saberi-Movahed, Mahyar Mohammadifard, Adel Mehrpooya, Mohammad RezaeiRavari, Kamal Berahmand, Mehrdad Rostami, Saeed Karami, Mohammad Najafzadeh, Davood Hajinezhad, Mina Jamshidi, et al. Decoding clinical biomarker space of COVID-19: Exploring matrix factorization-based feature selection methods. medRxiv, 2021. 41. Chandrashekar, Girish, and Ferat Sahin. 2014. A survey on feature selection methods. Computers & Electrical Engineering 40 (1): 16–28. 42. 
Verónica Bolón-Canedo, Noelia Sánchez-Maroño, Amparo Alonso-Betanzos, José Manuel Benítez, and Francisco Herrera. A review of microarray datasets and applied feature selection methods. Information Sciences, 282:111–135, 2014. 43. Hsu, Hui-Huang., Cheng-Wei. Hsieh, and Lu. Ming-Da. 2011. Hybrid feature selection by combining filters and wrappers. Expert Systems with Applications 38 (7): 8144–8150. 44. Javad Rahimipour Anaraki and Mahdi Eftekhari. Rough set based feature selection: A review. In The 5th Conference on Information and Knowledge Technology, pages 301–306, 2013. 45. Philip Sedgwick. Pearson’s correlation coefficient. BMJ: British Medical Journal, 345:e4483, 2012.
References
37
46. Na, Lu., and Lipin Liang. 2017. Correlation coefficients of extended hesitant fuzzy sets and their applications to decision-making. Symmetry 9 (4): 47. 47. Bolón-Canedo, Verónica, Noelia Sánchez-Maroño, and Amparo Alonso-Betanzos. 2014. Data classification using an ensemble of filters. Neurocomputing 135: 13–20. 48. Mark Andrew Hall. Correlation-based feature selection for machine learning. PhD thesis, University of Waikato Hamilton, 1999. 49. Daniela M Witten and Robert Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010. 50. Deng Cai, Chiyuan Zhang, and Xiaofei He. Unsupervised feature selection for multi-cluster data. In The 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 333–342, 2010. 51. Ye, Jun. 2011. Cosine similarity measures for intuitionistic fuzzy sets and their applications. Mathematical and computer modelling 53 (1–2): 91–97. 52. Jensen, Richard, and Qiang Shen. 2008. New approaches to fuzzy rough feature selection. IEEE Transactions on fuzzy systems 17 (4): 824–838. 53. Liu, Jinfu, Hu. Qinghua, and Yu. Daren. 2008. A comparative study on rough set based class imbalance learning. Knowledge-Based Systems 21 (8): 753–763. 54. Liu, Jinfu, Hu. Qinghua, and Yu. Daren. 2008. A weighted rough set based method developed for class imbalance learning. Information Sciences 178 (4): 1235–1256. 55. Joseph C Dunn. A fuzzy relative of the isodata process and its use in detecting compact wellseparated clusters. Journal of Cybernetics, 3(3):32–57, 1973. 56. Stephen L Chiu. Fuzzy model identification based on cluster estimation. Journal of Intelligent & fuzzy systems, 2(3):267–278, 1994. 57. James C Bezdek. Pattern recognition with fuzzy objective function algorithms. Springer Science & Business Media, 2013. 58. Torra, Vicenç, and Sadaaki Miyamoto. 2011. A definition for i-fuzzy partitions. Soft Computing 15 (2): 363–369. 59. Krassimir T Atanassov. On intuitionistic fuzzy sets theory. Springer, 2012. 60. Robert Tibshirani, Martin Wainwright, and Trevor Hastie. Statistical learning with sparsity: the LASSO and generalizations. Chapman and Hall/CRC, 2015.
Part I
Unsupervised Learning
Chapter 2
A Definition for Hesitant Fuzzy Partitions
2.1 Introduction Most clustering algorithms are sensitive to the selection of initial parameters. For instance, the results of fuzzy c-means clustering change when we use different kernels and different selection methods for the initial cluster centers. The selection of a clustering algorithm depends on two issues, the type of data and the particular purpose of the study [1]. Thus, for a dataset about which there is no prior knowledge, one encounters difficulties in selecting the proper clustering algorithm, the appropriate kernel, and the well-suited initial cluster centers. In order not to miss the proper clusters, one can consider the application of different clustering algorithms. In this chapter, we discuss how hesitant fuzzy sets can be applied to represent more than one fuzzy clustering result. The idea behind this approach is to avoid losing relevant information by means of using two or more fuzzy clustering algorithms at the same time. To be more specific, a set of fuzzy clustering algorithms are used whose initial parameters and executions are different. Then, the results are modeled by a hesitant fuzzy partition (H-fuzzy partition). Hesitant fuzzy partitions (H-fuzzy partitions) are defined to provide a framework to analyze the results of standard fuzzy clustering methods. In particular, for studying the results of fuzzy c-means and intuitionistic fuzzy c-means. It is important to underline that new sets of cluster centers and membership values are also defined, these definitions are practical in various cluster validity indices. More particularly, a method is introduced to construct H-fuzzy partitions out of a set of fuzzy clusters obtained from several executions of fuzzy clustering algorithms with various initialization of their parameters. The main purpose of the current chapter is to describe how to find a global optimal solution via the study of some local optimal solutions and to provide the user with a structure to evaluate their problem applying different cluster validity indices by offering them the possibility of using various reliable membership values and cluster centers.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. Eftekhari et al., How Fuzzy Concepts Contribute to Machine Learning, Studies in Fuzziness and Soft Computing 416, https://doi.org/10.1007/978-3-030-94066-9_2
41
42
2 A Definition for Hesitant Fuzzy Partitions
The reminder of this chapter is organized as follows. Section 2.2 offers a definition of an H-fuzzy partition and provides an example to help in understanding this concept. Section 2.3 provides a discussion. The chapter finishes in Sect. 2.4 with an exercise.
2.2 H-Fuzzy Partition In this section, hesitant fuzzy partitions are introduced. This notion is proposed for typical HFSs, in case the value of membership degrees is finite. Definition 2.2.1 Let X = {x1 , . . . , xn } be a reference set. Suppose that H ∗ = {x, hˆ j | j = 1, 2, . . . , m} is an HFS on X , where m is the number of clusters and the sets hˆ j = {μkj | k = 1, 2, . . . , κ j } are hesitant fuzzy elements. That is, hˆ j is a finite set such that hˆ j ⊆ [0, 1] and κ j is the number of membership degrees in hˆ j which means that the cardinality of hˆ j is κ j . In general, κ j can be any arbitrary number. Nevertheless, for the sake of simplicity, we usually suppose that κ j = κ for all j. H ∗ is called a Hesitant fuzzy partition (H-fuzzy partition) if m 1 m j=1
κ j k=1
μkj (x)
κj
≤ 1 and 0 ≤ μkj (x) ≤ 1 for all x ∈ X.
(2.1)
A more general case is the one in which the set hˆ j is infinite. This is a generalization of the former definition. The case that the membership is an AIFS was discussed in Definition 1.4.4. Definition 2.2.2 Let X = {x1 , . . . , xn } be a reference set. A set of HFE H = {hˆ 1 , . . . , hˆ m } in which hˆ j is an infinite set is called a hesitant fuzzy partition if Eq. (2.2) is held. m 1 0 yμ j (x)(y)dy ≤ 1 for all x ∈ X, (2.2) 1 j=1 m 0 μ j (x)(y)dy where μ j (x) is the characteristic function of the set hˆ j (x). Notice that Inequality (2.2) is a reformulation of Eq. (2.1). We underline that Definition 2.2.2 is valid not only for hesitant fuzzy sets, but also for type-2 fuzzy sets since μ j (x) can be a fuzzy set. It should be pointed out that if hˆ j (x) is a single value, then the typical fuzzy set is of type-1, in case hˆ j (x) is a finite set, the typical fuzzy set is a hesitant one, and when hˆ j (x) is a fuzzy set, the case of study is a fuzzy set of type-2. In the last case, the membership degree of x with respect to the cluster j is given by the membership function μ j (x)(y) for y ∈ [0, 1] instead of μ j (x) which is single value. Proposition 2.2.1 The H-fuzzy partitions generalize the I-fuzzy partitions.
2.2 H-Fuzzy Partition
43
Proof In order to prove Proposition 2.2.1, it is required to show that every I-fuzzy partition is also a H-fuzzy partition. That is, the conditions given in Definition 1.4.4 imply the conditions presented in Definition 2.2.1 in case of typical HFSs, while in case of HFSs with infinite membership degrees, those conditions imply the conditions assumed in Definition 2.2.2. For this reason, it must be proven that for all x ∈ X , the conditions considered in Definition 1.4.4 guarantee that Eq. (2.1) or Eq. (2.2) is held, depending on whether we understand the I-fuzzy partition as a discrete or a continuous set. Suppose that for an arbitrary x ∈ X , A = {A1 , . . . , Am } is the set of Ifuzzy sets in which Ai = μi , πi and x satisfies Conditions (i) and (ii) of Definition 1.4.4. As mentioned in [2], the envelope of an HFS is an intuitionistic fuzzy set. Hence, for each Ai = μi , πi , we have h i+ = 1 − (1 − μi − πi ) and h i− = μi . In order to embed an IFS in an HFS, two models (2.2) and (2.2) can be considered. (I) Ai is modeled as a finite HFS h i = {μi , 1 − μi − πi }. It is proven that h i = {μi , 1 − μi − πi } satisfies the conditions of Definition 2.2.1. Note that in this case κ = 2, and as a result, for all h i we have μi + (1 − μi − πi ) ≤ 1. Thus, it is evident that m i=1 1 − πi ≤ 1. 2m (II) Ai is modeled by an interval. Since Ai = μi , πi corresponds to the interval [μi , 1 − μi − πi ], one can define h i = χ[μi ,1−μi −πi ] , where χ is a characteristic function. It is apparent that 1 0
1 0
therefore,
yχ[μi ,1−μi −πi ] (y)dy χ[μi ,1−μi −πi ] (y)dy
≤ 1,
m 1
i=0 0 yχ[μi ,1−μi −πi ] (y)dy 1 m 0 χ[μi ,1−μi −πi ] (y)dy
≤ 1.
2.2.1 Construction of H-Fuzzy Partitions In this section, we study how H-fuzzy partitions are constructed from r fuzzy clustering algorithms such as FCM and IFCM with m possible parameterizations each. We consider K different executions, one for each fuzzy clustering algorithm and parameterization. For instance, in FCM, one can determine different initial cluster center selection methods, various kernels and variety of values for its membership parameter. The application of r fuzzy clustering algorithms with K different parameters to a dataset X = {x1 , . . . , xn } results in r × K fuzzy partitions. In the following,
44
2 A Definition for Hesitant Fuzzy Partitions
we discuss how to build an H-fuzzy partition from a given set of fuzzy partitions. First, some notations are introduced. Let h i j denote the set of the membership values obtained by the ith clustering algorithm for the jth cluster. That is, for clustering algorithms i = 1, . . . , r and parameterizations j = 1, . . . , m, h i j = {μikj | 0 ≤ μikj ≤ 1, k = 1, . . . , κ} for all x ∈ X.
(2.3)
In this definition, we use κ to denote the number of different membership degrees obtained for a particular clustering algorithm (e.g., κ different executions of the same clustering algorithm). More particularly, κi is the number of the membership degrees obtained by the ith clustering algorithm. While, κi can be different for different clustering algorithms i, for the sake of simplicity, we take κi = κ. This means that we have κ different views for each clustering algorithm. Taking all this into account, we can arrange the clustering results of the r clustering algorithms as a set of HFSs, one for each cluster j = 1, . . . , m, as described in Eq. (2.4). (2.4) Hi = {h i1 , h i2 , . . . , h im } for i = 1, 2, . . . , r. Note that each HFS includes the memberships of the κ executions. In addition, a set of cluster centers are selected for each clustering algorithm that are given by Eq. (2.5). Ci j = {cikj | k = 1, 2, . . . , κ} for all x ∈ X,
(2.5)
where cikj is the cluster center vector obtained by the i th clustering algorithm for the j th cluster with the k th parametrization, i = 1, 2, . . . , r and j = 1, 2, . . . , m. A new set of cluster centers, presented by Eq. (2.6), is also defined for each clustering algorithm. κ k k=1 ci j i ∗ ∗ for i = 1, 2, . . . , r. (2.6) C = {ci j | j = 1, 2, . . . , m} and ci j = κ Definition 2.2.3 We define the H-fuzzy partition H ∗ inferred from the sets Hi and C i as the fuzzy partition that is the output of the three-step algorithm composed of Steps (i), (ii), and (iii) given below. (i) Cluster alignment. Find a correct alignment between the clusters. To this aim, compute the correlation coefficient between every pair of h il and h pj for i, p = 1, 2, . . . , r and j, l = 1, 2, . . . , m using both Eq. (1.32) and the following rule. • If ρHFS (h i L , h p J ) = max(ρHFS (h i,l , h pj )), then the L th cluster of the i th clustering algorithm corresponds to the J th cluster of the p th clustering algorithm, where p, i i= p = 1, 2, . . . , r and l, jl= j = 1, 2, . . . , m. Therefore, pairs of clusters for any pair of clustering algorithms are associated.
(ii) Determining the membership for each cluster. Use an average operator such as Eq. (1.11) or Eq. (1.14) on the membership values of the associated clusters to compute the mean membership degree for the jth cluster. That is, for each cluster, calculate $\hat{h}_j$ using Eq. (2.7):

$$\hat{h}_j = \mathrm{HFA}(h_{1j}, \ldots, h_{rj}) = \oplus_{i=1}^{r} \frac{1}{r} h_{ij}, \quad \text{for all } x \in X. \qquad (2.7)$$

(iii) Defining the H-fuzzy partition. Define a hesitant fuzzy set of clusters, H*, for each x ∈ X, utilizing Eq. (2.8):

$$H^* = \{\langle x, \hat{h}_j \rangle \mid j = 1, 2, \ldots, m\}. \qquad (2.8)$$
Note that H* is an H-fuzzy partition. It should be mentioned that there is a set of cluster centers for each cluster, which is given by Eq. (2.9):

$$C_j = \{c_{ij}^* \mid i = 1, 2, \ldots, r\}, \quad \text{where } j = 1, 2, \ldots, m. \qquad (2.9)$$
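As an illustration, the following Python sketch implements the three steps of Definition 2.2.3. It is a minimal sketch, not the book's reference implementation: the HFS correlation coefficient of Eq. (1.32) is passed in as an abstract callable `corr`, clusters of every algorithm are aligned to those of the first algorithm (a simplification of the pairwise association rule of Step (i)), and all function and variable names are ours.

```python
import numpy as np
from itertools import product

def align_clusters(H_ref, H_other, corr):
    """Step (i): for every cluster l of the reference algorithm, pick the
    cluster of the other algorithm whose membership set is most correlated."""
    return {l: int(np.argmax([corr(h_l, h) for h in H_other]))
            for l, h_l in enumerate(H_ref)}

def hfa(sets, r):
    """Step (ii): extension-principle average of Eq. (2.7); the result is kept
    as a multiset of all combinations, as in Example 2.2.1."""
    return [sum(combo) / r for combo in product(*sets)]

def build_h_fuzzy_partition(H, C, corr):
    """Step (iii): H[i][j] is the membership set h_ij of algorithm i for
    cluster j, C[i][j] the averaged centre c*_ij; returns (H*, [C_j])."""
    r, m = len(H), len(H[0])
    maps = [{j: j for j in range(m)}]                  # algorithm 0 aligned to itself
    maps += [align_clusters(H[0], H[i], corr) for i in range(1, r)]
    H_star, centres = [], []
    for j in range(m):
        assoc = [H[i][maps[i][j]] for i in range(r)]
        H_star.append(hfa(assoc, r))                   # hesitant membership hat{h}_j
        centres.append([C[i][maps[i][j]] for i in range(r)])  # C_j of Eq. (2.9)
    return H_star, centres
```

Note that `corr` must be able to compare membership sets of possibly different cardinalities, as the HFS correlation coefficient of Eq. (1.32) does.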
Definition 2.2.3 permits users to consider a variety of membership degrees and cluster centers. It also allows users to postpone the decision on which membership degrees and cluster centers are preferable to the others. The quality of the solution can then be evaluated using the most suitable validity index. In Example 2.2.1 below, the Fscore is used as the selected validity index, although other indices could be used for this purpose as well. The Fscore evaluates the quality of given clusters [3].

Proposition 2.2.2 Definition 2.2.3 builds an H-fuzzy partition.

Proof Let $H^* = \{\langle x, \hat{h}_j \rangle \mid j = 1, 2, \ldots, m\}$ be a set of hesitant fuzzy sets constructed using Definition 2.2.3, with $\hat{h}_j$ given by Eq. (2.7). In this case, as the operator ⊕ in Eq. (2.7) is based on the extension principle combining r different sets, the cardinality of $\hat{h}_j$, say $\kappa_j$, is at most $\kappa^r$. Therefore, it is evident that the inequality in Expression (2.1) corresponds, for these sets, to

$$\frac{1}{m}\sum_{j=1}^{m} \frac{\sum_{k=1}^{\kappa_j} \mu_j^k(x)}{\kappa_j} \le 1 \quad\text{and}\quad 0 \le \mu_j^k(x) \le 1 \quad \text{for all } x \in X,$$

and therefore H* is an H-fuzzy partition.

In order to give a vivid illustration of the procedure of H-fuzzy partition construction presented in Definition 2.2.3, an artificial example is provided in the following.

Example 2.2.1 Our proposed approach was tested on the IRIS dataset [4]. This dataset contains 150 records and four numerical variables. Each record is classified as an element of one of the three existing classes: Iris Setosa, Iris Versicolour, and Iris Virginica.
Fig. 2.1 The Iris_14 dataset
In our experiment, a subset of 22 records, all of which belong to the Iris Setosa class, was selected. In addition, only two variables of the dataset, specifically, the first and the fourth variables, denoted by Iris_14, were included in this experiment. The reason for this limited selection of data records and variables was to make it possible to display the data and to implement the construction procedure in a simple and straightforward manner. A graphical view of this data is given in Fig. 2.1. To cluster this dataset, two clustering algorithms, FCM with fuzziness degree m = 1.5 and FCM with fuzziness degree m = 2.3, were used. This corresponds to r = 2. Each clustering algorithm was applied for nine different settings of parameters (i.e., κ = 9). These nine settings correspond to the combination of three kernels (cosine distance, Euclidean distance and Mahalanobis distance) and three cluster center initialization techniques (the random method, the cumulative approach [5] and the subtractive clustering method [6]). Finally, three clusters are considered (m = 3). To describe the process of constructing an H-fuzzy partition, the results obtained for the point x = (5, 0.3) of cluster number 2 are shown in Table 2.1. We select the point x = (5, 0.3) because it is positioned between two cluster centers. It is worth noting that some points, such as x = (4.25, 0.1), are not interesting for this example, because such points can be correctly clustered in a straightforward manner.
Table 2.1 The clustering values for x = (5, 0.3)

H1(5, 0.3): FCM (m = 1.5)
h_11 = {0.9641, 0.9640, 0.0251, 0.0250, 0.0250, 0.0248, 0.0131, 0.0130, 0.0130},   c*_11 = (4.8037, 0.2163)
h_12 = {0.9640, 0.9637, 0.9636, 0.0248, 0.0248, 0.0130, 0.0129, 0.0129},           c*_12 = (4.9357, 0.3235)
h_13 = {0.9640, 0.9637, 0.9636, 0.9635, 0.0248, 0.0248, 0.0130, 0.0129, 0.0129},   c*_13 = (5.4721, 0.299)

H2(5, 0.3): FCM (m = 2.3)
h_21 = {0.9614, 0.9613, 0.9607, 0.0256, 0.0255, 0.0255, 0.0135, 0.0135, 0.0135},   c*_21 = (5.0193, 0.2617)
h_22 = {0.96086, 0.96083, 0.9607, 0.0256, 0.0251, 0.0251, 0.0136, 0.0134, 0.0134}, c*_22 = (5.0086, 0.2883)
h_23 = {0.960863, 0.96083, 0.96074, 0.02565, 0.02514, 0.01360, 0.01342, 0.01341},  c*_23 = (5.1838, 0.289)
As r = 2, there are two hesitant fuzzy sets H1 and H2 (i.e., the one for FCM with parameter 1.5 and the other for FCM with parameter 2.3, which play the role of two different clustering algorithms). In addition, three sets of cluster centers exist, each one corresponding to one of the clusters. Each hesitant fuzzy set is defined by at most 9 membership values corresponding to each of the 9 possible parameters (i.e., κ = 9). In the following, the process of constructing an H-fuzzy partition based on the data described in this example is implemented.

• Cluster alignment. First, using Eq. (1.32), the correlation coefficient between every pair of h_{il} and h_{pj} is computed for i, p = 1, 2 and j, l = 1, 2, 3. The corresponding correlation matrix is given in Eq. (2.10):

$$\rho = \begin{pmatrix} 1.4848 & 1.4847 & 1.4847 \\ 1.4332 & 1.4335 & 1.4334 \\ 1.4332 & 1.4335 & 1.4336 \end{pmatrix}. \qquad (2.10)$$
The matrix ρ reflects the fact that the pairs (h_11, h_21), (h_12, h_22) and (h_13, h_23) are associated.

• Determining the membership for each cluster. The average operator presented in Eq. (1.14) is applied to the membership values of the associated clusters to calculate the mean membership degree for the jth cluster. This membership degree is given in Eq. (2.11):

$$\hat{h}_j = \mathrm{HFA}(h_{1j}, h_{2j}) = \frac{1}{2}\left(h_{1j} \oplus h_{2j}\right) \quad \text{for all } x \in X \text{ and for } j = 1, 2, 3. \qquad (2.11)$$

In the current example, to include all items of information in the calculations, it is assumed that $\hat{h}_j$ is a multiset. Hence, the number of membership values is at most κ × κ = 81.
• Defining the H-fuzzy partition. An HFS of clusters, H*, is defined for each x ∈ X as

$$H^* = \{\langle x, \hat{h}_j \rangle \mid j = 1, 2, 3\}. \qquad (2.12)$$

H* is an H-fuzzy partition. Furthermore, for each of the clusters 1, 2 and 3, a set of cluster centers, given respectively by Eqs. (2.13), (2.14) and (2.15), is defined:

$$C_1 = \{(4.8037, 0.2163), (5.0193, 0.2617)\}, \qquad (2.13)$$

$$C_2 = \{(4.9357, 0.3235), (5.0086, 0.2883)\}, \qquad (2.14)$$

and

$$C_3 = \{(5.4721, 0.299), (5.1838, 0.289)\}. \qquad (2.15)$$
In order to evaluate the quality of the final clusters associated to H*, the accuracy measure Fscore is employed. Let m be the number of individual classes. The total Fscore is computed as the weighted sum of the Fscores associated with these m classes, where the size of each class is used as its weight in the summation. The Fscore can be calculated by Eq. (2.16), in which n_r and F_r denote the size and the Fscore of the rth class, respectively:

$$\mathrm{Fscore} = \sum_{r=1}^{m} \frac{n_r}{N} F_r. \qquad (2.16)$$

For the class r, F_r finds the cluster S_i that agrees best with the rth class. F_r is calculated using the formula given by Eq. (2.17), where P_{S_i} is the precision, defined as the number of objects in the cluster S_i which belong to the rth class divided by the number of objects in the cluster S_i, and R_{S_i} is the recall, defined as the number of objects in the cluster S_i that belong to the rth class divided by the number of objects in the rth class:

$$F_r = \max_{S_i} \frac{2 P_{S_i} R_{S_i}}{P_{S_i} + R_{S_i}}. \qquad (2.17)$$
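The following Python sketch computes the total Fscore of Eqs. (2.16) and (2.17). It assumes that hard assignments have already been derived from the partition (e.g., by taking the cluster of maximal membership); the function name and the use of integer label arrays are our own choices, not part of the book's description.

```python
import numpy as np

def fscore(labels, clusters):
    """Total Fscore: for every class r pick the cluster S_i with the best
    F-measure (Eq. (2.17)) and weight it by the class size n_r / N (Eq. (2.16)).
    `labels` and `clusters` are integer arrays of length N."""
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    N = len(labels)
    total = 0.0
    for r in np.unique(labels):
        in_class = labels == r
        n_r = in_class.sum()
        best = 0.0
        for s in np.unique(clusters):
            in_cluster = clusters == s
            overlap = np.logical_and(in_class, in_cluster).sum()
            if overlap == 0:
                continue
            precision = overlap / in_cluster.sum()   # P_{S_i}
            recall = overlap / n_r                   # R_{S_i}
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_r / N) * best                    # F_r weighted by n_r / N
    return total
```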
The results of Fscore for the H-fuzzy partition and for the two executions of FCM corresponding to m = 1.5 and m = 2.3 calculated with respect to Iris_14 and IRIS datasets are reported in Table 2.2.
Table 2.2 The results of the first experiment

Dataset    Fscore of FCM (m = 1.5)    H-fuzzy partition    FCM (m = 2.3)
Iris_14    0.8933                     1                    0.8933
IRIS       0.8933                     0.9264               0.8933
2.3 Discussion

In this chapter, we review a definition for H-fuzzy partitions and a method to build them from fuzzy partitions. The notion of H-fuzzy partitions explored in Sect. 2.2 also generalizes the concept of I-fuzzy partitions. It is important to underline that these partitions provide the user with membership values and cluster centers to effectively handle the task of cluster validation when various cluster validity indices are applied to new samples. In particular, the Fscore can be used to evaluate the H-fuzzy partition as we describe in Sect. 2.2. The experimental results show the advantages of considering H-fuzzy partitions and the approach to build them from fuzzy clustering algorithms. An example with two FCM approaches has been described above.
2.4 Computer Programming Exercises for Future Works

1. Implement an algorithm based on the idea of H-fuzzy partitions that permits users to use different types of clustering algorithms and create clusters for a set of given samples.
2. In this chapter, we applied fuzzy c-means variants that are of the same category. Develop an algorithm that has the ability to employ various clustering methods, including both fuzzy and non-fuzzy clustering techniques.
References

1. Han, J., J. Pei, and M. Kamber. 2011. Data mining: concepts and techniques. Amsterdam: Elsevier.
2. Torra, V. 2010. Hesitant fuzzy sets. International Journal of Intelligent Systems 25 (6): 529–539.
3. Zhao, Y., and G. Karypis. 2002. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the 11th International Conference on Information and Knowledge Management, 515–524.
4. Dua, D., and C. Graff. 2017. UCI machine learning repository. Irvine: School of Information and Computer Sciences, University of California.
5. Erisoglu, M., N. Calis, and S. Sakallioglu. 2011. A new algorithm for initial cluster centers in k-means algorithm. Pattern Recognition Letters 32 (14): 1701–1705.
6. Chiu, S.L. 1994. Fuzzy model identification based on cluster estimation. Journal of Intelligent & Fuzzy Systems 2 (3): 267–278.
Chapter 3
Unsupervised Feature Selection Method Based on Sensitivity and Correlation Concepts for Multiclass Problems
3.1 Introduction

In this chapter we describe an unsupervised filter method, called Sensitivity and Correlation based Feature Selection (SCFS). It is based on subtractive clustering and the concepts of sensitivity and Pearson's correlation. We show how this method is employed as the fitness function in a genetic algorithm (GA) in order to evaluate feature subsets. Informally, the method works as follows. First, the sensitivity index of each feature is computed by applying the subtractive clustering technique to the other features. Then, the results of the sensitivity index are used to calculate the relevance of the features. In addition, Pearson's correlation coefficient is used to determine the redundancy among the selected features. The goal of SCFS is to obtain a set with maximum relevance to the target concept and minimum redundancy among the features included. That is, to maximize the score of a selected feature subset while minimizing its size. This method, unlike many other methods that are commonplace in the literature, is appropriate for both supervised and unsupervised problems. To validate the approach, it was applied to a series of numerical datasets. The SCFS method was studied in [1]. The analysis covers not only its performance over a number of well-known benchmark datasets, but also a comparison against other similar feature selection methods. The results in [1] show that, although SCFS is an unsupervised filter technique, it is comparable to other well-known supervised methods in terms of classification accuracy and the number of selected features. The remaining sections of this chapter are organized as follows. In Sect. 3.1.1, the application of genetic algorithms to feature selection is briefly described. In Sect. 3.2, we describe the SCFS method in detail. In Sect. 3.3, experimental results and comparisons of the proposed algorithm with some standard methods are presented. The chapter finishes with a programming exercise.
3.1.1 GA for Feature Selection

In this section, we review feature selection strategies that are based on Genetic Algorithms (GAs). Genetic algorithms are generally quite effective for searching large, non-linear and poorly understood spaces. The genetic algorithm is one of the most widely used techniques for both feature and instance selection, and it can improve the performance of data mining algorithms [2]. In [3], a comparative study was carried out considering a large number of feature selection algorithms. As complementary research, [1] considered many experiments, some briefly reported in this chapter, which show that GA is more suitable than other heuristic search methods for large and medium sized problems. The output of these methods is a set of optimal or close-to-optimal feature subsets. In GA, a binary chromosome, with its length equal to the number of the original features, is used to represent the selection of features. A zero or one in the jth position of the chromosome denotes the absence or the presence of the jth feature in this particular subset. An initial population of chromosomes needs to be created. The size of this population and how it is created is in practice an important issue. Given a size (i.e., the number of chromosomes), the population is usually produced randomly. The typical genetic operators such as crossover and mutation are applied to a population (the pool of feature subsets) to create a new population (i.e., a new pool). Again, it is of primary concern to decide which types of crossover and mutation are well suited to our problem. This new feature subset pool needs to be evaluated. This can be done in two different ways.

1. The filter approach. In this case, the evaluation (fitness) of each individual (i.e., a feature subset) is calculated using an appropriate criterion function. This function evaluates the "goodness" of the feature subset so that a larger value of the function indicates a better feature subset. Examples of the criterion function include the entropy index, the correlation measure, and a combination of several other criteria.
2. The wrapper approach. In this case, to evaluate the chromosomes, first, a classifier is induced based on the feature subset. Then, the classification accuracy on the data, or an estimation of it, is obtained.

To guide the search toward minimal feature subsets, the subset size can also be incorporated into the fitness function of both filter and wrapper methods. The next step is to select the best individuals and, then, using crossover and mutation, a new generation of individuals (a new population) is built. Again, we evaluate the population and repeat the process. This iterative process is repeated until a certain stopping condition is met. Therefore, a suitable stopping criterion must be used. This is typically achieved by limiting the number of generations that can take place or by setting some threshold which must be exceeded by the fitness function. The general procedure of a genetic algorithm is as follows.

1. Create initial population
2. Evaluate population
3. While the stopping criterion is not met:
   (a) Select better individuals
   (b) Create new population
   (c) Evaluate population
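The following Python sketch turns this general procedure into code for the filter approach. It is a minimal illustration rather than a prescribed implementation: the operator choices (binary tournament selection, one-point crossover, bit-flip mutation) and the parameter values are assumptions of ours, and `fitness` stands for any criterion function that scores a binary feature-subset chromosome.

```python
import numpy as np

def ga_feature_selection(fitness, n_features, pop_size=30, generations=50,
                         crossover_rate=0.8, mutation_rate=0.02, rng=None):
    """Generic GA skeleton: binary chromosomes encode feature subsets and
    `fitness(chromosome)` returns the subset's merit (larger is better)."""
    rng = np.random.default_rng() if rng is None else rng
    pop = rng.integers(0, 2, size=(pop_size, n_features))        # 1. initial population
    scores = np.array([fitness(ind) for ind in pop])              # 2. evaluate
    for _ in range(generations):                                  # 3. stopping criterion
        # (a) select better individuals (binary tournament)
        i, j = rng.integers(0, pop_size, (2, pop_size))
        parents = pop[np.where(scores[i] >= scores[j], i, j)]
        # (b) create a new population: one-point crossover + bit-flip mutation
        children = parents.copy()
        for k in range(0, pop_size - 1, 2):
            if rng.random() < crossover_rate:
                cut = rng.integers(1, n_features)
                children[k, cut:], children[k + 1, cut:] = \
                    parents[k + 1, cut:].copy(), parents[k, cut:].copy()
        flips = rng.random(children.shape) < mutation_rate
        children = np.where(flips, 1 - children, children)
        # (c) evaluate the new population
        pop, scores = children, np.array([fitness(ind) for ind in children])
    return pop[np.argmax(scores)]                                 # best subset found
```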
3.2 Proposed Unsupervised Feature Selection Method

In this section, we first define the relevance of the features in terms of subtractive clustering, which is one of the most suitable clustering methods for feature selection problems. Then, we show how to aggregate three criteria (relevance, redundancy, and the number of selected features) in a fitness function.
3.2.1 Feature Relevance Evaluation via Sensitivity Analysis Based on Subtractive Clustering

The density measure defined in Eq. (1.62) makes it possible to quantify the relationship between the features and the sample clusters/classes. This measure is affected by the sample-feature values. We can compute the sensitivity of the ith sample with respect to the kth feature as follows:

$$S(i, k) = \sum_{p=1}^{n} \left| \frac{\partial D_i}{\partial x_{pk}} \right|, \qquad (3.1)$$

where n is the number of samples. The sensitivity S(i, k) computes the influence that the kth feature has on the density measure of the ith data point. In other words, it determines the contribution of the kth feature when clustering the ith sample. According to Eq. (1.62), $\partial D_i / \partial x_{pk}$ can be calculated by Eq. (3.2):

$$\frac{\partial D_i}{\partial x_{pk}} = \partial\!\left(\sum_{j=1}^{n} \exp\!\left(\frac{-1}{(r_a/2)^2}\sum_{r=1}^{m}(x_{ir} - x_{jr})^2\right)\right)\Big/\,\partial x_{pk}. \qquad (3.2)$$

Equation (3.2) can be reduced to the following cases:

• When i ≠ p and j = p, then

$$\frac{\partial D_i}{\partial x_{pk}} = \frac{2}{(r_a/2)^2}\,(x_{ik} - x_{pk}) \times \exp\!\left(\frac{-1}{(r_a/2)^2}\sum_{r=1}^{m}(x_{ir} - x_{pr})^2\right). \qquad (3.3)$$
• When i = p and j ≠ p, then

$$\frac{\partial D_i}{\partial x_{pk}} = \sum_{j=1}^{n} \frac{-2}{(r_a/2)^2}\,(x_{pk} - x_{jk}) \times \exp\!\left(\frac{-1}{(r_a/2)^2}\sum_{r=1}^{m}(x_{pr} - x_{jr})^2\right). \qquad (3.4)$$

• When i ≠ p and j ≠ p, or when i = p and j = p,

$$\frac{\partial D_i}{\partial x_{pk}} = 0. \qquad (3.5)$$

The sensitivity of the kth feature for the dataset X is defined by

$$S(k) = \sum_{i=1}^{n} S(i, k) = \sum_{i=1}^{n}\sum_{p=1}^{n} \left| \frac{\partial D_i}{\partial x_{pk}} \right|. \qquad (3.6)$$
In this expression, we can observe that the sensitivity of the kth feature includes the effect of this feature on the density of all samples. The sensitivity index indicates that a feature with higher sensitivity value contains more information about the clusters compared to the other features.
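The sensitivity measure can be computed directly from the closed-form derivatives above. The sketch below is a plain, unoptimized Python rendering of Eqs. (3.3)–(3.6); the density helper assumes the Gaussian form usually used in subtractive clustering (Eq. (1.62) is not reproduced in this chapter), the neighbourhood radius `r_a` is left as a user parameter, and all names are ours.

```python
import numpy as np

def density(X, r_a=1.0):
    """Subtractive-clustering density: D_i = sum_j exp(-||x_i - x_j||^2 / (r_a/2)^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (r_a / 2) ** 2).sum(axis=1)

def sensitivity(X, k, r_a=1.0):
    """S(k) of Eq. (3.6) for feature index k of the (n, m) data matrix X."""
    n = X.shape[0]
    beta = 1.0 / (r_a / 2) ** 2
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    E = np.exp(-beta * d2)                      # E[i, j] = exp(-beta * ||x_i - x_j||^2)
    diff_k = X[:, None, k] - X[None, :, k]      # diff_k[i, p] = x_ik - x_pk
    S = 0.0
    for i in range(n):
        for p in range(n):
            if i != p:
                # i != p: only the j = p term of D_i depends on x_pk   (Eq. 3.3)
                S += abs(2 * beta * diff_k[i, p] * E[i, p])
            else:
                # i == p: every j != p term of D_p depends on x_pk     (Eq. 3.4)
                S += abs((-2 * beta * (diff_k[p, :] * E[p, :])).sum())
    return S
```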
3.2.2 A General Scheme for Sensitivity and Correlation Based Feature Selection (SCFS)

There are two basic steps in any filter-based subset selection method: the generation step, in which the candidate feature subsets are generated, and the evaluation step, during which the generated candidate feature subsets are evaluated according to a merit measure. This is an iterative process. It is stopped when the required number of features is obtained or the user-specified number of iterations is reached. Feature selection methods are divided into three groups according to the search mechanism used: complete, heuristic, and random methods. In heuristic methods, some meta-heuristic search algorithms such as genetic algorithms are used to perform the two steps above. The approach uses a GA with an evaluation merit (fitness function) based on a combination of an information-based (sensitivity) criterion and a dependency-based (correlation coefficient) criterion. This combination, based on the Correlation-based Feature Selection (CFS) merit [4], is defined in Eq. (3.7):

$$\mathrm{Merit} = \frac{k\,\overline{S}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}. \qquad (3.7)$$
In this equation, $\overline{S}$ is the average value of the sensitivity of all the selected features, $\overline{r_{ff}}$ is the average value of all feature–feature correlation coefficients in the selected subset, and k is the number of selected features.
Fig. 3.1 The flowchart corresponding to the proposed feature selection method, SCFS

This merit is to be maximized. This means that we maximize the feature sensitivity $\overline{S}$ and, at the same time, minimize the feature–feature redundancy $\overline{r_{ff}}$ and the number of selected features k. The schematic diagram of SCFS is shown in Fig. 3.1. Initially, the dataset is normalized to fit in a proper range; next, we look for the best feature subset. Selected subsets must be tested using some classifier-based methods. In the subset selection step, first, the sensitivity values of all the features are calculated, and next, these values are fed into the genetic algorithm. The GA initializes the population randomly. The fitness of each generated subset is then determined in the evaluation step using the relevance, the redundancy, and the number of selected features. The GA process continues generating populations and evaluating them until the number of generations reaches 50. Finally, the selected feature subset is evaluated in the testing step.
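A possible rendering of the merit of Eq. (3.7) as a GA fitness function is sketched below. Two choices are assumptions of ours rather than details given in the text: the denominator is written with the CFS-style square root, and the feature–feature redundancy is taken as the mean absolute Pearson correlation over the selected features. The function name and signature are also ours; a chromosome can be plugged into the GA skeleton of Sect. 3.1.1 via `fitness = lambda ind: scfs_merit(ind, S, X)`.

```python
import numpy as np

def scfs_merit(subset, S, X):
    """Merit of Eq. (3.7) for a binary chromosome `subset`: S holds the
    per-feature sensitivities S(k), X is the (normalized) data matrix used
    to compute the average absolute Pearson feature-feature correlation."""
    idx = np.flatnonzero(subset)
    k = len(idx)
    if k == 0:
        return 0.0
    S_bar = S[idx].mean()                              # average sensitivity
    if k == 1:
        r_ff = 0.0
    else:
        corr = np.corrcoef(X[:, idx], rowvar=False)
        r_ff = np.abs(corr[np.triu_indices(k, 1)]).mean()   # assumed |r| average
    return k * S_bar / np.sqrt(k + k * (k - 1) * r_ff)
```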
3.3 Discussion

We reviewed in this chapter the filter method SCFS, which finds a minimum-size feature subset that preserves a high classification accuracy. The advantages of this method include the ability to find small subsets of features with good accuracy. Subtractive clustering and Pearson's correlation are used to compute the relevance and the redundancy of the features. The algorithm implements a heuristic search using a GA. The fitness value is calculated taking into account a merit function based on the CFS feature selection merit, and using the sensitivity.
This sensitivity measure is computed using the subtractive clustering technique. The analysis of sensitivity has been introduced in the literature related to the fuzzy c-means clustering algorithm; in this chapter, the subtractive clustering algorithm was applied to define an unsupervised relevance measure. Taking all these aspects into consideration, the method can be seen as a multiclass filter technique that removes redundant and irrelevant features. This method was compared in [1] with eleven common filter and embedded algorithms, whose names and characteristics are presented in Table 3.1.

Table 3.1 Properties of different methods used for the comparisons

Algorithm         Learning Algorithm   Methods Type   Univariate/Multivariate   Output
Kruskal-Wallis    supervised           filter         univariate                feature ranking
Gini index        supervised           filter         univariate                feature ranking
Information gain  supervised           filter         univariate                feature ranking
FCBF              supervised           filter         multivariate              feature subset
CFS               supervised           filter         multivariate              feature subset
Blogreg           supervised           embedded       univariate                feature subset
SBMLR             supervised           embedded       multivariate              feature subset
Fisher score      supervised           filter         univariate                feature ranking
Relief-F          supervised           filter         univariate                feature ranking
MRMR              supervised           filter         multivariate              feature subset
Chi2 score        supervised           filter         univariate                feature ranking

To compare these methods, the following datasets were used: Libras, Parkinson, Soybean, Breast cancer, Ecoli, Yeast, Page-blocks, Dermatology, Seeds, Waveform, Thyroid Disease, Diabetes, Vehicle, Iris, and Laryngeal. Most of these datasets are freely available from the University of California Irvine Machine Learning Repository (UCI) [5], and have been used in a large number of studies. According to [1], which describes the experiments, the algorithm described in this chapter performs better than the others in terms of classification performance and the number of selected features. The comparison in [1] was based on some non-parametric statistical tests. The outcome of these tests, according to [1], confirms the ability of the method in comparison to the others. Barchinezhad and Eftekhari [1] summarize the contributions of this approach as follows.

1. The approach defines a sensitivity measure that is based on subtractive clustering.
2. The approach proposes a new feature evaluation measure that is based on the sensitivity and Pearson's correlation.
3. The approach compares feature selection methods by means of non-parametric statistical tests.
3.4 Computer Programming Exercise for Future Works

1. Implement the algorithm described in this chapter and use two other fuzzy clustering methods, such as fuzzy c-means and its variants. Combine the sensitivities obtained for these different clustering methods via hesitant fuzzy approaches.
References

1. Barchinezhad, S., and M. Eftekhari. 2016. Unsupervised feature selection method based on sensitivity and correlation concepts for multiclass problems. Journal of Intelligent & Fuzzy Systems 30 (5): 2883–2895.
2. Tsai, C.-F., W. Eberle, and C.-Y. Chu. 2013. Genetic algorithms in feature and instance selection. Knowledge-Based Systems 39: 240–247.
3. Kamyab, Shima, and Mahdi Eftekhari. 2016. Feature selection using multimodal optimization techniques. Neurocomputing 171: 586–597.
4. Hall, M.A. 1999. Correlation-based feature selection for machine learning. PhD thesis, University of Waikato, Hamilton.
5. Dua, Dheeru, and Casey Graff. 2017. UCI machine learning repository. Irvine: School of Information and Computer Sciences, University of California.
Part II
Supervised Learning Classification and Regression
Chapter 4
Fuzzy Partitioning of Continuous Attributes Through Crisp Discretization
4.1 Introduction

It is evident that the performance of fuzzy rule-based classifiers is significantly influenced by the shape of the Membership Functions (MFs) that define them. There are two general methods to determine MFs: manual methods and automatic methods. In manual methods, experts determine MFs based on their own experiments. In contrast, in automatic ones, MFs are automatically generated from data. Automatic generation of MFs from data is one of the fundamental challenges in applications of fuzzy set theory. There are no guidelines or rules that can be employed to decide which membership generation technique is the most appropriate one [1]. For many real-world problems, it is not easy to generate MFs that reflect subjective perceptions about imprecise concepts. The difficulty appears when we want to evaluate the correctness of the generated MFs. This problem is even more serious when MFs are generated from data in an automatic manner, that is, without any human intervention in the process. The Fuzzy Decision Tree (FDT) induction algorithm searches among all possible fuzzy rules and selects a subset of them that leads to the construction of the most accurate and simple FDT. Membership functions defined on the domain of the attributes determine the size and the quality of the fuzzy rules available in the search space. Designing good MFs is an essential problem when developing a fuzzy decision tree classifier because it requires partitioning all continuous attributes. In this chapter, a two-step algorithm is described to generate MFs from data. It uses the accuracy of an FDT as a quality measure of the generated MFs and employs crisp discretization as an initial partitioning. In short, the process is as follows. First, the discretization algorithm divides the attribute's domain into crisp parts, and then for each part we define a membership function. We discuss four methods to define MFs on crisp partitions. The first method is based on the width of each part, the second one uses the standard deviation of the examples associated to each part, the third one
applies the coverage rate of the neighboring parts, and the last one is defined using the Partition Coverage Rate (PCR). The organization of the rest of this chapter is as follows. First, Sect. 4.2 focuses on discretization methods. Then, Sect. 4.3 describes how to build fuzzy partitions based on discretization methods and crisp partitions. This includes the Fuzzy Entropy Based Fuzzy Partitioning (FEBFP) among other methods. Section 4.4 discusses previous experimental results. The chapter finishes with a final section that includes exercises for further work.
4.2 Discretization Methods

Discretization is the process of partitioning continuous attributes to obtain categorical or discrete ones. Since a number of machine learning algorithms are designed to operate only on categorical attributes, discretization is an important step in machine learning applications in which continuous data should be handled and processed. In the literature, there are many discretization algorithms [2], which can be classified according to several dimensions. Some of them are the following ones: "splitting vs. merging", "supervised vs. unsupervised", "dynamic vs. static" and "global vs. local" (see e.g. [2] for details). Splitting discretization methods are top-down approaches that start with an empty set of cut-points and gradually divide the intervals and subintervals to obtain the discretization. In contrast, merging discretization methods are bottom-up approaches that begin with a complete list of all continuous values of an attribute as the cut-points and merge the intervals by eliminating some of the cut-points to obtain the discretization. Supervised discretization methods consider the class information associated with the examples available for the discretization. In contrast, unsupervised methods do not use this additional information. Since this information is not available, approaches such as equal width and equal frequency [2] are then used in the discretization process. Dynamic discretization methods discretize continuous attributes along with the construction of a classifier such as C4.5 [2], while static discretization methods discretize attributes before the construction of a classifier. All the dynamic and static discretization methods developed so far have been compared and reviewed in [2]. Global discretization methods have a global perspective and use all the examples. In contrast, local discretization methods, as e.g. in C4.5, only use some of the examples in each step of the process. There are a large number of splitting discretization measures in the literature, such as binning, entropy and dependency. Binning is the simplest discretization method; it creates a specified number of bins. Equal width and equal frequency [2] are examples of unsupervised discretization methods that use binning, and 1R [2] is an example of a supervised discretization method which uses binning. The equal width discretization method takes a parameter k that determines the number of intervals and generates k intervals with equal width. The equal frequency discretization method takes a parameter k and generates k intervals with equal numbers of examples. 1R divides the domain of a continuous attribute into a number of disjoint intervals, each
of which contains at least 6 examples, and adjusts the boundaries based on the class labels associated with the examples. Entropy is the most commonly used discretization measure that determines the average amount of information of each interval. The entropy of the example set S with k class labels is defined by Eq. (4.1):

$$\mathrm{Ent}(S) = -\sum_{i=1}^{k} p_i \log(p_i), \qquad (4.1)$$
where p_i is the probability of the ith class. Entropy reaches its maximum value when the probabilities of all classes are equal, and it hits its minimum value when all examples belong to the same class. ID3, Mantaras distance, and Fayyad's discretization methods [2] use the entropy for the discretization. More particularly, the ID3 discretization method divides the domain of an attribute into two sub-intervals using the cut-point that leads to the minimization of the weighted entropy of the generated sub-intervals; then, it repeats this procedure on each generated sub-interval until all the examples of each sub-interval belong to the same class. The Mantaras distance discretization method applies the Mantaras distance measure to find the splitting cut-point that minimizes the Mantaras distance. Then, it uses the Minimum Description Length Principle (MDLP) [2] to determine if more sub-parts should be generated. Fayyad's discretization method divides the intervals into sub-intervals in a way similar to ID3. Nevertheless, it uses MDLP to stop the discretization process. Dependency is another relevant tool used for discretization. It is a measure of the strength of the association between a class and a feature. Zeta, Chi-Merge, Chi2, Modified Chi2, Extended Chi2, CADD and CAIM are examples that use dependency as a splitting discretization measure [2]. More particularly, the Zeta [2] discretization measure is defined as the maximum achievable accuracy when each value of an attribute predicts a different class value. Chi-Merge [2] is a bottom-up discretization procedure that merges adjacent intervals with the least χ² value until all the adjacent intervals are considered significantly different by the χ² independence test. Chi2 [2] is an automated version of Chi-Merge in which the statistical significance level keeps changing to merge more and more adjacent intervals as long as the inconsistency criterion is satisfied. Modified Chi2 [2] is a modified version of Chi2 that addresses inaccuracy problems caused by the inherent inaccuracy in both the merging criterion and the user-defined inconsistency rate. The Extended Chi2 [2] is a modified version of the Modified Chi2 that addresses classification with a controlled degree of uncertainty and also considers the effect of variance in two merged intervals. The Class-Attribute Dependent Discretizer (CADD) method [2] has as its goal the maximization of the mutual dependence, measured by the interdependence redundancy between the discrete intervals and the class labels. The Class-Attribute Interdependence Maximization (CAIM) method [2] is a supervised discretization algorithm that maximizes the class-attribute interdependence to generate the smallest possible number of intervals.
There are other discretization methods that consider alternative splitting discretization measures. For example, the Ameva discretization method [2] maximizes a contingency coefficient based on the chi-square statistic and generates a potentially minimal number of discrete intervals. The Fixed-Frequency discretization method [2] produces intervals with k examples, in which k is a user-defined parameter. The proportional discretization method [2] is an unsupervised technique that sets the interval frequency and the interval number equally proportional to the amount of training data, to guarantee a low level of both the bias and the variance. The Khiops discretization method [2] optimizes the chi-square criterion in a global manner on the whole discretization domain to discretize the attribute's domain. The MODL method, which is based on a Bayesian approach to the discretization problem [2], introduces a space of discretization models and a prior distribution defined on this model space to define the Bayes-optimal evaluation criterion of discretization. The Hellinger-based discretization (HellingerBD) [2] measures the amount of information provided by each interval to the target attribute using the Hellinger divergence. In HellingerBD, the interval boundaries are decided so that each interval contains as equal an amount of information as possible. The Distribution-Index-Based Discretizer (DIBD) [2] employs the compound distributional index for discretization, which combines both the homogeneity degree of the attribute value distribution and the decision value distribution. The Unsupervised Correlation Preserving Discretization (UCPD) [2] is a PCA-based unsupervised algorithm that takes into account the correlation among continuous attributes as well as the interactions between the continuous and the categorical attributes. The Class-Attribute Contingency Coefficient (CACC) algorithm [2] is a static, global, incremental, supervised and top-down discretization method based on the class-attribute contingency coefficient. Finally, the Hypercube Division-based Discretization (HDD) technique [2] is an algorithm that considers the distribution of both the class and the continuous attributes, as well as the underlying correlation structure in the dataset. The aim of this procedure is to find the minimum number of hypercubes such that all the examples included in each hypercube belong to the same class.
4.3 Employing Discretization Methods for Fuzzy Partitioning

Different discretization methods use different approaches to generate cut-points and, consequently, they discretize continuous attributes in alternative ways. The definition of membership functions on the domain of an attribute can be considered as a fuzzy discretization, or a fuzzy partitioning. This is so because the boundaries of the different parts are not crisp values. That is, the membership function defines each part. This section describes how to transform the crisp partitions generated by discretization methods into fuzzy partitions.
Fig. 4.1 To employ a discretization method for fuzzy partitioning
Each crisp discretization method is converted into a fuzzy partitioning method in two steps, as described in Fig. 4.1. In the first step, the discretization method is applied to the continuous attribute's domain to find the crisp partitions. In the second step, each crisp partition generated in the first step is transformed into a fuzzy partition. This fuzzy partition is defined by membership functions. In other words, the second step defines a set of membership functions on each crisp partition generated in the previous step. More particularly, one membership function is associated with each part. This general approach can benefit from the different measures that are being used for discretization. These measures include binning, entropy, dependency, and accuracy [2]. Since there are many discretization methods in the literature [2], many options exist for the first step. Moreover, several of the discretization methods generate no cut-points for some attributes and, as a result, these attributes cannot be used later in the construction of the fuzzy decision tree. This process, which may lead to the elimination of some attributes, can be seen as an implicit feature selection process included in the fuzzy partitioning procedure. This implicit feature selection process discards those features for which no cut-points are found. This section also offers a description of alternative methods to transform a crisp partition into various kinds of MFs in the second step. These methods can be divided into two main categories: the ones that are "independent of the distribution of the examples" and the ones that are "based on the distribution of the examples". The approach of defining MFs based on partition widths is an instance of the first category. It ignores the distribution of the examples located inside each part and defines the MFs based on the width of each part. Note that the width of a part is the distance between its upper and lower cut-points. Three notable examples of the second category are the ones in which the definition of the MFs is based on the standard deviation, on the partition coverage rate and on the neighboring part coverage rate. The parameters of MFs generated according to these methods are sensitive to the distribution of the examples located inside each part or located in the neighboring parts. The standard deviation based MF definition method determines the parameters of the MFs with respect to the mean value and the standard deviation of the examples located inside a part. The definition of MFs based on the partition coverage rate defines MFs so that a certain percentage of the examples inside a part have a membership grade equal to one. Finally, the approach based on the neighboring partition coverage rate establishes MFs so that a certain percentage of the examples of the left and the right adjacent neighbors have a non-zero membership grade. Section 4.3.1 explains these methods in detail. Then, Sect. 4.3.2 presents a full description of the fuzzy entropy based fuzzy partitioning approach.
4.3.1 Defining MFs Over Crisp Partitions

Note that we need to define a membership function for each part in the partition generated by a discretization method. Each partition is defined by n cut-points, which determine n + 1 crisp parts. Therefore, we need to define n + 1 MFs for these n cut-points. We consider here four measures for this purpose: the partition width, the standard deviation, the neighbor partition coverage rate, and the partition coverage rate. Sections 4.3.1.1, 4.3.1.2, 4.3.1.3 and 4.3.1.4 discuss these methods in detail. In these subsections, each part is specified by two cut-points called the lower cut-point, denoted by c1, and the upper cut-point, denoted by c2.
4.3.1.1 Membership Functions Based on Partition Widths
In this case, membership functions are defined independently of how the examples are distributed inside each part. Let p1 and p2 denote two adjacent crisp sets determined by the three cut-points c1, c2 and c3. That is, the interval [c1, c2] denotes the part p1 and the interval [c2, c3] denotes the part p2. Suppose that MF_{p1} and MF_{p2} are the MFs defined on p1 and p2, respectively. The membership grade of the cut-point c2, which separates the two adjacent sets, should be equal to 0.5 in both MF_{p1} and MF_{p2}. In other words, for each two adjacent MFs we should have μ_{MF_{p1}}(c2) = μ_{MF_{p2}}(c2) = 0.5, where μ_{MF_{p1}}(c2) and μ_{MF_{p2}}(c2) indicate the membership grades of c2 in MF_{p1} and MF_{p2}, respectively. The rest of this subsection explains how the width-based definition of MFs determines the parameters of the triangular, trapezoidal and Gaussian MFs. The partition-width-based MF definition method defines a trapezoidal MF for the leftmost and the rightmost parts, which have only one neighbor, and it specifies a triangular MF for all the other parts. The parameters of the membership functions are computed using the expressions presented in Table 4.1. In the middle triangular MFs, the second parameter of the triangular MF is set to the center of the part, and the first and the third parameters are determined so that the membership grades of the lower and the upper cut-points are equal to 0.5. In the case of the leftmost trapezoidal MF, the membership grades in the left half of the part are equal to one, while for the rightmost trapezoidal MF, the membership grades in the right half of the part are equal to one. Figure 4.2 shows an example of the triangular MFs based on partition widths for the cut-points 10, 40, 60 and 70. In contrast to the previous definition, where the membership functions are mainly triangular, if we allow them all to be trapezoidal, we proceed according to Table 4.2. In the middle membership functions, half of the interval defined by the cut-points c1 and c2 is considered to have a membership grade equal to one. Using this fact, the parameters b and c can be determined. The other parameters of the middle MFs are specified in a way that the membership grades of the lower and the upper cut-points are equal to 0.5. Figure 4.3 illustrates an example of the trapezoidal MFs based on partition widths for the cut-points 10, 40, 60 and 70.
Table 4.1 Parameters of the triangular MFs based on partition widths

Leftmost MF (trapezoidal):  a = c1,  b = c1,  c = (c1 + c2)/2,  d = c2 + (c2 − c1)/2
Middle MFs (triangular):    a = c1 − (c2 − c1)/2,  b = (c1 + c2)/2,  c = c2 + (c2 − c1)/2
Rightmost MF (trapezoidal): a = c1 − (c2 − c1)/2,  b = (c1 + c2)/2,  c = c2,  d = c2
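The rules of Table 4.1 translate directly into code. The Python sketch below is an illustration under one assumption of ours: the sorted list of points passed in is treated as the part boundaries (including the domain limits), so n points yield n − 1 parts; the function name and return format are also ours.

```python
def triangular_partition(cut_points):
    """MF parameters of Table 4.1: a trapezoid (a, b, c, d) for the outer
    parts and a triangle (a, b, c) for the middle ones; each part [c1, c2]
    gets one MF whose grade at the cut-points is 0.5."""
    mfs = []
    n = len(cut_points)
    for i in range(n - 1):
        c1, c2 = cut_points[i], cut_points[i + 1]
        w = c2 - c1
        if i == 0:                                   # leftmost part: trapezoidal
            mfs.append(('trap', (c1, c1, (c1 + c2) / 2, c2 + w / 2)))
        elif i == n - 2:                             # rightmost part: trapezoidal
            mfs.append(('trap', (c1 - w / 2, (c1 + c2) / 2, c2, c2)))
        else:                                        # middle parts: triangular
            mfs.append(('tri', (c1 - w / 2, (c1 + c2) / 2, c2 + w / 2)))
    return mfs

# e.g. triangular_partition([10, 40, 60, 70]) gives one MF per part,
# with shapes analogous to those of Fig. 4.2
```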
Fig. 4.2 Triangular MFs based on the partition widths

Table 4.2 Parameters of the trapezoidal MFs based on partition widths

Leftmost MF:  a = c1,  b = c1,  c = c2 − (c2 − c1)/3,  d = c2 + (c2 − c1)/3
Middle MFs:   a = c1 − (c2 − c1)/4,  b = c1 + (c2 − c1)/4,  c = c2 − (c2 − c1)/4,  d = c2 + (c2 − c1)/4
Rightmost MF: a = c1 − (c2 − c1)/3,  b = c1 + (c2 − c1)/3,  c = c2,  d = c2
We consider now the definition of Gaussian membership functions (see Sect. 1.1.2). In this case, the parameters are defined according to Table 4.3. Similar to the case of triangular MFs, the definition of the leftmost and the rightmost MFs is different from that of the middle MFs: they are 2-sided Gaussian MFs (see Eq. (1.5)). In the middle MFs, the center of the Gaussian MF is placed at the center of the interval (i.e., (c1 + c2)/2). Then, the parameter σ is set so that the membership grades of the lower and the upper cut-points are equal to 0.5.
Fig. 4.3 Trapezoidal MFs based on partition widths

Table 4.3 Parameters of Gaussian MFs based on partition widths

Leftmost MF (2-sided Gaussian):  c_left = c1,  σ_left = 1,  c_right = (c1 + c2)/2,  σ_right = (c2 − c1)/(2√(ln 4))
Middle MFs (Gaussian):           c = (c1 + c2)/2,  σ = (c2 − c1)/(2√(ln 4))
Rightmost MF (2-sided Gaussian): c_left = (c1 + c2)/2,  σ_left = (c2 − c1)/(2√(ln 4)),  c_right = c2,  σ_right = 1
In the leftmost and rightmost membership functions, we have an interval with membership value equal to one. This interval is defined by c_left and c_right, and it is defined to reach the end of the partition. Because of that, the parameter σ_left in the leftmost MF and the parameter σ_right in the rightmost MF can take any arbitrary value; note that the left side of the leftmost MF and the right side of the rightmost MF fall outside the attribute's domain. In Table 4.3, they are set to one. Figure 4.4 shows an example of Gaussian MFs defined using partition widths for the cut-points 10, 40, 60 and 70. If the widths of consecutive parts are very different, the MFs based on partition widths may be inappropriate, because one membership function can include in its support several complete neighboring parts. The disadvantage of this method is that it determines the parameters of a membership function completely independently of the neighboring parts, using only the part's own width. This fact can raise a problem in two situations. The first situation arises when the neighboring parts are much wider than the part itself, and the second one appears when the part itself is much wider than the neighboring ones. In the first situation, only a few examples of the neighboring parts have a non-zero membership grade in the MF corresponding to the part itself. In this case, due to the membership values, the influence of these examples on the induced FDT will be limited.
Fig. 4.4 Gaussian MFs based on partition widths
Fig. 4.5 Trapezoidal MFs based on partition widths for the cut-points 40, 43, 48 and 95
In the second situation, most of the examples of the neighboring parts, the whole neighboring part, or even more than one part are assigned a non-zero membership grade by the MF of the part itself. The problem in this situation is that the examples in the neighboring parts have a larger influence on the induced FDT than other examples fully belonging to the part itself. Figure 4.5 shows these situations.
4.3.1.2 Membership Functions Based on the Standard Deviation
In this approach, two statistical features, the mean and the standard deviation, are extracted from the set of examples located in each part. They are used to compute the membership function of each part. In addition, we have a user-defined parameter called StdCoefficient, which is a real positive number and controls the amount of fuzziness of the generated MFs. In contrast to the definition above based on the partition widths, the membership grades of the lower and the upper cut-points in the standard deviation based method are not necessarily equal to 0.5.
Table 4.4 Parameters of the triangular MFs based on the standard deviation

Leftmost MF (trapezoidal):  a = c1,  b = c1,  c = mean*,  d = mean* + 2 × stdVal**
Middle MFs (triangular):    a = mean* − 2 × stdVal**,  b = mean*,  c = mean* + 2 × stdVal**
Rightmost MF (trapezoidal): a = mean* − 2 × stdVal**,  b = mean*,  c = c2,  d = c2

* mean = (1/n) Σ_{i=1}^{n} x_i
** stdVal = std × StdCoefficient, where std = √((1/n) Σ_{i=1}^{n} (x_i − x̄)²)
Fig. 4.6 Triangular MFs based on the standard deviation (stdCoefficient=1)
We consider again different types of membership functions. When they are triangular, the parameters are computed using the expressions in Table 4.4. In the middle MFs, the second parameter of the triangular MF is set to the mean value of all the examples inside the part. Then, the first and the third parameters are set to the points located at a distance of 2 × stdVal on the left and the right hand side of the mean value, respectively. The parameter stdVal is the standard deviation of all the examples positioned inside the part multiplied by the user-defined parameter stdCoefficient. Here, similar to the case of using the width, the leftmost and the rightmost MFs are also trapezoidal. In the leftmost trapezoidal MF, the membership grade of all the examples on the left of the mean value is equal to one. Similarly, in the rightmost trapezoidal MF, the membership grade of all the examples on the right of the mean value is equal to one. Figure 4.6 illustrates an example of the triangular MFs based on the standard deviation for the cut-points 10, 40, 60 and 70. When the membership functions are trapezoidal, we use the parameters in Table 4.5. In the middle MFs, there is an interval centered at the mean and defined by stdVal (i.e., [mean − stdVal, mean + stdVal]) with membership grades equal to one.
Table 4.5 Parameters of the trapezoidal MFs based on the standard deviation

Leftmost MF:  a = c1,  b = c1,  c = mean* + stdVal**,  d = mean* + 2 × stdVal**
Middle MFs:   a = mean* − 2 × stdVal**,  b = mean* − stdVal**,  c = mean* + stdVal**,  d = mean* + 2 × stdVal**
Rightmost MF: a = mean* − 2 × stdVal**,  b = mean* − stdVal**,  c = c2,  d = c2

* mean = (1/n) Σ_{i=1}^{n} x_i
** stdVal = std × StdCoefficient, where std = √((1/n) Σ_{i=1}^{n} (x_i − x̄)²)
Fig. 4.7 Trapezoidal MFs based on the standard deviation (stdCoefficient=1)
The leftmost and rightmost parts are different from the others in that the left side and the right side, respectively, also have membership grades equal to one; see Table 4.5 for their definition. Figure 4.7 shows an example of the trapezoidal MFs for the cut-points 10, 40, 60 and 70. Table 4.6 presents the expressions to define a partition with Gaussian MFs. The definition is similar to the one based on the width: both the leftmost and the rightmost MFs are 2-sided Gaussian ones, and the parameters σ_left of the leftmost MF and σ_right of the rightmost MF can take any arbitrary value. We set both of them to one. Figure 4.8 illustrates an example of Gaussian MFs for the cut-points 10, 40, 60 and 70.
4.3.1.3 Membership Functions Based on the Neighbor Partition Coverage Rate
The neighbor partition coverage rate specifies trapezoidal MFs with respect to the distribution of the examples in the neighboring parts; more particularly, it takes into account the left and the right adjacent parts.
Table 4.6 Parameters of Gaussian MFs based on the standard deviation

Leftmost MF (2-sided Gaussian):  c_left = c1,  σ_left = 1,  c_right = mean*,  σ_right = stdVal**
Middle MFs (Gaussian):           c = mean*,  σ = stdVal**
Rightmost MF (2-sided Gaussian): c_left = mean*,  σ_left = stdVal**,  c_right = c2,  σ_right = 1

* mean = (1/n) Σ_{i=1}^{n} x_i
** stdVal = std × StdCoefficient, where std = √((1/n) Σ_{i=1}^{n} (x_i − x̄)²)
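The standard-deviation-based Gaussian construction of Table 4.6 can be sketched as follows. The function name, the output format and the assumption that every part contains at least one example are ours; `np.std` with its default settings matches the population standard deviation std used above.

```python
import numpy as np

def gaussian_partition_std(values, cut_points, std_coefficient=1.0):
    """MF parameters of Table 4.6: centre each Gaussian at the mean of the
    examples falling in the part and set sigma = std * StdCoefficient; the
    outer parts become 2-sided Gaussians, flat (grade 1) towards the ends."""
    values = np.asarray(values)
    mfs = []
    n_parts = len(cut_points) - 1
    for i in range(n_parts):
        c1, c2 = cut_points[i], cut_points[i + 1]
        inside = values[(values >= c1) & (values <= c2)]
        mean = inside.mean()
        std_val = inside.std() * std_coefficient   # population std, ddof = 0
        if i == 0:                        # leftmost: flat on [c1, mean]
            mfs.append(('gauss2', dict(c_left=c1, s_left=1.0,
                                       c_right=mean, s_right=std_val)))
        elif i == n_parts - 1:            # rightmost: flat on [mean, c2]
            mfs.append(('gauss2', dict(c_left=mean, s_left=std_val,
                                       c_right=c2, s_right=1.0)))
        else:                             # middle: ordinary Gaussian
            mfs.append(('gauss', dict(c=mean, sigma=std_val)))
    return mfs
```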
Fig. 4.8 Gaussian MFs based on the standard deviation (stdCoefficient = 1)
Fig. 4.9 The NPCR based MF definition. a Three consecutive parts. b The MF based on NPCR
This method employs a new parameter, called the Neighbor Partition Coverage Rate (NPCR), which is defined as the percentage of the examples belonging to each neighboring part that should be included in the support of the generated MF. For example, an NPCR equal to 0.5 means that half of the examples of each neighboring part should be included in the support of the generated MF. This method solves some of the disadvantages of the approach based solely on the width and, at the same time, it preserves the major advantage of that technique. In addition, the NPCR based approach sets the membership grade of each cut-point to 0.5. Figure 4.9a shows three consecutive parts determined by four cut-points, C0, C1, C2, and C3. We explain the MF generation process for the second part [C1, C2].
Table 4.7 Parameters of the trapezoidal MFs based on NPCR

Leftmost MF:  a = c1,  b = c1,  c = c2 − (U − c2),  d = U
Middle MFs:   a = L,  b = c1 + (c1 − L),  c = c2 − (U − c2),  d = U
Rightmost MF: a = L,  b = c1 + (c1 − L),  c = c2,  d = c2
The point L divides the left part into two sub-parts P1 and P2, and the point U divides the right one into two other sub-parts P3 and P4. The points L and U are selected so that the generated sub-parts satisfy Eq. (4.2):

$$\frac{|P_2|}{|P_1 \cup P_2|} = \frac{|P_3|}{|P_3 \cup P_4|} = \mathrm{NPCR}, \qquad (4.2)$$
where NPCR is a user-defined parameter of the algorithm, |P2| is the number of examples in the sub-part P2, |P1 ∪ P2| is the number of examples in both sub-parts P1 and P2 (which is, in fact, the number of examples in the left neighboring part), |P3| is the number of examples in the sub-part P3, and |P3 ∪ P4| is the number of examples in both sub-parts P3 and P4 (that is, the number of examples in the right neighboring part). After determining the points L and U, which are respectively the first and the last parameters, a and d, of the trapezoidal MF, the second and the third parameters, b and c, are determined in such a way that the membership grades of the cut-points C1 and C2 equal 0.5. Figure 4.9b depicts the generated MF, and Table 4.7 defines the parameters of the trapezoidal MF. Since the leftmost and the rightmost parts have only one neighbor, their parameters are specified differently. As can be seen from Table 4.7, the parameter b is defined in terms of the value L and the parameter c is defined in terms of the value U. Since the relations presented in Table 4.7 force the membership grade of the cut-points to be equal to 0.5, it is possible that the parameter b becomes greater than the parameter c, especially when the examples of the neighboring parts are far from the cut-points. Figure 4.10 illustrates an example of this special case. In this case, we redefine the parameters b and c in a way that leads to the minimum modification of the membership grade of the cut-points. Figure 4.11 presents the redefinition procedure for the parameters b and c of the trapezoidal MF; the parameters e and f used in Fig. 4.11 are represented in Fig. 4.10. The MF generation process based on NPCR depends on the distribution of the examples in the neighboring parts, and it may generate asymmetric MFs. The next section describes a method that generates MFs based on the distribution of the examples inside the parts.
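A compact sketch of the NPCR construction for a middle part is given below. Choosing L and U as sample quantiles of the neighbouring parts is our reading of Eq. (4.2), the fallback values when a neighbour is empty are our assumption, and the correction procedure of Figs. 4.10–4.11 is deliberately not reproduced.

```python
import numpy as np

def npcr_trapezoid(values, c0, c1, c2, c3, npcr=0.5):
    """Trapezoidal MF for the middle part [c1, c2] following Table 4.7.
    L is set so that a fraction NPCR of the left neighbour's examples lies
    in [L, c1], and U so that a fraction NPCR of the right neighbour's
    examples lies in [c2, U]; b and c then force a 0.5 grade at the
    cut-points. The correction for b >= c (Fig. 4.11) is omitted."""
    values = np.asarray(values)
    left = values[(values >= c0) & (values < c1)]
    right = values[(values > c2) & (values <= c3)]
    L = np.quantile(left, 1 - npcr) if len(left) else c1   # point L of Fig. 4.9a
    U = np.quantile(right, npcr) if len(right) else c2     # point U of Fig. 4.9a
    a, d = L, U
    b = c1 + (c1 - L)          # Table 4.7, middle MF
    c = c2 - (U - c2)
    return a, b, c, d
```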
4.3.1.4 Membership Functions Based on the Partition Coverage Rate
The partition coverage rate defines trapezoidal MFs based on the distribution of the examples of each part, taking into account a user-defined parameter: the Partition Coverage Rate (PCR).
Fig. 4.10 A special case of generating an MF based on NPCR

Fig. 4.11 Correction procedure for a special case of NPCR
The parameter PCR takes a value from [0, 1] and determines what percentage of the examples in each part need to have a membership grade equal to one. For instance, a PCR equal to 0.8 means that the parameters b and c of the trapezoidal MF should be determined so that 10% of the examples have a value less than b and another 10% of the examples have a value greater than the parameter c. The other parameters of the trapezoidal MF are specified so that the membership grade of the cut-points is 0.5. Figure 4.12a describes how to build a fuzzy set determined by two cut-points C1 and C2. The points L and U divide the interval into three sub-parts P1, P2, and P3. These points are selected so that the generated parts satisfy Eq. (4.3).
Fig. 4.12 PCR based MF definition. a A partition divided into three sub-partitions. b An MF based on PCR

Table 4.8 Parameters of the trapezoidal MF based on the partition coverage rate

Leftmost MF:  a = c1,  b = c1,  c = U,  d = c2 + (c2 − U)
Middle MFs:   a = c1 − (L − c1),  b = L,  c = U,  d = c2 + (c2 − U)
Rightmost MF: a = c1 − (L − c1),  b = L,  c = c2,  d = c2
$$\frac{|P_2|}{|P_1 \cup P_2 \cup P_3|} = \mathrm{PCR} \quad\text{and}\quad \frac{|P_1|}{|P_1 \cup P_2 \cup P_3|} = \frac{|P_3|}{|P_1 \cup P_2 \cup P_3|} = \frac{1 - \mathrm{PCR}}{2}, \qquad (4.3)$$

where PCR is a user-defined parameter of the algorithm; |P1|, |P2|, and |P3| are the numbers of examples in the sub-parts P1, P2 and P3, respectively; and |P1 ∪ P2 ∪ P3| is the number of examples of the part, that is, the total number of examples in the sub-parts P1, P2, and P3. Figure 4.12b illustrates the MF we build from the construction described in Fig. 4.12a. After determining the points L and U, the relations given in Table 4.8 can be used to define the parameters of the trapezoidal MF. The MFs based on PCR are not necessarily symmetric, while the MFs generated using the width of the parts or the standard deviation are always symmetric.
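The PCR construction for a middle part can be sketched as follows. Taking L and U as the (1 − PCR)/2 and 1 − (1 − PCR)/2 sample quantiles of the examples inside the part is our reading of Eq. (4.3); the function name and signature are ours, and the part is assumed to be non-empty.

```python
import numpy as np

def pcr_trapezoid(values, c1, c2, pcr=0.8):
    """Trapezoidal MF for the part [c1, c2] following Eq. (4.3) and Table 4.8:
    L and U leave a fraction (1 - PCR)/2 of the part's examples on each side,
    so a fraction PCR of them receives a membership grade of one."""
    values = np.asarray(values)
    inside = values[(values >= c1) & (values <= c2)]
    tail = (1 - pcr) / 2
    L = np.quantile(inside, tail)        # |P1| / |P1 u P2 u P3| = (1 - PCR)/2
    U = np.quantile(inside, 1 - tail)    # |P3| / |P1 u P2 u P3| = (1 - PCR)/2
    a = c1 - (L - c1)                    # Table 4.8, middle MF
    b, c = L, U
    d = c2 + (c2 - U)
    return a, b, c, d
```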
4.3.2 Fuzzy Entropy Based Fuzzy Partitioning

Fayyad and Irani [3] proposed the entropy based discretization method that discretizes an attribute's domain in a binary recursive manner. The entropy based discretization method selects a cut-point to divide a continuous attribute's domain into two sub-partitions. The cut-point is selected so that it minimizes the weighted entropy of the generated sub-partitions. This process is repeated for each generated sub-partition as long as the stopping criterion is not satisfied. The algorithm determines the proper time to stop the recursive discretization by applying the Minimum Description Length Principle (MDLP). This section presents a modification of the entropy based discretization method to generate fuzzy partitions for the domain of a continuous attribute. First, a brief description of the entropy based discretization method is provided.
The Entropy Based Discretization method (EBD) selects a cut-point that minimizes the weighted entropy of the generated sub-parts. For an example set S and a cut-point T, the weighted entropy is defined by Eq. (4.4):

\mathrm{WEnt}(T, S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2),   (4.4)
where S1 is the subset of the example set S with values less than or equal to T, S2 = S − S1, |Si| is the number of examples in the example set Si, and Ent(Si) denotes the entropy of Si calculated using Eq. (4.1). Fayyad and Irani [3] have proven that the cut-point T minimizing WEnt(T, S) must be a boundary point. A value T in the range of the attribute A is a boundary point if and only if, in the sequence of examples sorted by the value of A, there exist two examples e1 ∈ S and e2 ∈ S with different class labels such that A(e1) < T < A(e2), and there exists no other example e′ ∈ S such that A(e1) < A(e′) < A(e2). Fayyad and Irani [3] applied MDLP to determine whether the candidate cut-point minimizing the weighted entropy should be accepted as a cut-point or rejected. The cut-point T on the example set S composed of N examples is accepted if and only if the following condition is met, and it is rejected otherwise:

\mathrm{Gain}(T, S) > \frac{\log_2(N - 1) + \log_2(3^k - 2) - [k\,\mathrm{Ent}(S) - k_1\,\mathrm{Ent}(S_1) - k_2\,\mathrm{Ent}(S_2)]}{N},   (4.5)

in which k, k1 and k2 are the numbers of classes of S, S1, and S2, respectively, Ent(Si) is the entropy of Si, and Gain(T, S) is the information gain. The information gain indicates the difference between the entropy of the example set before and after the discretization. The information gain of an attribute relative to the example set S and the cut-point T is defined as
\mathrm{Gain}(T, S) = \mathrm{Ent}(S) - \mathrm{WEnt}(T, S),   (4.6)
where Ent(S) is the entropy of S and WEnt(T, S) is the weighted entropy calculated based on Eq. (4.4). The details of the codes and the proofs corresponding to the conditions and formulas discussed in this section can be found in [3]. The proposed FEBFP (Fuzzy Entropy Based Fuzzy Partitioning) merges the discretization and MF generation steps of the EBD-based approach into a single procedure. Similar to EBD, FEBFP is composed of three sub-algorithms: candidate cut-point selection, cut-point selection, and a technique to determine when to stop the partitioning process. Candidate cut-points are selected in FEBFP in the same way as in EBD. In order to select the best cut-point, FEBFP considers all the candidate cut-points; for each of them an MF is defined and the weighted fuzzy entropy is calculated, and the selected cut-point is the one that minimizes the weighted fuzzy entropy. The value of the weighted fuzzy entropy depends on the shape and the parameters of the MFs. The weighted fuzzy entropy is computed by Eq. (4.7).
\mathrm{WFEnt}(T, S) = \frac{|S_1|}{|S|}\,\mathrm{FEnt}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{FEnt}(S_2),   (4.7)
in which |S1|, |S2| and |S| are the fuzzified numbers of examples of S1, S2 and S, respectively, and FEnt(Si) is the fuzzy entropy of Si. The fuzzified number of examples of a partition S on which the membership function M is defined is calculated using Eq. (4.8):

|S| = \sum_{i=1}^{n} \mu_M(X_i),   (4.8)
where \mu_M(X_i) denotes the membership grade of X_i in the MF M. The fuzzy entropy is calculated by Eq. (4.9):

\mathrm{FEnt}(S) = -\sum_{i=1}^{k} \frac{|S_{(y=c_i)}|}{|S|} \log_2\!\left(\frac{|S_{(y=c_i)}|}{|S|}\right),   (4.9)
in which k is the number of class labels, and S_{(y=c_i)} is the subset of S whose examples have the class label c_i. In addition, the numbers of examples are calculated by the formula given in Eq. (4.8). The mechanism of the FEBFP's third sub-algorithm is similar to that of EBD; however, FEBFP works with fuzzy MFs, so the entropies are replaced by their fuzzy counterparts. In other words, Eq. (4.10) is employed in FEBFP to determine the stopping criterion of the algorithm:

\mathrm{FGain}(A, T; S) > \frac{\log_2(N - 1)}{2} + \frac{\log_2(3^k - 2) - [k\,\mathrm{FEnt}(S) - k_1\,\mathrm{FEnt}(S_1) - k_2\,\mathrm{FEnt}(S_2)]}{N},   (4.10)

where A is the attribute under consideration, T is the threshold corresponding to the partition S, N is the fuzzy number of examples of S, S1 and S2 are the two sub-partitions generated by the threshold T, k is the number of class labels of S, k1 and k2 are the numbers of class labels of S1 and S2, respectively, FEnt(Si) is the fuzzy entropy of Si computed using Eq. (4.9), and FGain(A, T; S) is the fuzzy information gain calculated by Eq. (4.11):

\mathrm{FGain}(A, T; S) = \mathrm{FEnt}(S) - \frac{|S_1|}{|S|}\,\mathrm{FEnt}(S_1) - \frac{|S_2|}{|S|}\,\mathrm{FEnt}(S_2).   (4.11)
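A small Python sketch of Eqs. (4.7)-(4.9) and (4.11) follows; the function names are ours, and the membership grades and class labels are assumed to be supplied as NumPy arrays.

```python
import numpy as np

def fuzzy_count(mu):
    """Fuzzified number of examples, Eq. (4.8): the sum of membership grades."""
    return np.sum(mu)

def fuzzy_entropy(mu, y, classes):
    """Fuzzy entropy of a (sub-)partition, Eq. (4.9).  `mu` holds the membership
    grades of the examples in the partition's MF and `y` their class labels."""
    total = fuzzy_count(mu)
    ent = 0.0
    for c in classes:
        p = fuzzy_count(mu[y == c]) / total
        if p > 0:
            ent -= p * np.log2(p)
    return ent

def fuzzy_gain(mu_s, mu_s1, mu_s2, y, y1, y2, classes):
    """Fuzzy information gain, Eq. (4.11), built from the weighted fuzzy entropy
    of the two sub-partitions generated by a candidate cut-point, Eq. (4.7)."""
    s, s1, s2 = fuzzy_count(mu_s), fuzzy_count(mu_s1), fuzzy_count(mu_s2)
    wfent = (s1 / s) * fuzzy_entropy(mu_s1, y1, classes) \
          + (s2 / s) * fuzzy_entropy(mu_s2, y2, classes)
    return fuzzy_entropy(mu_s, y, classes) - wfent
```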
4.4 Discussion

The estimation of membership functions from data is an important issue in many applications of fuzzy theory. In this chapter, some methods were presented in which the crisp partitions generated by crisp discretization algorithms are transformed into fuzzy membership functions. Four approaches were discussed for this transformation. The first approach was based on the partition's
widths, the second one was based on the standard deviation of the examples in each of the parts of the partition, the third one was defined in terms of the neighbor partition coverage rate, and the last one was based on the partition coverage rate. Moreover, an entropy based discretization algorithm, proposed by Fayyad and Irani, was modified to generate membership functions, taking into account their shape and their parameters when selecting the cut-points. The description of the experiments conducted with several discretization methods and the comparison of the results of these experiments are included in [4]. The methods considered include Equal Width (equWidth), Equal Frequency (equFreq), Fayyad, ID3, Bayesian, MantarasDist (ManDist), USD, Chi-Merge, Chi2, Ameva, Zeta, CADD, CAIM, ExtendedChi2 (exChi2), FixedFrequency (fixFreq), Khiops, ModifiedChi2 (modChi2), MODL, 1R, Proportional (prop), HeterDisc (hetDisc), HellingerBD (helBD), DIBD, UCPD, CACC, HDD, ClusterAnalysis (cluAna), MVD and FUSINTER (FUS) [2]. These methods, which were briefly described in Sect. 4.2, are implemented in the KEEL package [2, 5]. To evaluate these alternative discretization methods, datasets from the UCI machine learning repository [6] were used. The statistical analysis of the experimental results is included in [4]. The results described in [4] show that eight MF generation methods (PCR-CAIM, PCR-Zeta, PCR-FEBFP, PCR-MantarasDist, PCR-Fayyad, NPCR-FEBFP, NPCR-Fayyad and NPCR-MantarasDist) outperformed the other ones in terms of both the accuracy and the number of nodes in the tree. Among these eight methods, the trapezoidal MFs defined by PCR on the crisp partitions generated by the Zeta discretization algorithm outperformed the other methods when the accuracy and the complexity of the FDT had the same degree of importance.
4.5 Computer Programming Exercises for Future Works

1. According to the results reported in this chapter from [4], Zeta discretization combined with the trapezoidal MFs based on PCR is the best-performing method. Consider using different discretization methods for different features. For instance, if four attributes are given, is it reasonable to use a distinct discretization method for each attribute? Develop a program in which different discretization techniques and MF definition methods are applied to different features, and build fuzzy decision trees using this approach.

2. Study and implement the methods proposed in this chapter when, instead of traditional fuzzy sets, the Z-numbers introduced by Zadeh are used in the fuzzy decision trees. This type of FDT, which can be called Z-FDT, may be of relevance when we need to deal with higher levels of uncertainty.
References

1. Medasani, S., J. Kim, and R. Krishnapuram. 1998. An overview of membership function generation techniques for pattern recognition. International Journal of Approximate Reasoning 19 (3-4): 391-417.
2. García, S., J. Luengo, and F. Herrera. 2015. Discretization, 245-283. Cham: Springer.
3. Fayyad, U., and K. Irani. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. In The Thirteenth International Joint Conference on Artificial Intelligence, Chambéry, France, 1022-1027.
4. Zeinalkhani, M., and M. Eftekhari. 2014. Fuzzy partitioning of continuous attributes through discretization methods to construct fuzzy decision tree classifiers. Information Sciences 278: 715-735.
5. Alcalá-Fdez, J., A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera. 2011. KEEL data-mining software tool: Dataset repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing 17 (2-3): 255-287.
6. Dua, D., and C. Graff. 2017. UCI machine learning repository. Irvine: School of Information and Computer Sciences, University of California.
Chapter 5
Comparing Different Stopping Criteria for Fuzzy Decision Tree Induction Through IDFID3
5.1 Introduction

Decision trees [1, 2] are commonly used as classification models in data mining. Their induction methods recursively partition the instance space to generate tree-structured models, and the recursion ends when the stopping condition is satisfied. In these tree structures, leaves represent classifications and branches represent conjunctions of attributes that lead to those classifications. Decision trees are comprehensible classifiers, but they cannot handle linguistic uncertainty and are sensitive to noise. One of the challenges in fuzzy decision tree induction is to develop algorithms that produce fuzzy decision trees of small size and depth. Larger fuzzy decision trees (or over-fitted FDTs) lead to poor generalization performance and need more time to classify a new instance. FDT growth-controlling methods try to handle the problem of over-fitting. They may be classified into two categories, post-pruning methods and pre-pruning methods.

• Post-pruning methods allow the FDT learning algorithms to over-fit the training data. Then, the over-fitted FDT is cut back into a smaller one by removing those sub-branches that hinder generalization. The problem of these methods is that they are computationally expensive. Cintra et al. [3] have analyzed the effect of different pruning rates of different pruning strategies on fuzzy decision trees.

• Pre-pruning methods stop the development of the FDT according to a stopping criterion before the tree becomes too large. These methods require less computation time. Their main difficulty is the definition of the stopping criterion and, when this criterion is based on a threshold, the concrete value of this threshold. Strong stopping criteria tend to create small and under-fitted fuzzy decision trees. On the other hand, loose ones tend to generate large decision trees that are over-fitted to the training data. For example, we can use a certain number of nodes as the threshold to stop the tree induction. A value too small will generate trees that are too compact
and not accurate enough. In the extreme case, only one node is allowed and all instances are classified in the same way. In contrast, a value too large will generate trees with an arbitrary number of nodes and maximum over-fitting. In this chapter, an FDT induction algorithm, Iterative Deepening Fuzzy ID3 (IDFID3), is described. This FDT induction algorithm is based on a fuzzy ID3 approach and has the ability to control, in an iterative process, the tree's growth by dynamically setting the threshold value of the stopping criterion. The final FDT induced by IDFID3 and the one obtained by the standard FID3 are the same when the numbers of nodes of the induced FDTs are equal. Nonetheless, the main intention for introducing IDFID3 is to consider alternative numbers of nodes and to compare various stopping criteria. To do so, a new stopping criterion, Normalized Maximum fuzzy information Gain multiplied by Number of Instances (NMGNI), is considered. Then, IDFID3 is used to compare the above mentioned NMGNI and other stopping criteria. The organization of this chapter is as follows. A variety of stopping criteria are discussed in Sect. 5.2. Section 5.3 describes IDFID3 as well as the above mentioned stopping criterion NMGNI. The chapter then focuses on a method for comparing stopping criteria independently of their threshold values. This method can be found in Sect. 5.3. Section 5.4 discusses some experimental results for the methods reviewed in this chapter. A motivational programming exercise is given in Sect. 5.5.
5.2 Stopping Criteria

The best fuzzy decision tree is the one that has the minimum number of nodes and the maximum classification accuracy. Increasing the number of nodes enhances the classification accuracy on the training examples. Nevertheless, a tree with many nodes may be over-fitted to the training examples and lead to poor generalization, and thus to a not so good accuracy on real data. Therefore, the best FDT is the one that makes the best trade-off between the complexity (number of nodes) and the classification accuracy. A stopping criterion is one of the methods to control the growth of a tree and its complexity, for example, by limiting the number of nodes through a threshold. Two different stopping criteria can generate two FDTs which have equal sizes in terms of the number of nodes, but have different structures. Figure 5.1 shows two trees of equal size whose structures are dissimilar. Three factors impact the structure of an FDT: the stopping criterion, the splitting criterion, and the number of membership functions defined on each attribute's domain. The number of membership functions of the attribute Ai determines the number of child nodes that should be generated if Ai has been selected as the branching attribute. The splitting criterion decides the branching attribute of a node and, consequently, the number of child nodes for each branch. The stopping criterion specifies which nodes should be expanded and, then, its threshold determines when the FDT growth should be stopped. Having the same splitting criterion and the same
Fig. 5.1 Trees with equal size but with different structures
membership functions on each attribute's domain, the stopping criterion is the only effective parameter that controls the structure of an FDT. The structure of an FDT has, of course, an influence on its classification accuracy. Therefore, the type of stopping criterion used in the FDT induction, regardless of its threshold value, has an impact on the classification accuracy of the FDT. The threshold value of the stopping criterion also influences the classification ability of an FDT: since the threshold value controls the number of nodes of the FDT, and the number of nodes has a great impact on accuracy, so does the threshold. Briefly, both the type of the stopping criterion and its threshold value influence the classification accuracy of an FDT. More particularly, the stopping criterion affects the outcome by controlling the structure of the FDT, and its threshold by controlling the number of nodes. We consider two types of stopping criteria in this chapter, which we call Type I and Type II.

• Type I. The algorithm to construct an FDT proceeds while there is a leaf with at least one attribute for which the value of the stopping criterion is less than a predefined threshold value. Accuracy is an example of this type of stopping criterion. Setting the threshold value equal to 0.8, nodes will be expanded until all leaves have an accuracy larger than 0.8.

• Type II. The algorithm expands a node while its associated value is greater than a predefined threshold value. The number of instances covered by a node is an example of this type of stopping criterion. Setting the threshold value equal to 10, nodes with more than 10 associated instances will be expanded. A small code sketch of these two threshold tests is given after the list of common stopping criteria below.

FDT learning algorithms can use any of the stopping criteria applied in crisp decision tree learning. The most common stopping criteria for crisp decision trees reported in the literature are as follows [2].

1. Accuracy of a node.
2. Number of instances of a node.
3. Tree depth.
4. Entropy.
5. Information gain.

The first and the second criteria have been extended to FDT induction (see [4]). To employ the entropy and information gain criteria in FDT induction we need to apply their fuzzy versions, introduced in Eqs. (1.27) and (1.26), respectively.
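The threshold tests of the two families of criteria can be sketched as follows; the function name and the string encoding of the criterion type are illustrative assumptions.

```python
def node_is_expandable(criterion_value, threshold, criterion_type):
    """Threshold test for the two families of stopping criteria described above.
    Type I (e.g. node accuracy): keep expanding while the value is below the threshold.
    Type II (e.g. number of covered instances): keep expanding while the value is above it."""
    if criterion_type == "I":
        return criterion_value < threshold
    return criterion_value > threshold
```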
5.3 Iterative Deepening FID3

One of the most important elements when defining stopping criteria is to determine a proper threshold value. In general, it is not possible to predict the number of nodes of a fuzzy decision tree given the threshold value of a stopping criterion. Similarly, it is difficult to infer properties of the structure of the trees that are obtained from such given threshold values. In this section, we describe a fuzzy decision tree construction method introduced in [5] called Iterative Deepening FID3 (IDFID3). This method was introduced to overcome the problem of controlling the FDT size (the number of nodes) via stopping criteria. The IDFID3 technique employs the same splitting and stopping criteria, and also the same inference procedure, as FID3. The main characteristic of IDFID3 is that it uses a predefined Minimum Number of Nodes (MNN) for the FDT. Additionally, it controls the tree's growth by dynamically setting the threshold value of the stopping criterion. The idea behind IDFID3 is that a fixed predefined threshold value for a stopping criterion is not suitable for controlling the size of the tree. Therefore, IDFID3 changes the threshold value dynamically in successive iterations in order to better control the tree size. Another important characteristic of IDFID3 is that it categorizes the nodes which have not yet been expanded into two groups, expandable and non-expandable. When the fuzzy dataset of a node has at least one attribute that can be used as the branching attribute, the node is classified as expandable; otherwise it is considered non-expandable. This classification is used by IDFID3 in its iterative process to guide node expansion. More particularly, the algorithm computes an expansion preference for each expandable node based on the type and the value of the stopping criterion. Then, it selects the node with the Highest Expansion Preference (HEP). The process of growing the tree is repeated until all the expandable nodes have an expansion preference less than HEP. This iterative process is repeated until the induced FDT has at least MNN nodes. It is important to notice that the FDT induced by IDFID3 is the same as the one produced by FID3 when the numbers of nodes of both decision trees are equal. Algorithm 5.1 describes the IDFID3 technique in detail. Since the node expansion process in IDFID3 is the same as that of FID3 (Steps 9 and 10 in Algorithm 1.1), the details of node expansion are not repeated in Algorithm 5.1; we refer the reader to Sect. 1.2.1.2. In each iteration of the while loop in IDFID3, the algorithm updates the threshold value in order to add a minimum number of nodes to the FDT. In each iteration, the
algorithm finds and expands the node with the highest expansion preference (HEP). In order to compute this expansion preference for each node, the type of stopping criterion employed in the FDT induction process is used. Recall that when a Type I stopping criterion is used, an increase in the threshold value leads to an increase in the number of nodes. Hence, the lower the value of the stopping criterion for a node, the higher the priority for its expansion. After Step 6, in which the new threshold value is determined, all the expandable nodes with a stopping criterion value less than or equal to the new threshold are expanded. On the other hand, when the stopping criterion is of Type II, a drop in the threshold value increases the number of nodes. Therefore, higher values of the criterion imply higher priority for expansion; namely, after Step 6, all expandable nodes with a stopping criterion value greater than or equal to the new threshold are expanded.
Algorithm 5.1: The iterative deepening FID3 (IDFID3) algorithm.
1  Inputs: The crisp training data; the predefined membership functions on each attribute; the splitting criterion; the stopping criterion; the Minimum Number of Nodes (MNN);
2  Output: The constructed FDT;
3  Generate a root node with a fuzzy dataset containing all the crisp training data and set all the membership degrees to one;
4  while (the minimum number of the nodes has not been generated) do
5      Find an expandable node N with the highest expansion preference;
6      Set the threshold value to the value of the stopping criterion for N;
7      Expand the node N;
8      while (there is an expandable node in the FDT whose expansion preference is higher than or equal to the expansion preference of N) do
9          expand it;
10     end
11     Make each of the nodes that are not expanded a leaf and assign the fraction of the examples of N belonging to each class as a label of that class;
12 end
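The control flow of Algorithm 5.1 can also be sketched in Python. In the sketch below, expand(node), preference(node) and the is_expandable flag are assumed to be provided by the surrounding FID3 implementation; these names are ours, and the code is an illustration of the loop rather than the authors' implementation.

```python
def idfid3(root, expand, preference, min_num_nodes, criterion_type="I"):
    """Minimal sketch of the IDFID3 control loop (Algorithm 5.1).
    `expand(node)` grows one node and returns its children, `preference(node)`
    returns the stopping-criterion value of an expandable node, and
    `node.is_expandable` flags nodes that still have an unused attribute."""
    if criterion_type == "I":   # e.g. accuracy: lower value = higher preference
        meets, pick_hep = (lambda v, t: v <= t), min
    else:                       # Type II, e.g. number of instances
        meets, pick_hep = (lambda v, t: v >= t), max
    tree_size, expandable = 1, [root]
    while tree_size < min_num_nodes and expandable:
        # Steps 5-6: the highest-expansion-preference node fixes the new threshold.
        threshold = preference(pick_hep(expandable, key=preference))
        # Steps 7-10: expand every expandable node (including new ones) meeting it.
        while True:
            ready = [n for n in expandable if meets(preference(n), threshold)]
            if not ready:
                break
            node = ready[0]
            children = expand(node)
            expandable.remove(node)
            expandable.extend(c for c in children if c.is_expandable)
            tree_size += len(children)
    return root
```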
The new threshold value found in Step 6 may lead to the expansion of more than one node, thereby making it difficult to estimate the number of new nodes. The IDFID3 technique tries to add the minimum number of new nodes to the FDT in each iteration. This goal is achieved through iteratively growing and deepening the FDT by expanding, in each iteration, a subtree rooted at the node with the highest expansion preference. The number of newly generated nodes in each iteration depends on the distribution of data and the employed stopping criterion. It is worth mentioning that a Growth Control Capability (GCC) measure was introduced in [6] to determine the ability of a stopping criterion to control the tree's growth. Algorithm FID3 adds new child nodes to the FDT in a way that is similar to a greedy hill-climbing search: the algorithm selects the best node to insert in the tree based on FIG and never backtracks to reconsider earlier choices. The proposed IDFID3 method selects such best nodes using FIG as well; however, the order of expansion of these selected nodes does not follow a Depth First Search (DFS)
Fig. 5.2 Complete fuzzy decision tree. The value inside each node shows its accuracy
strategy. Instead, the IDFID3 technique follows an Iterative Deepening A∗ (IDA∗) search method [7], whose heuristic function is the stopping criterion. In other words, the algorithm ranks the candidate nodes for expansion by means of the value of the stopping criterion. Thus, the growth of the tree in IDFID3 is achieved in a way similar to a combination of the breadth first search and DFS algorithms, so it takes place along both the width and the depth of the tree. Figure 5.2 depicts a completely expanded FDT constructed on a fuzzy dataset with four attributes. The value inside each node presents its accuracy. In addition, Fig. 5.3 illustrates four iterations of IDFID3 utilizing the accuracy stopping criterion. In this figure, the newly generated nodes are highlighted and the new threshold value, which should be employed in the next iteration, is written in bold face inside the node with the highest expansion preference. In the first iteration, the root node is constructed and the value of the threshold is set to 0.4, which is the accuracy of the root node. In the second iteration, two new nodes are generated whose accuracy values, which are the stopping criterion values, are 0.42 and 0.47. The expansion process terminates in the second iteration, because both 0.42 and 0.47 exceed the threshold value, 0.4, determined in the first iteration. In this iteration the minimum of the accuracies, 0.42, is used as the threshold value for the next iteration. In the third iteration, the node with the lowest accuracy, which is the node with accuracy 0.42 from the previous iteration, has the highest preference for expansion. In this step, two new nodes with accuracies 0.6 and 0.45 are produced by the expansion of this node. Both of these accuracy values exceed the threshold value for this step. Thus, the algorithm stops the expansion in this iteration and sets the threshold value for the next iteration to 0.45, which is the minimum value among 0.6, 0.45 and 0.47. In the fourth iteration, the algorithm starts from the node with accuracy 0.45 since this node has the highest priority for expansion. Its expansion generates two child nodes with accuracy values of 0.55 and 0.44. The first child, the node whose accuracy is 0.55, exceeds the threshold value; nevertheless, the second one, which has accuracy 0.44, does
Fig. 5.3 Four iterations of IDFID3. a Iteration 1. b Iteration 2. c Iteration 3. d Iteration 4
not exceed the threshold value. As a result, this iteration continues with its expansion. The expansion of this node produces two new child nodes with accuracies 0.43 and 0.7. None of these newly generated child nodes is expandable, because the fuzzy dataset has four attributes, all of which have already been used as branching attributes in the ancestor nodes. Therefore, this iteration terminates. Finally, the threshold value of the next iteration is set to 0.47, which is the minimum accuracy among the expandable nodes, whose accuracy values are 0.6, 0.55 and 0.47. If the number of generated nodes is less than MNN, which is specified by the user, the iterations continue; otherwise, the process stops. As can be seen from the example discussed above, at least one node is expanded in each iteration. In the two following subsections, stopping criteria and their comparison are discussed. The description begins with Sect. 5.3.1, which discusses the stopping criterion in detail, and then continues with Sect. 5.3.2, which studies how to compare different stopping criteria.
5.3.1 The Stopping Criterion of IDFID3

The IDFID3 induction method uses a greedy strategy to construct a fuzzy decision tree. Greedy algorithms make a locally optimal choice in the hope that this choice will lead to a globally optimal solution. The IDFID3 technique applies the greedy approach in two steps. The first step is the selection of an attribute for branching, in which the attribute with the maximum value of the splitting criterion (FIG) is selected.
The second step is the selection of a node for expansion. In this case, the node with the highest expansion preference is selected based on the stopping criterion. In other words, the greedy approach of IDFID3 is controlled by two heuristics, the splitting criterion and the stopping criterion. In the current subsection, the stopping criterion proposed in [5], which is defined in terms of the fuzzy information gain, is described; this criterion results in a better understanding of the FDTs. Which node should be selected for expansion in the IDFID3 iterations? The best choice, according to the greedy approach, is the node whose expansion leads to the maximum increase in the accuracy of the FDT. Calculating the precise increase in accuracy of an FDT when a certain node is expanded requires the evaluation of the whole FDT. Since the evaluation of an FDT demands high computational effort, heuristic functions that approximate the accuracy are good alternatives. Such a heuristic function can be used as a stopping criterion, as its value for each of the tree nodes under consideration provides a suitable guide for IDFID3. Therefore, in the greedy approach the best stopping criterion is the one that assigns, in each iteration, the highest expansion preference to the node whose expansion results in the maximum increase in the accuracy of the FDT. Here, it is supposed that the process of FDT construction is at an arbitrary node. The dataset associated with a given node c is denoted by Sc. Note that this dataset contains several instances with different fuzzy membership degrees; because of that, we can use the term fuzzy dataset to denote this set. Then, the attribute that has the highest value of the fuzzy information gain must be selected as the branching attribute. This attribute is denoted by A_M and is given by Eq. (5.1):

A_M = \operatorname*{argmax}_{A_i} \mathrm{FIG}(S_c, A_i).   (5.1)
In this expression, FIG(Sc, Ai) is the fuzzy information gain of the attribute Ai relative to Sc. This is described in detail in Sect. 1.2.1.1. Expansion of the current node according to the attribute A_M makes the maximum difference between the information of the current node and its child nodes. Hence, A_M is possibly the most promising attribute for improving the accuracy of the classification. Recall that a higher value of FIG related to A_M means a greater contribution of this attribute to the improvement of the total classification accuracy. When we consider the attribute A_M, the criterion called Normalized Maximum fuzzy information Gain multiplied by Number of Instances (NMGNI) is defined by Eq. (5.2):

\mathrm{NMGNI}(S_c) = |S_c| \times \frac{\mathrm{FIG}(S_c, A_M)}{\log_2 m}.   (5.2)
In this definition, as above, FIG(Sc , A M ) is the fuzzy information gain of the attribute A M related to Sc , m is the number of classes and |Sc | is the number of instances of Sc . When the problem consists of m classes, FIG can get a value in the range [0, log2 m]. Thus, in order to normalize the FIG value so that it takes a value in the
range [0, 1], we need to divide it by log2 m. Then, naturally, the higher the value of the term FIG(Sc, A_M)/log2 m, the greater the expected contribution of the current node to the final classification of the FDT. Multiplying this term by |Sc|, the expression also takes into account the size of the fuzzy dataset Sc that may be correctly classified after the expansion of the current node. In summary, the NMGNI heuristic tries to predict the usefulness of a node in terms of its contribution to the classification task. In the following subsection, this is further discussed in terms of the experiments and comparisons presented in [5].
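A direct transcription of Eq. (5.2) in Python might look as follows; the function name and the assumption that the FIG values of all candidate branching attributes are already available are ours.

```python
import numpy as np

def nmgni(fig_values, num_instances, num_classes):
    """NMGNI stopping criterion, Eq. (5.2): the maximum fuzzy information gain
    over the candidate branching attributes, normalized by log2(m) and weighted
    by the (possibly fuzzified) number of instances |Sc| of the node."""
    best_fig = max(fig_values)          # FIG(Sc, A_M), cf. Eq. (5.1)
    return num_instances * best_fig / np.log2(num_classes)
```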
5.3.2 Comparison Method for Various Stopping Criteria

In this section we describe a method to compare two stopping criteria irrespective of their threshold values. The approach considers the effectiveness of a stopping criterion for fuzzy decision tree induction using the IDFID3 algorithm. Effectiveness is defined in terms of both the accuracy and the complexity (number of nodes) of the tree. That is, the best stopping criterion is the one that produces FDTs with the minimum number of nodes and the greatest accuracy. Several parameters affect the number of nodes and the accuracy of the induced FDT, for example the stopping criterion, the associated threshold value, the splitting criterion, the type of membership functions, and the number of membership functions. In order to study the effectiveness of different stopping criteria, two of these parameters are fixed in advance: the splitting criterion and the type of membership functions, which are set, respectively, to FIG and to triangular membership functions. In order to properly evaluate the effect of the stopping criteria, different numbers of membership functions are considered. Then, the accuracies of the FDTs induced using IDFID3 with these numbers of membership functions are averaged. In this way, it is possible to draw better conclusions about the different threshold values of the stopping criterion. As stated above, this comparison considers simultaneously both the classification accuracy and the size of the induced FDT. We have already discussed, see Sect. 5.2, that the stopping criterion and its threshold value affect both the accuracy and the size of the FDT. A stopping criterion that produces more accurate FDTs for each arbitrary size is better than the other stopping criteria. Figure 5.4 illustrates this situation. It displays both the error and the size (in terms of number of nodes) of the FDTs generated in different iterations of the IDFID3 algorithm for a given stopping criterion. In this figure, each square marker corresponds to a generated FDT. Edges permit us to represent the iterative process of the IDFID3 algorithm. For example, in the 11th iteration, the algorithm produces an FDT with 52 nodes, and in the next iteration it produces an FDT with 60 nodes. Figure 5.4 clearly shows that different iterations of IDFID3 generate FDTs with different sizes. The edge that connects two square points can be used to interpolate the error of FDTs between two consecutive iterations. For example, we can use Fig. 5.4 and the results of the 11th and 12th iterations to interpolate the accuracy of FDTs with sizes in the range 53-59.
Fig. 5.4 Error and size of FDTs in terms of number of nodes that are generated in different iterations of the IDFID3 algorithm
An issue of considerable importance is the induction of FDTs with a large number of nodes, as well as inference using these trees. Tree induction in this case may require high computational effort. Moreover, such FDTs are probably over-fitted to the training data and, because of that, they may lead to wrong conclusions. In order to avoid wrong conclusions in the analysis, [5] considers only the FDTs with fewer than 100 nodes. More particularly, for a given stopping criterion, the average error over all the FDTs of sizes 1-100 is computed. This computation is used to evaluate the effectiveness of the criterion. The process is as follows. First, for a particular number of membership functions, the accuracy of the FDTs of the available sizes is calculated. As the IDFID3 algorithm does not provide trees of all arbitrary sizes, when a size is not available, the accuracy is interpolated as described above. Next, the same process is repeated for different numbers of membership functions. Using this information, for each size of the tree, we can compute an average accuracy value by averaging the accuracies for that size obtained from trees with different numbers of membership functions. Finally, an ineffectiveness rate is computed by averaging the errors corresponding to the different sizes obtained in the previous step. In this way, the ineffectiveness rate combines both the accuracy and the size of the tree. This ineffectiveness rate is employed to compare the stopping criteria.
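A possible sketch of this comparison measure is shown below. It assumes that the errors (one minus the accuracies) and the tree sizes recorded along the IDFID3 iterations are available for each number of membership functions, and it uses linear interpolation as described above; the data layout and the function name are our assumptions.

```python
import numpy as np

def ineffectiveness_rate(runs, max_size=100):
    """`runs` is a list (one entry per number of membership functions) of
    (sizes, errors) pairs recorded along the IDFID3 iterations.  Errors for
    missing sizes are linearly interpolated, averaged over the runs for each
    size 1..max_size, and finally averaged over the sizes."""
    sizes_grid = np.arange(1, max_size + 1)
    per_run = [np.interp(sizes_grid, np.asarray(s), np.asarray(e)) for s, e in runs]
    per_size = np.mean(per_run, axis=0)   # average error for each tree size
    return float(np.mean(per_size))       # lower is better
```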
5.4 Discussion

The accuracy and the size of the induced FDT can be used to measure the effectiveness of FDT induction algorithms. Occam's razor, as applied in machine learning, advises inducing the simplest but most accurate FDTs. This is the motivation behind the introduction of the iterative approach IDFID3, which uses one stopping criterion, for fuzzy decision tree induction. In each iteration, IDFID3 determines a specific threshold value for the stopping criterion. This threshold value is based on the FDT constructed in the previous iteration. In other words, the IDFID3 algorithm controls the growth of an FDT by dynamically setting the threshold value of the stopping criterion. In contrast to standard FID3, which grows the tree in a greedy manner as in a depth first search, IDFID3 grows the fuzzy decision tree taking into account both depth and width. The stopping criterion NMGNI has been evaluated (see details in [5]). This criterion approximates the usefulness of expanding a node. Moreover, we have reviewed in the previous section a method for comparing different stopping criteria irrespective of their threshold values. This method is based on the IDFID3 algorithm. Zeinalkhani and Eftekhari [5] did a series of experiments in which different stopping criteria were considered for the IDFID3 algorithm. Different datasets were used for this comparison, and the effects of the stopping criteria on accuracy have been reported. In addition to the proposed NMGNI stopping criterion, five other stopping criteria were considered: accuracy, number of instances, fuzzy information gain, fuzzy entropy, and tree depth (depth of node). The study considered twenty numerical datasets selected from both the UCI machine learning repository [8] and the KEEL dataset repository [9]. The number of instances in the experiments ranged from 106 (for the breast tissue dataset) to 2201 (for the titanic dataset), the number of attributes ranged from 3 to 60, and the number of classes ranged from 2 to 10. Membership functions for the domains of the attributes were defined as uniformly distributed triangular membership functions. The authors of [5] state that the results obtained from their experiments show that their stopping criterion outperforms the other stopping criteria, with the exception of the number of nodes and the number of features criteria; that is, these two criteria performed better than the fuzzy information gain stopping criterion in terms of both the classification accuracy and the number of nodes of the induced FDTs. Moreover, the tree depth and the fuzzy information gain stopping criteria outperform the fuzzy entropy, accuracy and number of instances criteria in terms of the mean depth of the generated trees. Thus, they concluded that tree depth and fuzzy information gain produce more interpretable FDTs compared to the other criteria.
5.5 Computer Programming Exercise for Future Works

1. Propose a new stopping criterion by means of combining existing ones using hesitant fuzzy set concepts. Then, develop a computer program to implement the method described in this chapter and compare the new criterion with it.
References

1. Rokach, Lior, and Oded Maimon. 2005. Top-down induction of decision trees classifiers: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 35 (4): 476-487.
2. Rokach, Lior, and Oded Z. Maimon. 2008. Data mining with decision trees: Theory and applications, vol. 69. Singapore: World Scientific.
3. Cintra, Marcos E., Maria C. Monard, and Heloisa A. Camargo. 2010. Evaluation of the pruning impact on fuzzy C4.5. In Brazilian Congress on Fuzzy Systems-CBSF, 257-264.
4. Quinlan, J. Ross. 1986. Induction of decision trees. Machine Learning 1 (1): 81-106.
5. Zeinalkhani, Mohsen, and Mahdi Eftekhari. 2014. Comparing different stopping criteria for fuzzy decision tree induction through IDFID3. Iranian Journal of Fuzzy Systems 11 (1): 27-48.
6. Zeinalkhani, Mohsen, and Mahdi Eftekhari. 2011. A new measure for comparing stopping criteria of fuzzy decision tree. In The 1st International eConference on Computer and Knowledge Engineering (ICCKE), 71-74. IEEE.
7. Korf, Richard E. 1985. Depth-first iterative-deepening: An optimal admissible tree search. Artificial Intelligence 27 (1): 97-109.
8. Dua, Dheeru, and Casey Graff. 2017. UCI machine learning repository. Irvine: University of California, School of Information and Computer Sciences.
9. Alcalá-Fdez, Jesús, Alberto Fernández, Julián Luengo, Joaquín Derrac, Salvador García, Luciano Sánchez, and Francisco Herrera. 2011. KEEL data-mining software tool: Dataset repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing 17 (2-3): 255-287.
Chapter 6
Hesitant Fuzzy Decision Tree Approach for Highly Imbalanced Data Classification
6.1 Introduction

Imbalanced data classification is a problem of great interest in both machine learning and bioinformatics [1]. A classification problem contains imbalanced data when the distribution of data samples is not the same in the different classes; that is, there is usually a large difference between the numbers of instances in the different classes. If this is the case, learning algorithms, with their goal of maximizing the accuracy of the inferred model, may ignore the samples of the smaller class. The problem of imbalanced classification applies to any arbitrary number of classes. When there are only two, the samples of one class, called the majority class, largely outnumber the ones assigned to the other class, called the minority class. In this chapter we consider the problem of constructing a hesitant fuzzy decision tree for the classification of imbalanced datasets. Dealing with imbalanced classification problems is different from dealing with balanced data. Some methods have been introduced to address this problem. They are grouped into three general categories [2]: data level approaches, cost-sensitive approaches, and algorithm level approaches. Data level approaches tend to balance datasets by increasing the samples of the minority class (oversampling) or by decreasing the samples of the majority class (undersampling). Both of these approaches have some disadvantages. For example, in oversampling approaches, an increase in the size of the training data increases the run time of the algorithm, and in some cases the learning process may also lead to over-fitting. The disadvantage of undersampling techniques is that such methods may remove some useful data. In cost-sensitive approaches, some misclassification costs are considered for the data of the minority class and, in most cases, these costs are specified by a cost matrix. The main drawback of this approach is that in most cases there is not adequate information to determine the actual costs for this cost matrix. The algorithm level approach is about adapting classification algorithms to imbalanced datasets by means of making some changes
in the existing algorithms themselves. Examples of such algorithm modifications include adding some bias to support the minority class samples. Mahdizadeh and Eftekhari [3, 4] proposed two methods founded on a fuzzy rule based classifier for highly imbalanced datasets. These classifiers were produced by combining subtractive clustering and evolutionary algorithms for generating fuzzy rules. The FH-GBML method, introduced by Ishibuchi et al. [5], is a combination of the Pittsburgh and Michigan methods and its goal is to build a fuzzy rule based classifier. In this method, the Pittsburgh approach is applied first to generate a population of rule sets. Then, with a predetermined probability, a single Michigan iteration is executed on each of the rule sets. Nakashima et al. [6] introduced the WFC (Weighted Fuzzy Classifier) method, a weighted fuzzy rule based classification approach. Mansoori et al. [7] introduced SGERD, a method based on genetic algorithms in which the number of iterations is selected according to the problem dimension. The fitness function of SGERD is determined by a rule evaluation criterion. In GFS-LogitBoost [8] and GFS-Max LogitBoost [9], each weak learner is a fuzzy rule extracted from the data by the genetic algorithm. When a new simple classifier is added to the compound one, the examples in the training set are re-weighted. Berlanga et al. [10] introduced the GP-COACH method. This approach applies genetic programming to obtain fuzzy rules. The initial population of the genetic programming is generated by the rules of a context-free grammar. This technique is designed in such a way that the rules compete and cooperate with each other to generate the collection of fuzzy rules. Villar et al. [11] use the GA-FS+GL approach, a genetic algorithm for feature selection and granularity learning for fuzzy rule based classification systems. Their aim is to extract a linguistic fuzzy rule based classifier system (FRBCS) with high accuracy for highly imbalanced datasets. Lopez et al. [12] presented the GP-COACH-H method, a hierarchical fuzzy rule base classifier. This classifier uses information granulation and genetic programming. GP-COACH-H uses GP-COACH as the basis of the hierarchical model. Hong-Liang Dai [13] introduced a fuzzy total margin based support vector machine. In Dai's approach, a total margin is considered instead of the soft margin algorithm. In order to define the total margin, extra surplus variables are added to the formula of the soft margin so that they measure the distance between the correctly classified data points and the hyperplane. The total margin thus calculates the distance of all data points from the separating hyperplane. When the extension of the error bound is used, the method performs better than standard SVM. The weights of the misclassified samples and the correctly classified samples are fuzzy values. C. Edward Hinojosa et al. [14] proposed the IRL-ID-MOEA approach, in which a multi-objective evolutionary algorithm is used to learn fuzzy classification rules from imbalanced datasets. In this method, one rule is learnt in each run of the multi-objective evolutionary algorithm. This approach first uses the SMOTE+TL preprocessing approach to balance the imbalanced datasets.
Then, it uses an iterative multi-objective evolutionary algorithm, NSGA-II, to learn the fuzzy rules, using the accuracy and the number of premise variables of each fuzzy rule as the two objectives.
Fuzzy decision tree algorithms are among the most powerful classifiers for dealing with any kind of data. In this chapter, we describe an approach that combines data level and algorithm level methods to correctly classify imbalanced datasets. The method first uses the k-means clustering algorithm at the data level to divide the samples of the majority class into several clusters, so that the dataset becomes balanced with no change in the total number of samples. Then, the samples of each cluster are labeled with a new synthetic class label. These data are used to build Fuzzy Decision Trees (FDTs) that rely on techniques based on Hesitant Fuzzy Sets (HFSs). In particular, to select the best attribute for expanding the FDTs, a new splitting criterion, the Hesitant Fuzzy Information Gain (HFIG), is used in place of the Fuzzy Information Gain (FIG). The HFIG criterion is defined in terms of an aggregation of standard FIGs, each of them computed from a different discretization method. The splitting criteria are used to generate different FDTs. More particularly, as we will see in detail later, five different FDTs are generated using five different discretization methods. In order to explore the differences between the different methods, a taxonomy is offered that categorizes these methods into three general groups (see Sect. 6.2.5). These groups can be distinguished in terms of the splitting criterion (i.e., either FIG or HFIG), and in terms of the aggregation approach (i.e., whether aggregation is considered or not). When aggregation is considered, the results of the five FDTs are combined using three aggregation approaches: fuzzy majority voting, hesitant fuzzy information energy voting, and fuzzy weighted voting. The experimental results described in [15], and briefly discussed below, show that this approach outperforms other fuzzy rule-based approaches over 20 highly imbalanced datasets of KEEL [16] in terms of AUC (a definition of AUC, the Area Under the Curve, is given in [3]). The main innovation of the study presented in this chapter is the fusion of different sources of information to construct efficient FDTs via Hesitant Fuzzy Sets (HFSs). First, as briefly mentioned above, five discretization methods are used: Fayyad, Fusinter, Fixed frequency, Proportional, and Equal-Frequency. These methods are used to generate different cut-points for each attribute. Next, the corresponding Membership Functions (MFs) are defined over the obtained cut-points. Then, these MFs are considered as different sources of information that can be combined for the construction of FDTs. More particularly, five FDTs are constructed based on these five discretization methods. These membership functions are used to compute FIG or HFIG, depending on the approach. In case HFIG is used, it is calculated taking into account all five sets of membership functions associated with a node. FIG and HFIG are used to select the nodes of a tree. These nodes then need to have membership functions associated with them; at this point, one of the five alternative sets of membership functions is used. In this way, five FDTs are generated (one for each discretization method). After that, when a new instance needs to be classified, the outcomes of these FDTs for the particular instance can be combined through three alternative aggregation methods; for example, one of the aggregation methods is the hesitant information energy of the fuzzy compatibility degrees. The rest of this chapter is organized as follows. First, in Sect.
6.2 we describe the approach to build fuzzy decision trees based on hesitant fuzzy sets. This includes
clustering for data balancing, generation of membership functions, and the construction of the tree based on information gains (FIG and HFIG). Section 6.3 discusses the results of the experiments reported in [15] over 20 highly imbalanced datasets. Section 6.4 concludes the chapter with some inspiring exercises.
6.2 Hesitant Fuzzy Decision Tree Approach

The method for building FDTs described in this chapter addresses the problem of imbalanced classes by converting the majority class into several classes. This technique includes three general components: data balancing, FDT construction, and the aggregation of FDTs. The method uses the k-means clustering algorithm to divide the majority class into several classes. Next, five discretization methods, namely Fayyad, Fusinter, Fixed frequency, Proportional, and Equal-Frequency, are used to obtain cut-points. Then, the MFs of each attribute are generated; to do so, the partitions created for each discretization method are used. Thereafter, five FDTs are constructed on the obtained fuzzy datasets. Finally, these classifiers are aggregated into a single classifier to perform the classification of new data. The steps of the proposed method are explained in Sects. 6.2.1, 6.2.2, 6.2.3, 6.2.4, and 6.2.5 in more detail.
6.2.1 Data Balancing

Two conventional methods to balance imbalanced datasets are undersampling and oversampling. Undersampling may discard some useful information by reducing the number of samples of the majority class, and oversampling is likely to lead to over-fitting by increasing the number of samples of the minority class. Hence, it is reasonable to convert the majority class into several classes without changing the amount of data. To this end, clustering techniques can be employed. First, k-means clustering is applied to the samples of the majority class to obtain multiple clusters. Then, each sample of these clusters is labeled using its cluster number. As a result, we obtain a new balanced dataset with multiple pseudo-classes. This idea is borrowed from Yin et al. [17], who use this approach for feature selection. Clustering has also been applied to the majority class in some previous works; for example, it was used to find and remove data points that are far away from the cluster centers [18], and to produce the data to construct several classifiers that are used as inputs for ensemble methods [19]. Furthermore, clustering techniques have been applied to imbalanced classification based on a support vector machine (SVM) classifier [20].
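A minimal sketch of this balancing step, assuming scikit-learn's KMeans and NumPy arrays for the data, is given below; the function name, the string pseudo-labels and the choice of n_clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def balance_by_clustering(X, y, majority_label, n_clusters):
    """Split the majority class into `n_clusters` pseudo-classes via k-means,
    leaving the minority samples and the total number of samples untouched."""
    y_new = y.astype(object).copy()
    maj = np.where(y == majority_label)[0]
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[maj])
    for idx, c in zip(maj, clusters):
        y_new[idx] = f"{majority_label}_{c}"   # synthetic pseudo-class label
    return X, y_new
```

In practice n_clusters would be chosen so that the resulting pseudo-classes are roughly the size of the minority class.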
6.2.2 Generating the Membership Functions

We have already seen in Chap. 4 that different discretization methods use different criteria to produce splitting cut-points [21]. We have also seen that generating membership functions on the domain of an attribute can be considered as either a fuzzy discretization or a fuzzy partitioning; that is, each fuzzy part is described by a particular membership function. Recall that the partitions generated by a discretization method are defined by the cut-points, the extreme ones (the lower and the upper cut-point) and the middle ones. As mentioned above, five discretization techniques are used to generate these cut-points. Then, the MFs are defined over these cut-points based on the standard deviation method introduced in [21] (see Sect. 4.3.1.2). Membership functions are triangular for all middle MFs and trapezoidal for the leftmost and rightmost ones. The approach uses the mean and the standard deviation of the samples located in each part of the partition. In addition, we also need a user-defined parameter called stdCoefficient, which is a positive real number. This parameter controls the fuzziness of the generated MFs. The membership grades of the lower and upper cut-points of a membership function built using the standard deviation based method are not necessarily equal to 0.5. The expressions to compute the membership functions are presented in Table 4.4. Middle membership functions are triangular and are defined by three parameters a, b and c; the expressions to compute these parameters are given in the second column. The second parameter of the triangular MF is set to the mean value of all examples inside the part, and its first and last parameters are set to the points at a distance of 2 × stdVal on the left and right hand sides of the mean value, respectively. The stdVal value is the standard deviation (std) of all examples inside the part, normalized by multiplying it by the user-defined parameter stdCoefficient. Leftmost and rightmost membership functions are trapezoidal and defined by means of four parameters a, b, c, and d; their definition is given in the first and the last columns of Table 4.4.
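For a middle part, the standard deviation based triangular MF described above can be sketched as follows; the function name is ours and only the middle-MF case of Table 4.4 is shown.

```python
import numpy as np

def std_based_middle_mf(values, std_coefficient):
    """Triangular MF of a middle part from the standard-deviation method:
    the peak b sits at the mean of the examples inside the part and the feet
    a and c sit at a distance of 2 * stdVal on either side, where
    stdVal = std * stdCoefficient."""
    values = np.asarray(values, dtype=float)
    std_val = np.std(values) * std_coefficient
    b = values.mean()
    a, c = b - 2 * std_val, b + 2 * std_val
    return a, b, c   # parameters of the triangular MF
```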
6.2.3 Construction of Fuzzy Decision Trees

The algorithm to construct fuzzy decision trees (FDTs) was explained in Algorithm 1.1 (Sect. 1.2.1.1). We will describe in Sect. 6.2.5 the approaches used here for building several FDTs. Each approach generates several trees. The first approach applies the fuzzy information gain (FIG) as the node selection criterion, and thereby five FDTs can be constructed based on the five discretization methods mentioned above. The second approach uses HFIG as the splitting criterion. This approach also generates five FDTs, in which the nodes are the same for all FDTs because these nodes are selected based on the HFIG technique. Nevertheless, the trees differ because each tree is based on a specific type of discretization. In other words,
all trees share a common shape (e.g., number of nodes) and the same features associated with each node (as selected according to HFIG). Nevertheless, the membership functions associated with the nodes depend on the particular discretization method. Then, once the different decision trees are generated, we may just select one of these trees or, alternatively, we may consider the aggregation of their results. The next section discusses aggregation in case the second approach is selected.
6.2.4 The Aggregation of FDTs

Let us consider the aggregation of the outcomes of the fuzzy decision trees. That is, we have applied a set of fuzzy decision trees to an instance and have obtained the corresponding outcomes. Then, we need to aggregate these outcomes. Let us suppose that, for a given instance, the possible class labels of the outcome are C1 and C2. Moreover, let AM1 and AM2 represent the final classification results obtained by a certain aggregation method for the classes C1 and C2, respectively. For the i-th classifier, μ(C1^(i)) and μ(C2^(i)) represent the membership degrees of the instance to classes C1 and C2, respectively. The computation of these membership values for a fuzzy decision tree was described in Eqs. (1.29) and (1.30). In the approach described in this chapter, each of the five decision trees provides a membership value for each class (e.g., μ(C1^(i)) for i = 1, ..., 5 for class C1). Therefore, we have five membership values for each class. These membership values are aggregated using three aggregation approaches: Fuzzy Majority Voting (FMV), Fuzzy Weighted Voting (FWV), and Hesitant Fuzzy Information Energy Voting (HFIEV). Table 6.1 briefly describes these three aggregation methods and summarizes the details of their aggregation strategies. As described above, the aggregated values are denoted by AM1 and AM2. Finally, given the aggregated results AM1 and AM2 computed using the aggregation methods presented in Table 6.1, the instance is classified as C1 if AM1 ≥ AM2, and as C2 otherwise.
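The three rules summarized in Table 6.1 below can be sketched in Python as follows; the function name and the representation of the five classifiers' outputs as two arrays are our assumptions.

```python
import numpy as np

def aggregate_votes(mu_c1, mu_c2, method="FMV"):
    """Aggregation of the five FDT outputs for one instance (cf. Table 6.1).
    `mu_c1[i]` and `mu_c2[i]` are the membership degrees assigned to classes
    C1 and C2 by the i-th FDT; the function returns the predicted class."""
    mu_c1, mu_c2 = np.asarray(mu_c1, float), np.asarray(mu_c2, float)
    if method == "FMV":            # fuzzy majority voting
        am1 = np.sum(mu_c1 >= mu_c2)
        am2 = np.sum(mu_c2 >= mu_c1)
    elif method == "FWV":          # fuzzy weighted voting
        am1, am2 = mu_c1.sum(), mu_c2.sum()
    else:                          # HFIEV: hesitant fuzzy information energy
        am1 = np.mean(mu_c1 ** 2)
        am2 = np.mean(mu_c2 ** 2)
    return "C1" if am1 >= am2 else "C2"
```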
6.2.5 Notations for Different FDT Classifiers

To represent each of the possible FDT classifiers considered in this chapter, we use a notation that specifies three terms: first, the FDT construction measure; next, the fuzzy partitioning method; and, finally, the FDT aggregation method. We express this as follows: Construction measure/Partition method/Aggregation. Among all possible combinations of FDTs, we distinguish three categories. They are described in the next sections. Informally, the first category includes the generation of trees using HFIG and no aggregation, and the second and third categories include the generation of trees using FIG and HFIG, respectively, followed by aggregation.
Table 6.1 The strategies and descriptions of the aggregation methods used

Fuzzy Majority Voting (FMV)
Strategy: AM_1 = \sum_{i=1}^{5} f(\mu(C_1^{(i)}), \mu(C_2^{(i)})), \quad AM_2 = \sum_{i=1}^{5} f(\mu(C_2^{(i)}), \mu(C_1^{(i)})).
Description: For the i-th classifier, if μ(C1^(i)) ≥ μ(C2^(i)), then class C1 gets a vote; if μ(C2^(i)) ≥ μ(C1^(i)), then class C2 gets a vote. If AM1 ≥ AM2, then class C1, else class C2.

Fuzzy Weighted Voting (FWV)
Strategy: AM_1 = \sum_{i=1}^{5} \mu(C_1^{(i)}), \quad AM_2 = \sum_{i=1}^{5} \mu(C_2^{(i)}).
Description: Use the summation of the classification degrees of the 5 classifiers for each class label. If AM1 ≥ AM2, then class C1, else class C2.

Hesitant Fuzzy Information Energy Voting (HFIEV)
Strategy: AM_1 = \frac{1}{5}\sum_{i=1}^{5} \left(\mu(C_1^{(i)})\right)^2, \quad AM_2 = \frac{1}{5}\sum_{i=1}^{5} \left(\mu(C_2^{(i)})\right)^2.
Description: Use the information energy of the classification degrees of the 5 classifiers for each class label. If AM1 ≥ AM2, then class C1, else class C2.

The function f(x, y) is defined as f(x, y) = 1 if x ≥ y, and f(x, y) = 0 if x < y.