Reverse Clustering: Formulation, Interpretation and Case Studies [957, 1 ed.] 3030693589, 9783030693589

This book presents a new perspective on and a new approach to a wide spectrum of situations, related to data analysis, a


English · Pages 118 [117] · Year 2021


Table of contents :
Preface
Introduction
Contents
List of Figures
List of Tables
1 The Concept of Reverse Clustering
1.1 The Concept
1.2 The Notation
1.3 The Elements of Vector Z: The Dimensions of the Search Space
1.4 The Criterion: Maximising the Similarity Between Partitions PA and PB
1.5 The Search Procedure
References
2 Reverse Clustering—The Essence and The Interpretations
2.1 The Background and the Broad Context
2.2 Some More Specific Related Work
2.3 The Interpretations
References
3 Case Studies: An Introduction
3.1 A Short Characterisation of the Cases Studied
3.2 The Interpretations of the Cases Treated
References
4 The Road Traffic Data
4.1 The Setting
4.2 The Experiments
4.3 Conclusions
References
5 The Chemicals in the Natural Environment
5.1 The Data and the Background
5.2 The Procedure: Determining the Partition PA
5.3 The Procedure: Reverse Clustering
5.4 Discussion and Conclusions
References
6 Administrative Units, Part I
6.1 The Background: Polish Administrative Division and the Province of Masovia
6.2 The Data
6.3 The Analysis Regarding the Administrative Categorization of Municipalities
6.4 A Verification
6.5 The Analysis Regarding the Functional Categorization of Municipalities
6.6 Conclusions and Discussion
References
7 Administrative Units, Part II
7.1 The Background
7.2 The Computational Experiments
7.3 Discussion and Conclusions
References
8 Academic Examples
8.1 Introduction
8.2 Fisher’s Iris Data
8.3 Artificial Data Sets
8.4 Conclusions
References
9 Summary and Conclusions
9.1 Interpretation and Use of Results
9.2 Some Final Observations
Reference

Studies in Computational Intelligence 957

Jan W. Owsiński · Jarosław Stańczak · Karol Opara · Sławomir Zadrożny · Janusz Kacprzyk

Reverse Clustering Formulation, Interpretation and Case Studies

Studies in Computational Intelligence Volume 957

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/7092

Jan W. Owsiński · Jarosław Stańczak · Karol Opara · Sławomir Zadrożny · Janusz Kacprzyk







Reverse Clustering Formulation, Interpretation and Case Studies


Jan W. Owsiński, Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland

Jarosław Stańczak, Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland

Karol Opara, Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland

Sławomir Zadrożny, Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland

Janusz Kacprzyk, Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland

ISSN 1860-949X
ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-030-69358-9
ISBN 978-3-030-69359-6 (eBook)
https://doi.org/10.1007/978-3-030-69359-6

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

We witness nowadays an explosive growth and development of methods and techniques related to data analysis, this growth being conditioned, on the one hand, by the rapidly expanding availability of data in virtually all domains of human activity and, on the other hand, by the very substantial progress in the technical and scientific capabilities of dealing with the increasing volumes of data. All this amounts to a dramatic change, especially in quantitative terms. Yet, as researchers and practitioners involved in work on the methodological side of data analysis know very well, many of the fundamental substantive problems in this domain still require solutions, or at least better solutions than those available now. This concerns, in particular, such fundamental areas as clustering, classification, rule extraction, and so on. The primary issue here is the trade-off between precision or accuracy and speed or computational cost (when the problem at hand is already truly well defined). Nor can one forget the very strong data dependence of the effectiveness and efficiency of many of the methodologies applied nowadays, which makes the situation even more difficult. The present book addresses this nexus of issues, aiming, in this case, at the interface of clustering and classification, but, in fact, being relevant to a much broader domain, with much broader implications in terms of applicability and interpretation. Namely, it describes the paradigm of “reverse clustering”, introduced by the present authors. The paradigm concerns the situation in which we are given a certain data set, composed of entities, observations, objects…, as is usual in data analysis, and, at the same time, we are given, or we consider, a certain partition of this data set. We do not assume a priori anything about the data set, nor about the partition, and, most importantly, nothing about the relation between the data set and the partition.
Thus, the partition may be the result of a definite kind of analysis of the given data set, but may, as well, result from quite a different mechanism (e.g. a division of the set of objects according to some variable or criterion not contained in the data set at hand).


Under these circumstances, the data set and the partition being given, we try to reconstruct the partition on the basis of the data set, using cluster analysis. We try to find the entire clustering procedure that will yield, for the given data set, a partition as close to the given one as possible. Thus, the result of the procedure is both the clustering procedure itself, defined by a number of attributes (clustering method, its parameters, variable selection, distance definition, …), and the concrete partition found. The paradigm obviously borders upon classification (under a very specific formulation and interpretation of the situation faced), but it extends to a much broader domain, in which the perception of the problem itself and the meaning of solutions can vary very widely. This is, in particular, shown in the present book. At the current stage of work, the results obtained and largely contained in this book pertain mainly to the substantive aspect of the paradigm, while the technical aspects of the respective algorithms are, as of now, left to future research. The reverse clustering paradigm constitutes a new perspective on quite a broad spectrum of problems in data analysis and, as the book shows, it can provide very interesting, instructive and significant results under a wide variety of interpretational assumptions. We sincerely hope, therefore, that this book not only gives the Readers new material and fresh insight into some problems of data analysis, but also provokes them to deeper studies in the direction indicated here. Warsaw, Poland

Jan W. Owsiński
Jarosław Stańczak
Karol Opara
Sławomir Zadrożny
Janusz Kacprzyk

Introduction

This book is devoted to an approach, or a paradigm, developed by the authors and applied to a series of cases of diverse character, mostly based on real-life data; the approach belongs to the broadly understood domain of data analysis, more precisely to classification and cluster analysis. We call the approach “reverse clustering” because of its logic, which is formulated as follows: Assume we are given a set of data, X, composed of n objects or observations, indexed i, i = 1, …, n, each of them described by a vector of m features or variables, indexed k, the respective vector being denoted xi = (xi1, …, xik, …, xim). At the same time, assume we are given a partition of the set X of objects into subsets, this partition being denoted PA. For these data, we try to obtain a partition PB that is as close to PA as possible, by applying clustering algorithms to the set X. Thereby, we find both the partition PB that is as close as possible to PA and the concrete clustering procedure, with all its parameters, which yields this partition PB. The above does not explicitly state the purpose of the exercise (to say nothing of the technical details), but it can easily be deduced that what is aimed at is closely related to the notion of classification. While the close relation with classification is not only obvious, but definitely true, the paradigm has a much wider spectrum of applications and meanings, as explained in Chap. 2 of the book, following the more precise presentation of the paradigm in Chap. 1. The paradigm is constituted, first, by the above statement of the problem, which then has to be expressed in pragmatic technical terms, involving (1) the space of clustering algorithms with its granularity (which algorithms are accounted for and which parameters, defining the entire clustering procedure, are subject to the search for PB);


(2) the measure of similarity between the partition of the set X given at the outset, i.e. PA, and the partitions obtained from the clustering algorithms, this measure being maximised (or, equivalently, a measure of distance between them being minimised); and (3) the technique of search for PB, given the data of the concrete problem. The paradigm is, however, also, and perhaps even more importantly, constituted by the interpretation of the entire setting, and by the particular instances of this interpretation, treated at length in Chap. 2. This is important insofar as it places the paradigm against the background of the data analysis domain, with special emphasis on classification and related fields. These various interpretation instances are associated primarily with the status of the partition PA: its source, the degree of credibility we assign to it, and its actual or presumed connection with the data set X. Depending on these, and on the results obtained, the status of the obtained partition PB, including its validity and applicability, will also vary significantly. Owing to this variety of interpretations, the paradigm may find application in a broad spectrum of analytic, but also cognitive, situations. The subsequent chapters of the book, starting with the third one, are devoted to the presentation of the cases treated, which differ not only in their subject matter (the domain from which the data come), but, largely, in the interpretation of the actual problem and the results obtained. The implication is that the paradigm can be used in many data-analytic circumstances and for diverse purposes, whenever structuring the data set into groups is appropriate. The paradigm of reverse clustering has already been presented in several papers by the same team of authors, e.g. in Owsiński et al. (2017a, b) and Owsiński, Stańczak and Zadrożny (2018).
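The three components just listed determine a candidate configuration, so the search can be read as an optimisation over such configurations. The following is a minimal, self-contained sketch of that idea, not the authors' implementation (the book's experiments use evolutionary search and differential evolution over a much richer space): here the configuration Z is reduced to a hypothetical pair (number of clusters k, feature weights), the clustering algorithm is a naive k-means with deterministic initialisation, and the similarity criterion is the plain Rand index.

```python
# Toy sketch of the reverse-clustering search: given a data set X and a prior
# partition PA, scan a small configuration space Z = (k, feature weights) and
# keep the configuration whose clustering result PB is most similar to PA
# under the Rand index. Illustrative only; the book's search space and
# optimisation technique are far richer.
import itertools
from math import comb, dist

def rand_index(pa, pb):
    """Fraction of object pairs on which the two partitions agree."""
    n = len(pa)
    agree = sum((pa[i] == pa[j]) == (pb[i] == pb[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / comb(n, 2)

def kmeans(points, k, iters=20):
    """Naive k-means with deterministic initialisation (first k points)."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels

# Two well-separated groups; PA plays the role of the given partition.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
PA = [0, 0, 0, 1, 1, 1]

best = None
for k, w in itertools.product([2, 3], [(1, 1), (1, 0)]):  # toy space of Z
    Xw = [tuple(wi * xi for wi, xi in zip(w, p)) for p in X]
    score = rand_index(PA, kmeans(Xw, k))
    if best is None or score > best[0]:
        best = (score, k, w)

print(best)  # → (1.0, 2, (1, 1)): PB coincides with PA at k = 2, both features kept
```

With these toy data the search recovers k = 2 with both features kept and the resulting PB coincides with PA; on real data the maximal similarity typically stays below 1, and the recovered configuration Z* is itself the object of interpretation.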
The present book aims at a more complete presentation of the paradigm and its interpretations. The book does not go into the computational and numerical issues and details, which are, of course, of very high importance. Rather, its main purpose is to present the approach and its capacities in terms of the various kinds of situations, problems and interpretations of the respective results. We do hope it conveys the intended message in an effective and interesting manner. The book is structured in the following manner: first, Chap. 1 presents the scheme of the approach, characterised, in particular, as it has been used in the cases illustrated in this book, along with the notation used. Then, Chap. 2 outlines the context of reverse clustering, starting with other approaches that concern similar kinds of data analysis problems, including an ample reference to the very general idea of reverse engineering, as well as to explainable artificial intelligence and data analysis. The context is then briefly analysed in terms of the more detailed specific problems arising in connection with both the reverse clustering procedure and data analytic methods in a more general perspective (such as the selection of variables or the definition of distance). This chapter also contains a very important section on the potential interpretations of the reverse clustering paradigm and its results. Chapter 3 constitutes a very short introduction to the cases studied


and illustrated in the book, which are then presented in the consecutive chapters: Chap. 4 is devoted to motorway traffic data, Chap. 5 to environmental contamination data, Chaps. 6 and 7 to two separate cases of typologies or classifications of administrative units in Poland, and, finally, Chap. 8 to some more academic exercises. The book closes with Chap. 9, summarising the work done and proposing some new vistas. This book is intended to offer the Readers truly interesting and novel perspectives on data analysis, regarding the diverse ways of formulating and approaching problems and of understanding the results, and we shall be very satisfied if it does so at least to a perceptible degree.

Jan W. Owsiński
Jarosław Stańczak
Karol Opara
Sławomir Zadrożny
Janusz Kacprzyk

Contents

1 The Concept of Reverse Clustering .......... 1
  1.1 The Concept .......... 1
  1.2 The Notation .......... 4
  1.3 The Elements of Vector Z: The Dimensions of the Search Space .......... 5
  1.4 The Criterion: Maximising the Similarity Between Partitions PA and PB .......... 11
  1.5 The Search Procedure .......... 12
  References .......... 13
2 Reverse Clustering—The Essence and The Interpretations .......... 15
  2.1 The Background and the Broad Context .......... 15
  2.2 Some More Specific Related Work .......... 20
  2.3 The Interpretations .......... 25
  References .......... 31
3 Case Studies: An Introduction .......... 37
  3.1 A Short Characterisation of the Cases Studied .......... 37
  3.2 The Interpretations of the Cases Treated .......... 40
  References .......... 41
4 The Road Traffic Data .......... 43
  4.1 The Setting .......... 43
  4.2 The Experiments .......... 45
  4.3 Conclusions .......... 51
  References .......... 52
5 The Chemicals in the Natural Environment .......... 53
  5.1 The Data and the Background .......... 53
  5.2 The Procedure: Determining the Partition PA .......... 56
  5.3 The Procedure: Reverse Clustering .......... 58
  5.4 Discussion and Conclusions .......... 61
  References .......... 62
6 Administrative Units, Part I .......... 63
  6.1 The Background: Polish Administrative Division and the Province of Masovia .......... 63
  6.2 The Data .......... 64
  6.3 The Analysis Regarding the Administrative Categorization of Municipalities .......... 67
  6.4 A Verification .......... 71
  6.5 The Analysis Regarding the Functional Categorization of Municipalities .......... 72
  6.6 Conclusions and Discussion .......... 76
  References .......... 78
7 Administrative Units, Part II .......... 79
  7.1 The Background .......... 79
  7.2 The Computational Experiments .......... 80
  7.3 Discussion and Conclusions .......... 87
  References .......... 88
8 Academic Examples .......... 89
  8.1 Introduction .......... 89
  8.2 Fisher’s Iris Data .......... 89
  8.3 Artificial Data Sets .......... 91
  8.4 Conclusions .......... 94
  References .......... 94
9 Summary and Conclusions .......... 95
  9.1 Interpretation and Use of Results .......... 95
  9.2 Some Final Observations .......... 98

List of Figures

Fig. 2.1 The scheme of the reverse clustering problem formulation .......... 25
Fig. 2.2 The scheme of potential cases of interpreting the paradigm of reverse clustering .......... 28
Fig. 2.3 An illustration of division of a set of objects according to the rule of “putting together the dissimilar and separating the similar”; colours indicate the belongingness to three groups: blue, red and green .......... 30
Fig. 3.1 Rough indication of interpretations of the cases treated against the framework of Fig. 2.1 .......... 41
Fig. 4.1 Median hourly profiles of traffic for the classes of the days of the week .......... 44
Fig. 4.2 Hourly profiles of traffic intensity for individual hours of the week. Colours, assigned to successive days, denote the clusters forming the initial partition PA .......... 44
Fig. 4.3 Visual interpretation of clusters described in Table 4.2 .......... 49
Fig. 5.1 Concentration levels for Pb: areas in the order of increasing Pb concentrations .......... 55
Fig. 5.2 Concentration levels for Cd: areas in the order of increasing Cd concentrations .......... 55
Fig. 5.3 Concentration levels for Zn: areas in the order of increasing Zn concentrations .......... 56
Fig. 5.4 Concentration levels for S: areas in the order of increasing S concentrations .......... 56
Fig. 5.5 The distribution of points (“areas”) in the space of concentrations: a for particular elements and pairwise; b enlarged for Zn and S (upper box) and for Pb and Cd (lower box); see the text further on for the interpretation of colours .......... 57
Fig. 6.1 Data on municipalities of the province of Masovia with administrative categorisation into three categories on the plane of the first two principal components (colours refer to the results from Table 6.6) .......... 69
Fig. 6.2 Map of the province of Masovia with the indication of the municipalities classified in three clusters resulting from the reverse clustering according to the data from Table 6.3. Red area in the middle corresponds to Warsaw and its neighbourhood; the bigger red blobs correspond to subregional centres (Radom, Płock, Siedlce and Mińsk Mazowiecki) .......... 70
Fig. 6.3 Map of Masovia province with the partition PB from Table 6.11 .......... 76
Fig. 6.4 Map of Masovia province with the partition PB from Table 6.12 .......... 77
Fig. 7.1 Two examples of the procedures leading to the potential prior categorization of the sort similar to the one of interest here .......... 80
Fig. 7.2 Map of Poland with indication of municipalities which belonged, in the solution of Table 7.2, to the “correct” categories from the initial partition, and those that belonged to the other ones (“incorrect”) .......... 84
Fig. 7.3 Map of Poland showing the partition of the set of Polish municipalities obtained with the own evolutionary method and the k-means algorithm, composed of 12 clusters, corresponding to Table 7.2 .......... 85
Fig. 8.1 An example of the artificial data set with “nested clusters”, subject to experiments with reverse clustering .......... 92
Fig. 8.2 An example of the artificial data set with “linear broken structure”, subject to experiments with reverse clustering .......... 92
Fig. 9.1 Map of the province of Masovia showing the municipality types obtained from the reverse clustering performed with the DBSCAN algorithm, characterised in Table 9.1 .......... 100
Fig. 9.2 The meta-scheme of application of the reverse clustering paradigm .......... 101

List of Tables

Table 1.1 Values of the Lance-Williams coefficients for the most popular of the hierarchical aggregation clustering algorithms .......... 8
Table 1.2 Elements of calculation of the Rand index of similarity between partitions .......... 11
Table 4.1 Summary of results for the first series of experiments with traffic data .......... 46
Table 4.2 Results for traffic data for the entire vector of parameters Z, with the use of hierarchical aggregation (values of Rand index = 0.850, of adjusted Rand = 0.654). The upper part of the table shows the coincidence of patterns in particular Aq, based on the days of the week, and obtained Bq .......... 47
Table 4.3 Results for the traffic data obtained with the “pam” algorithm .......... 48
Table 5.1 Pollution data for Baden-Württemberg (Germany), used in the exemplary calculations: total concentrations, in mg/kg of dry weight (Pb-Lead, Cd-Cadmium, Zn-Zinc, S-Sulphur) .......... 54
Table 5.2 Numbers of areas in the classes, defined for the elements Zn and S contents in the herb layer .......... 58
Table 5.3 Contingency table for the partition PA assumed and the one obtained in Series 1 of calculations, PB, with the k-means algorithm and data only for Pb and Cd .......... 59
Table 5.4 Contingency table for the partition PA assumed and the one obtained in Series 1 of calculations, PB, with the hierarchical aggregation algorithm and data only for Pb and Cd .......... 60
Table 5.5 Contingency table for the partition PA assumed and the one obtained in Series 1 of calculations, PB, with the DBSCAN algorithm and data for all four elements .......... 60
Table 5.6 Contingency table for the partition PA assumed and the one obtained in Series 2 of calculations, PB, with the hierarchical merger algorithm and data for all four elements .......... 60
Table 6.1 Functional typology of municipalities of the province of Masovia (data as of 2009) .......... 65
Table 6.2 Variables describing municipalities, accounted for in the study .......... 66
Table 6.3 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with own evolutionary algorithm using k-means .......... 67
Table 6.4 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with own evolutionary algorithm using hierarchical aggregation .......... 67
Table 6.5 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with own evolutionary algorithm using DBSCAN .......... 68
Table 6.6 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with DE algorithm using “pam” .......... 68
Table 6.7 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with DE algorithm using “agnes” .......... 68
Table 6.8 Examples of variable weights for two runs of calculations, presented in Tables 6.3 and 6.4 .......... 71
Table 6.9 Contingency matrix for the administrative breakdown of municipalities of the province of Wielkopolska in Poland and clustering performed with the Z vector obtained for Masovia in the case shown in Table 6.3 (k-means algorithm) .......... 71
Table 6.10 Contingency matrix for the administrative breakdown of municipalities of the province of Wielkopolska in Poland and clustering performed with the Z vector obtained for Masovia in the case shown in Table 6.4 (hierarchical aggregation algorithm) .......... 72
Table 6.11 The contingency matrix for the functional typology of municipalities of Masovia from Table 6.1 and reverse clustering with own evolutionary method using the k-means algorithm .......... 73
Table 6.12 The contingency matrix for the functional typology of municipalities of Masovia from Table 6.1 and reverse clustering with own evolutionary method using hierarchical aggregation algorithm .......... 73
Table 6.13 The contingency matrix for the functional typology of municipalities of Masovia from Table 6.1 and reverse clustering with DE using “pam” algorithm .......... 74
Table 6.14 The contingency matrix for the functional typology of municipalities of Masovia from Table 6.1 and reverse clustering with DE using “agnes” algorithm .......... 75
Table 7.1 Functional typology of Polish municipalities .......... 81
Table 7.2 Contingency table for the proposed functional typology of Polish municipalities and the reverse clustering partition obtained with own evolutionary method using k-means algorithm .......... 82
Table 7.3 Variable weights in the solution illustrated in Table 7.2 .......... 83
Table 7.4 Contingency table for the proposed functional typology of Polish municipalities and the reverse clustering partition obtained with own evolutionary method using hierarchical aggregation algorithm .......... 86
Table 8.1 The results obtained for the Iris data with the DE method—comparison of “pam” and “agnes” algorithms and two selections of vector Z components (notation as in Table 4.1) .......... 90
Table 8.2 Contingency table for the DE method applied to the Iris data with the “pam” algorithm .......... 90
Table 8.3 Contingency table for the DE method applied to the Iris data with the “agnes” algorithm .......... 90
Table 8.4 The reverse clustering results for the Iris data obtained with the own evolutionary method using DBSCAN, k-means and hierarchical merger algorithms .......... 91
Table 9.1 Contingency matrix for the typological categorisation of the municipalities of the province of Masovia in Poland obtained with reverse clustering using own evolutionary algorithm and the DBSCAN algorithm (for explanations see Chap. 6) .......... 99

Chapter 1

The Concept of Reverse Clustering

1.1 The Concept

This book presents an approach, or a paradigm, within which we try to develop a reverse engineering type of procedure, aimed at reconstructing a certain partition1 of a data set X, X = {xi}, i = 1,…,n, into p subsets (clusters), Aq, q = 1,…,p. We assume that each object, indexed by i, is characterized by m variables, so that xi = (xi1,…,xik,…,xim). Having the partition PA = {Aq}q, given in some definite manner, we try to figure out the details of the clustering procedure which, when applied to X, would have produced the partition PA or a possibly accurate approximation of it. That is, we search in the space of configurations, with a particular configuration denoted by Z, this space being spanned by the following parameters:

(i) the choice of the clustering algorithm, and the characteristic parameters of the respective algorithm(s);
(ii) the selection of, or other operations on, the set of variables (e.g. weighting, subsetting, aggregation); and
(iii) the definition of a similarity/distance measure between objects, used in the algorithm.

1 The concept of a Reverse Cluster Analysis has been introduced by Ríos and Velásquez (2011) in the case of SOM-based clustering, but it is meant there in a rather different sense, of associating original data points with the nodes in the trained network.

The partition resulting from applying the clustering procedure with a candidate configuration of the above parameters is denoted PB and is composed of clusters Bq', q' = 1,…,p', PB = {Bq'}q'. The search is performed by optimizing with respect to a certain criterion, denoted Q(PA, PB), defined on pairs of partitions. So, as we denote the set of parameters comprising a configuration, the one being optimized in the search, by Z (notwithstanding the potential differences in the actual content of Z), and the space of values of these parameters by Ω, we are looking in Ω for a Z* that minimizes Q(PA, PB), where PB(Z*) is a partition of X obtained

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 J. W. Owsiński et al., Reverse Clustering, Studies in Computational Intelligence 957, https://doi.org/10.1007/978-3-030-69359-6_1


using the configuration Z*. Formally, we can treat Z as a transformation of the data set X (a cluster operator) and thus denote the optimization problem, for a given data set X and its known partition PA ∈ 𝒫(X), where 𝒫(X) denotes the set of all partitions of X, as follows:

Z* = arg min_{Z ∈ Ω} Q(PA, Z(X)),   (1.1)

PB(Z*) = Z*(X).   (1.2)

(Notice that this optimization problem is in line with the reverse engineering, or backward engineering, paradigm, i.e. a procedure that aims at finding out, for some object or process, the underlying design, architecture or implementation that led to its appearance; more on this subject in Chap. 2.)

Because of the irregular circumstances of this search (the nature of the search space and of the values of the performance criterion; see later for some more details), the solution of the optimization problem defined above is a challenging task. In our experiments, presented later in the book, the optimisation is performed with the use of evolutionary algorithms.

Altogether, we try to reconstruct the way in which PA has been obtained from X, this way being represented through the configuration Z, pertaining to the broadly conceived procedure of cluster analysis, for the very simple reason that clustering is a natural way to produce partitions of sets of data. In some specific circumstances one might imagine other approaches to the said reconstruction, but we stick to the apparently most natural one. A short discussion of this subject is also provided in Chap. 2 of the book.

In order to bring the problem formulated here closer to real-life situations, let us consider the following three examples:

Example 1: a car dealer. Assume a second-hand car dealer has at its disposal a set of data on (potential) customers who visit the dealer's website, call this set Y, and the set of data on those who actually bought a car from the dealer, the set X. Naturally, the set X is much smaller than Y (it may constitute, say, less than 1% of Y). In this case, a partition of X, PA, might be based on the makes and/or types of cars purchased by the customers represented in the data set X.

The dealer might wish to identify the "rules" leading to a partition PB of the set of purchasing customers, X, disregarding the labeling by car makes/types, yet such that it approximates PA as well as possible. The obvious objective would be to identify how the groups of customers interested in particular makes/types of cars may form. Having such a procedure (Z*) identified, one may apply it to the set Y and obtain the "classes" of (potential) customers of particular makes/types, which, in turn, may be more effectively targeted with promotional offers or just information regarding definite makes/types of cars. Thus, upon finding the Z* that produces the PB closest to PA, one might hope that by applying Z* to Y it would be possible to define the classes of

the (potential) customers at whom appropriate offers could be effectively addressed during their search through the website. These classes would form the partition PB = Z*(Y).

Example 2: categorization of chemical compounds. Assume that i ∈ I is the index of a set X of chemical compounds, which are classified in PA according to their theoretically known properties, primarily related to their toxicity or, more generally, their environmental impact. These properties, along with the associated classification PA, are based on their composition and structure, and the known consequences thereof. On the other hand, let us assume that for each compound i a vector/object xi of actual measurements and assessments "from the field" is available, reflecting the actual action and characteristics of the respective i-th compound in concrete, diverse environmental conditions. Thus, the set X may be interpreted as a set of such vectors, X = {xi}i∈I, or as a matrix X = [xik], where k is the index of an attribute characterizing object xi. These attributes may be related both to the (induced or deduced) impact on the biotic and abiotic environment and to characteristics of a more physical nature, such as penetration speed and reach, persistence, adhesion, etc. In addition, there may be multiple observations for a single compound i and, thus, X is actually a bag (multiset).

Now, the ("best", i.e. "closest" to PA) partition PB we obtain for X = {xi}i∈I, especially regarding the clustering of the xi's, partly reveals the influence of the variety of environmental situations on the actual action of the compounds, but it definitely also sheds light "backwards" on the appropriateness of the categorization PA, motivating, perhaps, a search for additional dimensions characterizing the compounds analyzed.

This can take the form of an iterative back-and-forth procedure, with the subsequent PA(t) and PB(t), obtained in consecutive iterations t, hopefully getting closer to each other, if not converging.

Example 3: the actual toxicity of mushrooms. Even though this case might be regarded as anecdotal, mushrooms do constitute an important part of cuisine and diet in many cultures, and in many of them they also lead, every year, to deaths or severe hospitalizations. It is well known that, owing to the biological properties of mushrooms, their toxicity is highly variable, and the actual effects heavily depend upon the way they are prepared (e.g. boiling mushrooms in water and then pouring this water away) and consumed, as well as upon the consumer and her/his general and current characteristics (like, e.g., age, weight, or alcohol currently consumed). The partition PA is meant to correspond to the classes of toxicity/edibility of the particular species, with the aim of communicating these characteristics to the wide public in as clear a manner as possible. Thus, PA, prepared by the experts, is juxtaposed with the partition PB, obtained from the set X of descriptions xi of actual, medically documented poisoning cases, as well as interviews with experienced cooks specialized in mushroom dishes. The juxtaposition is intended to lead to a better justified and cogently characterized classification PB(Z*), to be communicated to the wide public, including general edibility assessments, cooking indications, advice as to identification and first aid, etc.


1.2 The Notation

We shall now sum up the notation already introduced, extending it whenever necessary with the notions that will be used further on:

X = {xi}i∈I – a set of objects under consideration; this symbol, depending on the context, may be interpreted in a slightly different way (see further below);
n – number of objects (observations) in the data set considered;
i – index of the objects, i = 1,…,n;
I = {1,…,i,…,n} – the set of indices of the objects; this set of indices is often equated, for simplicity, with the set of objects;
m – number of variables (features, attributes, characteristics) describing the objects2;
k – index of the variables of the objects, k = 1,…,m;
xik – value of variable k for object i; this value belongs to a domain 𝒟k associated with variable k;
xi – complete description of the object i in the form of the vector of values xi = [xi1,…,xik,…,xim];
X – also: an n × m matrix, containing the descriptions of all n objects according to all m variables;
Ex – the Cartesian product 𝒟1 × … × 𝒟k × … × 𝒟m of the domains of all variables/attributes which are used to characterize object x;
P – a partition of the set of objects X = {xi}i∈I, often understood as the set of their indices, I, into disjoint non-empty subsets (clusters), P = {Aq}q=1,…,p, jointly covering the whole set X, i.e. ∀q Aq ≠ ∅, ∀q1 ≠ q2 Aq1 ∩ Aq2 = ∅, ⋃q Aq = X;
Aq – a cluster (subset of I), indexed by q; q = 1,…,p, where p is the number of clusters; thus, P = {Aq}; the clusters are assumed to be disjoint and to exhaust (cover) the set I (hence, we do not consider, at least not in this book, fuzzy or rough clusters);
PA – the partition which is provided together with X as the datum of the concrete problem;
Z – the vector of parameters (a configuration) of the clustering procedure, comprising the very procedure itself, applied to X, yielding a partition P = Z(X) of X;
Ω – the universe of possible/considered vectors (configurations) Z;
Q(.,.) – a measure of similarity or distance between two partitions; we shall also use the notation Q for the quality functions of partitions, when referred to explicitly;
PB – the partition obtained from the entire procedure, as supposedly the closest to PA;
d(.,.) – the distance measure between objects; for objects characterised in X, we admit a simpler notation: dij, where i, j ∈ I;
D(.,.) – the distance measure between sets of objects;
A, B, … – general notation of subsets of I;
X, Y, … – also: general notation of the data sets describing sets of objects.

2 We do not consider here, in this book, the issue of missing data. Thus, it is assumed that for all n objects each of the m variable values is specified. Although the reverse clustering paradigm applies also to the case of missing values, the book is devoted to the presentation of the main aspects and implications of the paradigm, without delving into the multiple, even if important, side issues.

1.3 The Elements of Vector Z: The Dimensions of the Search Space

We shall now give some additional details associated with the concrete implementation of the concept introduced here, according to the three aspects of the space of configurations specified before. Thereby, we shall be specifying the content of the vector Z, composed of the individual parameters subject to choice.

The choice of the clustering algorithms and their parameters. Concerning the search with respect to the clustering algorithm, throughout this volume we shall be confined to three families of algorithms:

1. the k-means-type algorithms with some of their varieties, like, e.g., k-medoids;
2. the classical progressive merger algorithms, such as single linkage, complete linkage, etc.; and
3. a representative of the local density based algorithms, in this case the DBSCAN algorithm.

No other kinds of clustering algorithms were accounted for in the experiments reported in this volume but, considering the clustering algorithms proper, those mentioned constitute the major part of the numerous clustering algorithms that could be included in the search. It was important for us to consider approaches which are by their very nature oriented at solving the clustering problem. It should be mentioned, for clarification, that the metaheuristics, very often used also for clustering purposes, are by no means clustering algorithms themselves: they do not contain in themselves a rationality oriented at a possibly good partitioning of a data set, but, quite generally, at finding an optimum solution.

We by no means provide here a review of clustering methods, this domain being the subject of a multitude of books and papers, both general, survey-like, and devoted to concrete methods and algorithms, to say nothing of a myriad of applications. For the sake of completeness we mention such general references dealing with clustering as Mirkin (1996), de Falguerolles (1977), Hayashi et al. (1996), Banks et al. (2004, 2011), Wierzchoń and Kłopotek (2018), Bramer (2007), Owsiński (2020), as well as, more focused on specific problems in clustering, Adolfsson et al. (2019), Figueiredo et al. (1999), Guha et al. (2003), or Simovici and Hua (2019).

The k-means-type algorithms. The k-means algorithms are based on the following general procedure:

1. for the given data set X = {xi}i∈I generate in some way p points3 in Ex (centroids), denote them x̄q, q = 1,…,p;
2. assign each object xi from X to the closest centroid x̄q; thus, for each xi the distances d(xi, x̄q) are calculated for q = 1,…,p, and xi is assigned to x̄q*, for which d(xi, x̄q*) = minq d(xi, x̄q); thereby, the clusters Aq are formed;
3. for the obtained clusters Aq determine the new centroids x̄q, being the "representatives" of the clusters, e.g. the means of the elements assigned to the clusters in the previous step;
4. if the stopping criterion, e.g. the lack of essential changes between the centroids in subsequent steps of the algorithm, is not (yet) satisfied, go to 2; otherwise terminate.

This simple procedure was initially formulated by Steinhaus (1956), and soon afterwards was also developed by Lloyd (1957), but the main impact came from Forgy (1965), Ball and Hall (1965), and MacQueen (1967). The fuzzy-set based version of the general k-means method, which became enormously popular and known as fuzzy c-means, was formulated by Bezdek (1981) (see also, for fuzzy partitions, Dunn 1974, and Bezdek et al. 1999), following which quite a number of varieties and algorithmic proposals within the k-means-like algorithm family were put forward (see, for instance, Lindsten et al. 2011, Dvoenko 2014, the recent work of Kłopotek 2020, or the discussions of the equivalence with Kohonen's SOMs, originally formulated by Kohonen 2001).

Nowadays, this generic procedure is implemented in a variety of manners, differing, in particular, as to the status of the x̄q (whether they are chosen from among the objects xi, the k-medoids version, or can be any elements of Ex, the classical k-means) and the way in which they are determined, and it is available through a number of open access and paid libraries. The procedure, along with its varieties, is known to converge quickly (in a couple or a dozen iterations) to a local minimum, depending upon the starting point (the initial "centroids" from step 1) and the nature of the set X. Since it converges quickly, it remains feasible to start it many times over from diverse initial sets of centroids in order to increase the chances of finding the global optimum. The local minimum that is reached through the functioning of the above procedure is, naturally, the minimum of the following criterion function:

Q(P) = Σq Σi∈Aq d(xi, x̄q).

3 Usually, instead of p we would use k, as in "k-means", but this would overlap with the earlier assumed meaning of k as an index of the variables/attributes characterizing the objects to be clustered.
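Purely as an illustration, the four steps above can be sketched in plain Python. This is our own minimal version, with squared Euclidean distance and random initial centroids; it is not the implementation used in the experiments reported later in the book:

```python
import random

def kmeans(X, p, iters=100, seed=0):
    """Minimal sketch of steps 1-4 of the generic k-means procedure."""
    rng = random.Random(seed)
    centroids = rng.sample(X, p)                       # step 1: initial centroids
    clusters = [[] for _ in range(p)]
    for _ in range(iters):
        # step 2: assign each object to its closest centroid (squared Euclidean)
        clusters = [[] for _ in range(p)]
        for x in X:
            q_star = min(range(p),
                         key=lambda q: sum((a - b) ** 2
                                           for a, b in zip(x, centroids[q])))
            clusters[q_star].append(x)
        # step 3: new centroids as the means of the clusters just formed
        new = [tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[q]
               for q, c in enumerate(clusters)]
        # step 4: stop when the centroids no longer change
        if new == centroids:
            break
        centroids = new
    return clusters, centroids
```

Step 3 uses the cluster means as representatives; choosing the representatives from among the objects themselves would give the k-medoids variant mentioned above.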


The distance function used is the Euclidean metric squared, in order to preserve the properties associated with the choice of the cluster mean as the representative of the cluster. It is obvious that the above Q(P) is monotonic with respect to p, its minimum for consecutive p's decreasing with the increase of p, down to Q(P) = 0 for p = n. That is why the k-means-type algorithms are applied with the number of clusters, p, specified in advance.

In the light of the above it becomes clear that the parameters of the vector Z associated with the k-means algorithm are the very choice of the algorithm (k-means or one of its varieties, usually k-medoids as an alternative) and the number of clusters. Although the choice of the distance definition also influences the results obtained from the k-means algorithms, it is not treated here, as it is considered later on in this chapter.

The classical hierarchical merger algorithms. The second group of algorithms accounted for in the study of reverse clustering reported here is the group of the most classical clustering algorithms, consisting in stepwise mergers of objects and then clusters. These algorithms are all constructed as follows:

1. start from the set of objects, X, treating each object as a separate cluster (p = n); calculate the distances for all pairs of objects (indices) in I; these distances are, therefore, treated in this step as inter-cluster distances, Dqq';
2. find the minimum distance Dq*q** = minqq' Dqq'; join/merge the clusters, indexed by q* and q**, between which the distance is minimum, thereby forming a new partition, with p := p − 1;
3. check whether p > 1; if not, terminate the procedure (all objects have been merged into one all-embracing cluster);
4. recalculate the inter-cluster distances (i.e. the distances between the cluster resulting from the merger of Aq* and Aq** in the previous step, on the one hand, and all the other clusters on the other hand, the distance Dq*q** "disappearing"); go to 2.
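The merger steps above can likewise be sketched in a few lines (the code and all names in it are ours, for illustration only). The recalculation in step 4 is parameterised here by four coefficients a1, a2, b, c in the style of the Lance-Williams formula presented below; the defaults correspond to single linkage:

```python
def hierarchical_merge(D, n, a1=0.5, a2=0.5, b=0.0, c=-0.5):
    """Steps 1-4 of the progressive merger procedure. D maps frozenset({i, j})
    to the distance between the initial singleton clusters 0..n-1; each merged
    cluster gets a fresh id n, n+1, ... The default coefficients reproduce
    single linkage. Returns the merger history behind a dendrogram."""
    D = dict(D)                                   # step 1: pairwise distances
    active = set(range(n))
    next_id = n
    history = []
    while len(active) > 1:                        # step 3: until one cluster is left
        pair, dmin = min(D.items(), key=lambda kv: kv[1])
        q1, q2 = sorted(pair)                     # step 2: merge the closest pair
        history.append((q1, q2, dmin))
        active -= {q1, q2}
        for q in active:                          # step 4: recalculate distances
            d1 = D.pop(frozenset({q1, q}))
            d2 = D.pop(frozenset({q2, q}))
            D[frozenset({next_id, q})] = (a1 * d1 + a2 * d2
                                          + b * dmin + c * abs(d1 - d2))
        del D[pair]
        active.add(next_id)
        next_id += 1
    return history
```

Passing other coefficient values (see Table 1.1 below) turns the same loop into complete linkage, group average, and so on, which is precisely why these four values are natural elements of the vector Z.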

This, again, very simple procedure gives rise to a variety of concrete algorithms, which differ by the inter-cluster distance recalculation step 4. The algorithms from this group find their ancestor in the so-called "Wrocław taxonomy" by Florek et al. (1956), who were the first to formulate what is now called the "single linkage" algorithm, along with some of its more general properties. The essential step in the development of this family of algorithms came with the papers by Lance and Williams (1966, 1967). They introduced the general formula according to which the distance recalculation step is performed:

Dq*∪q**,q = a1 Dq*q + a2 Dq**q + b Dq*q** + c |Dq*q − Dq**q|,

where q* ∪ q** denotes the index of the cluster resulting from the merging of clusters q* and q**, with the values of the coefficients corresponding to the particular


Table 1.1 Values of the Lance-Williams coefficients for the most popular of the hierarchical aggregation clustering algorithms

Algorithm                               a1                 a2                 b                        c
Single linkage (nearest neighbor)       1/2                1/2                0                        −1/2
Complete linkage (farthest neighbor)    1/2                1/2                0                        1/2
Unweighted average (UPGMA)              nq*/(nq*+nq**)     nq**/(nq*+nq**)    0                        0
Weighted average (WPGMA)                1/2                1/2                0                        0
Centroid (UPGMC)                        nq*/(nq*+nq**)     nq**/(nq*+nq**)    −nq*nq**/(nq*+nq**)²     0
Median (WPGMC)                          1/2                1/2                −1/4                     0

implementations of the procedure, i.e. the particular progressive merger algorithms; the coefficient values for the most popular of these algorithms are shown in Table 1.1.

These algorithms have become quite commonly used because of their intuitive appeal and the fact that the consecutive mergers lead to a tree-like image (the dendrogram), which, accompanied by the values of distance at which the mergers occur, provides very valuable information. As in the case of k-means, a choice of these algorithms is available from multiple sources. Yet, the applicability of these algorithms is negatively affected by the fact that the entire distance matrix has to be kept, searched through and updated.

It must be added here that the algorithms of this group differ as to the shape of the clusters they can detect or form, a clear difference separating, in particular, single linkage from virtually all the other algorithms. Namely, single linkage has a tendency towards the formation of chains of points (objects), of whatever shapes and dimensions, while the remaining algorithms tend to form compact, usually spherical groups.

The obvious parameters of this group of algorithms in terms of the elements of vector Z are the above listed values of a1, a2, b and c. Thereby, no special distinction among the particular algorithms is necessary. However, it must be added that in many cases we allowed these coefficients to vary more freely than is envisaged by the Lance-Williams formula and the corresponding table of coefficient values (i.e. only with some constraints on the values of these coefficients), implying, potentially, algorithms that do not exist as of now.4

4 Actually, the Lance-Williams parameterisation was extended later on in order to encompass yet more similar algorithms, but this is of no interest for the main purpose of this book.

The density-based algorithms: DBSCAN. The local density-based algorithms form a much less compact and consistent group than the two previously considered types of algorithms. A more systematic approach to the density-based techniques was initiated by Raymond Tremolières (Tremolières


1979, 1981), but then they were virtually forgotten for a long time, mainly in view of computational issues. They regained popularity when, on the one hand, the requirement of single-pass analysis of data sets became important (even before the time of data stream analysis), in view of the volumes of available data, and, on the other hand, new kinds of density techniques, much more computationally effective than the earlier ones, were proposed (see, e.g., Yager and Filev 1994, or, more recently, Rodriguez and Laio 2014). These algorithms, in principle, analyse the interrelations, based on distances/proximities, of a limited number of objects. One of the most commonly used algorithms in this group is DBSCAN, due in its most popular form mainly to Ester et al. (1996), although it is claimed that already Ling (1972) proposed an algorithm very similar to DBSCAN. In this algorithm, the objects (points in Ex) are classified into three categories: core points (implying that they are "inside" clusters), density reachable points (which may form the "border" or the "edges" of clusters), and outliers or noise points. This classification is based on an essentially heuristic procedure, which refers to two parameters (these two parameters being, therefore, also elements of the vector Z in our approach), namely: the radius ε, within which we look for the "closest neighbours" of a given point, and the minimum number of points required to classify a given region in Ex as "dense", originally denoted minPts. Based on these two parameters the procedure classifies the objects into the three categories mentioned, and afterwards establishes the clusters on the basis of the notion of density connectedness. The algorithm is popular due to its fast performance and also owing to its independence of the shape of the clusters it identifies, or forms.
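For illustration only, this classification-and-expansion logic can be sketched as follows (our own minimal version; eps and min_pts stand for ε and minPts, with min_pts counting the point itself among its neighbours):

```python
def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: returns a cluster id per object, -1 for noise."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    n = len(X)
    # epsilon-neighbourhoods (each point counts itself as a neighbour)
    neigh = [[j for j in range(n) if dist(X[i], X[j]) <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None or len(neigh[i]) < min_pts:
            continue                       # visited already, or not a core point
        cluster += 1                       # start a new cluster around core point i
        labels[i] = cluster
        seeds = list(neigh[i])
        while seeds:                       # expand by density-connectedness
            j = seeds.pop()
            if labels[j] is None:
                labels[j] = cluster        # border or core point reached from i
                if len(neigh[j]) >= min_pts:
                    seeds.extend(neigh[j]) # j is itself core: keep expanding
    return [lab if lab is not None else -1 for lab in labels]
```

Points that are neither core nor reachable from a core point remain unlabelled and come out as noise (-1), mirroring the three categories described above.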
On the other hand, DBSCAN definitely strongly depends upon the choice of the two parameters and, although a similar criticism holds for, say, k-means and its parameter p, executing k-means for a (short) series of p's is not a problem and may circumvent the arbitrariness of the choice of p, while finding the right pair of ε and minPts is, in general, quite challenging.

The weighting or selection of the variables. In the search for the partition possibly most similar to the given PA, operations may also be performed on the set of variables accounted for. Thus, two alternative options can be applied: (i) weighting each of the variables, preferably on a scale between 0 (not considered at all) and 1 (considered as in the original data set), or (ii) the binary choice of variables, i.e. either considered or dropped (corresponding to the choice of weights from among 1 and 0).

It is definitely not typical for clustering to proceed explicitly with such operations on variables. Usually, such an operation is performed in the preprocessing phase, often even without explicit consideration of clustering as a possible next phase. Yet, in the framework of reverse clustering, in some cases this appears to be justified, especially as it may not be known where the partition PA comes from and what its relation is to the characterization of X.


Distance definitions. It is well known that some of the clustering procedures depend to an extent, sometimes considerable, on the distance definitions used. This is absolutely clear for the k-means family of algorithms, where the squared Euclidean distance is virtually a "must" for formal reasons, although in some variations of this algorithm this is no longer a strict requirement. Some implementations of specific algorithms (e.g. from the hierarchical aggregation family) also work differently depending upon the distance definition. The most important aspects in this regard are connected with the influence exerted by the objects located far away from the other ones, the impact of increasing dimensionality on the significance of distance, and the differences in densities in various regions of Ex. In view of this, it was assumed in the exercises in reverse clustering illustrated in this book that a flexible distance definition be adopted, namely the general Minkowski distance:

d(xi, xj) = (Σk |xik − xjk|^h)^(1/h),

where for h = 1 we get the Manhattan (city-block) metric, and for h = 2 the Euclidean metric. When h tends to infinity, the distance above approaches the Chebyshev metric, according to which, simply,

d(xi, xj) = maxk |xik − xjk|.

Again, as with the Lance-Williams parameters of the hierarchical aggregation algorithms, we allow for arbitrary (non-negative) values of h when trying to reconstruct the way PA has been obtained. Thereby, non-classical distance definitions could ultimately be used.

Summing up the set of parameters constituting the vector Z, let us enumerate them again:

1: the indicator of the choice of the clustering algorithm (k-means, hierarchical merger, or DBSCAN);
2 to 5: the parameters of the clustering algorithms (a maximum of 4 numbers, for the hierarchical merger algorithms);
6 to 6+m−1: the variables and their weights or binary indicators;
6+m: the exponent h.
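The Minkowski distance, together with the per-variable weights just listed among the elements of Z, can be written down directly. Folding the weights in multiplicatively, as in the sketch below, is one possible convention, not necessarily the one used in the reported experiments:

```python
def minkowski(x, y, h=2.0, weights=None):
    """Minkowski distance of exponent h, with optional variable weights
    (a weight of 0 drops a variable, 1 keeps it as in the original data)."""
    w = weights if weights is not None else [1.0] * len(x)
    return sum(wk * abs(xk - yk) ** h
               for wk, xk, yk in zip(w, x, y)) ** (1.0 / h)

def chebyshev(x, y):
    """The limiting case of the Minkowski distance as h tends to infinity."""
    return max(abs(xk - yk) for xk, yk in zip(x, y))
```

With h = 1 this gives the Manhattan metric and with h = 2 the Euclidean one, as in the text; non-integer h values are what makes the exponent a genuinely continuous search dimension of Z.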

1.4 The Criterion: Maximising the Similarity Between Partitions PA and PB

The search, realised in the space outlined in the previous section, is performed with respect to the fundamental criterion of the difference/affinity between the two partitions, i.e. the partition PA, which is given, and P, which is produced by the clustering procedure defined by Z, that is, the partition P = Z(X). Ultimately, for the assessment of the clustering results, the classical Rand index (see Rand 1971) was selected.5

5 Some more general remarks on this subject shall be forwarded in the next chapter, when discussing the broader background of the entire approach.

The Rand index measures the similarity of two partitions, P1 and P2, of a set of objects in the simplest and highly intuitive manner, based on the categorisation of pairs of objects illustrated in Table 1.2. Namely, we consider the two partitions, P1 and P2, and check, for each pair of objects from X (or I), whether the two objects are in the same cluster or in different clusters.

Table 1.2 Elements of calculation of the Rand index of similarity between partitions

Numbers of pairs of objects                   Partition P1
                                              In the same cluster    In different clusters
Partition P2    In the same cluster           a                      b
                In different clusters         c                      d

Of course, a + b + c + d = n(n−1)/2. We aim at a (objects in the same clusters in both partitions) and d (objects in different clusters in both partitions) being as high as possible, with b and c as small as possible, according to the formula

Q(P1, P2) = (a + d) / (a + b + c + d).

Thus, if the two partitions are identical, then Q(P1, P2) = 1, while Q(P1, P2) = 0 when they are "completely different" (actually, this occurs only in one very specific case: that of two partitions of which one is constituted by a single, all-embracing cluster, and the other is composed of all objects being separate singleton clusters). In view of the probabilistic properties of the Rand index (its expected value for two random partitions is not zero), often its adjusted version (see Hubert and Arabie 1985), denoted Qa(.,.), is used, accounting for the deviation of the mean from the actual expected chance value. This adjusted Rand index is defined as

Qa(P1, P2) = (a − Exp(a)) / (Max(a) − Exp(a)),

where Exp(a) is the expected value of a, while the introduction of Max(a) ensures that the maximum value of the respective measure is equal to 1. These two values can be calculated for two partitions, of which one consists of p1 clusters having, respectively, n11, n12, …, n1p1 elements (objects), while the other is composed of p2 clusters having, respectively, n21, n22, …, n2p2 elements, in the following manner:

Exp(a) = [Σq=1,…,p1 n1q(n1q − 1)/2] · [Σq=1,…,p2 n2q(n2q − 1)/2] / [n(n − 1)/2]

and

Max(a) = (1/2) · [Σq=1,…,p1 n1q(n1q − 1)/2 + Σq=1,…,p2 n2q(n2q − 1)/2].

Denœud and Guénoche (2006) suggested that for larger datasets this kind of adjustment increases the discriminatory power of the Rand index. Therefore, in some of the cases reported in this book, we use it as the similarity measure between partitions. Likewise, in some calculations, definite penalty terms were introduced, constraining the values of the elements of Z when the possibility arose of their uncontrolled growth. Generally, however, the original Rand index was kept as the main criterion of the search for PB, and it is virtually kept in all cases as the index of quality of the solution, if not the actual optimisation criterion (in some cases boiling down simply to the number of "wrongly classified" objects).

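Both indices follow directly from the pair counts of Table 1.2 and the formulas above; the sketch below (ours, with label vectors standing in for partitions) illustrates the calculation:

```python
from collections import Counter
from itertools import combinations

def rand_index(p1, p2):
    """Q(P1, P2): p1 and p2 give a cluster label for each of the n objects."""
    a = b = c = d = 0
    for i, j in combinations(range(len(p1)), 2):
        same1, same2 = p1[i] == p1[j], p2[i] == p2[j]
        if same1 and same2:
            a += 1          # in the same cluster in both partitions
        elif same1:
            c += 1          # together in P1 only
        elif same2:
            b += 1          # together in P2 only
        else:
            d += 1          # in different clusters in both partitions
    return (a + d) / (a + b + c + d)

def adjusted_rand(p1, p2):
    """Qa(P1, P2) = (a - Exp(a)) / (Max(a) - Exp(a))."""
    n = len(p1)
    a = sum(1 for i, j in combinations(range(n), 2)
            if p1[i] == p1[j] and p2[i] == p2[j])
    s1 = sum(v * (v - 1) / 2 for v in Counter(p1).values())
    s2 = sum(v * (v - 1) / 2 for v in Counter(p2).values())
    exp_a = s1 * s2 / (n * (n - 1) / 2)
    max_a = (s1 + s2) / 2
    return (a - exp_a) / (max_a - exp_a)
```

Note that Qa is undefined in the degenerate case Max(a) = Exp(a), e.g. for two identical one-cluster partitions, whereas the plain Rand index always lies in [0, 1].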
1.5 The Search Procedure Although this book is not devoted to the analysis of numerical and computational aspects of the reverse clustering approach—a definitely very important issue—in the framework of the presentation of the gist and the interpretations of the paradigm, we shall shortly characterise here the computational aspect, as well. Thus, in view of the expected very cumbersome landscape and highly complex choice conditions (“constraints”) it was decided to use the evolutionary algorithms as the search tools. In actual experiments two kinds of evolutionary algorithms were used (see also a slightly ampler description in Sect. 4.2 of the book). The first of them was developed by one of the authors of this book (see Sta´nczak 2003) and is characterised by the two-level adaptation, namely at the level of individuals, which is standard for the evolutionary algorithms, and also at the level of operators, which are used in a highly flexible manner with respect to different individuals, depending


upon the history of modifications concerning the given individual. The second evolutionary algorithm, tried out in some of the experiments, was the differential evolution method (see Storn and Price 1997), in the version available as an R package (Mullen et al. 2011; R Core Team 2014). In both of these evolutionary algorithms the individuals are coded in a relatively straightforward way, according to the parameters of the vector Z characterised before. The certainty of reaching the proper solution could not always be ensured, this fact being duly noted and commented upon in the reports on the particular experiments contained in the successive chapters of the book.
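To convey the flavour of the search, a deliberately simplified sketch is given below: a mutation-only evolutionary loop in which the vector Z is reduced to non-negative attribute weights, the clustering algorithm is a plain k-means, and the fitness is the Rand index against the given partition PA. All names here are ours, and neither the two-level algorithm of Stańczak (2003) nor the DEoptim implementation is reproduced:

```python
import random
from math import comb

def rand_index(pa, pb):
    """Fraction of object pairs on which the two partitions agree."""
    n = len(pa)
    agree = sum((pa[i] == pa[j]) == (pb[i] == pb[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / comb(n, 2)

def kmeans(points, k, iters=20, seed=0):
    """A bare-bones k-means returning a cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def reverse_cluster(X, pa, k, generations=30, pop_size=12, seed=1):
    """Evolve attribute weights Z so that k-means on the re-weighted data
    reproduces the given partition pa as closely as possible."""
    rng = random.Random(seed)
    d = len(X[0])
    def fitness(z):
        xz = [tuple(w * v for w, v in zip(z, p)) for p in X]
        return rand_index(pa, kmeans(xz, k))
    population = [[rng.random() for _ in range(d)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]          # truncation selection
        mutants = [[max(0.0, w + rng.gauss(0, 0.1)) for w in z]
                   for z in survivors]                  # Gaussian mutation
        population = survivors + mutants
    return max(population, key=fitness)
```

In the actual experiments Z also encodes the choice of algorithm, its parameters and, e.g., the Minkowski distance exponent; here only the attribute weights are evolved, for brevity.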

References

Adolfsson, A., Ackerman, M., Brownstein, N.C.: To cluster, or not to cluster: an analysis of clusterability methods. Pattern Recognit. 88, 13–26 (2019)
Ball, G., Hall, D.: ISODATA, a novel method of data analysis and pattern classification. Technical report NTIS AD 699616. Stanford Research Institute, Stanford, CA (1965)
Banks, D., McMorris, F., Arabie, Ph., Gaul, W. (eds.): Classification, Clustering, and Data Mining Applications. Proceedings of the Meeting of the International Federation of Classification Societies (IFCS 2004). Springer, Berlin (2004)
Banks, D., House, L., McMorris, F., Arabie, Ph., Gaul, W.: Classification, Clustering and Data Mining Applications. Springer, Berlin (2011)
Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York (1981)
Bezdek, J.C., Keller, J., Krisnapuram, R., Pal, N.R.: Fuzzy Models and Algorithms for Pattern Recognition and Image Processing. The Handbooks of Fuzzy Sets, vol. 4. Springer (1999)
Bramer, M.: Principles of Data Mining. Springer, New York (2007)
de Falguerolles, A.: Classification automatique: un critère et des algorithmes d'échange. In: Diday, E., Lechevallier, Y. (eds.) Classification automatique et perception par ordinateur. IRIA, Le Chesnay (1977)
Denœud, L., Guénoche, A.: Comparison of distance indices between partitions. In: Batagelj, V., Bock, H.H., Ferligoj, A., Žiberna, A. (eds.) Data Science and Classification. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 21–28. Springer, Berlin (2006)
Dunn, J.: Well separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95–104 (1974)
Dvoenko, S.: Meanless k-means as k-meanless clustering with the bi-partial approach. In: Proceedings of the 12th International Conference on Pattern Recognition and Image Processing, Minsk, Belarus, 28–30 May 2014, pp. 50–54. UIIP NASB (2014)
Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U.M. (eds.) Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226–231. AAAI Press (1996)
Figueiredo, M.A.T., Leitão, J.M.N., Jain, A.K.: On fitting mixture models. In: Hancock, E.R., Pelillo, M. (eds.) Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR 1999). Lecture Notes in Computer Science, vol. 1654. Springer, Berlin, Heidelberg (1999)
Florek, K., Łukaszewicz, J., Perkal, J., Steinhaus, H., Zubrzycki, S.: Taksonomia Wrocławska (The Wrocław taxonomy; in Polish). Przegląd Antropologiczny 17 (1956)
Forgy, E.W.: Cluster analysis of multivariate data: efficiency versus interpretability of classifications. In: Biometric Society Meeting, Riverside, California (1965). Abstract in Biometrics 21, 768 (1965)


Guha, S., Meyerson, A., Mishra, N., Motwani, R., O'Callaghan, L.: Clustering data streams: theory and practice. IEEE Trans. Knowl. Data Eng. 15(3), 515–528 (2003)
Hayashi, Ch., Yajima, K., Bock, H.H., Ohsumi, N., Tanaka, Y., Baba, Y. (eds.): Data Science, Classification, and Related Methods. Springer (1996)
Hubert, L., Arabie, Ph.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
Kłopotek, M.A.: An aposteriorical clusterability criterion for k-means++ and simplicity of clustering. SN Comput. Sci. 1(2), 80 (2020)
Kohonen, T.: Self-Organizing Maps. Springer, Berlin, Heidelberg (2001)
Lance, G.N., Williams, W.T.: A generalized sorting strategy for computer classifications. Nature 212, 218 (1966)
Lance, G.N., Williams, W.T.: A general theory of classificatory sorting strategies: 1. Hierarchical systems. Comput. J. 9(4), 373–380 (1967)
Lindsten, F., Ohlsson, H., Ljung, L.: Just relax and come clustering! A convexification of k-means clustering. Technical Report LiTH-ISY-R-2992, Automatic Control, Linköping University (2011)
Ling, R.F.: On the theory and construction of k-clusters. Comput. J. 15(4), 326–332 (1972). https://doi.org/10.1093/comjnl/15.4.326
Lloyd, S.P.: Least squares quantization in PCM. Bell Telephone Labs Memorandum, Murray Hill, NJ (1957); reprinted in IEEE Trans. Inf. Theory IT-28(2), 129–137 (1982)
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: LeCam, L.M., Neyman, J. (eds.) Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1965/66, vol. I, pp. 281–297. University of California Press, Berkeley (1967)
Mirkin, B.: Mathematical Classification and Clustering. Springer, Berlin (1996)
Mullen, K.M., Ardia, D., Gil, D.L., Windover, D., Cline, J.: DEoptim: an R package for global optimization by differential evolution. J. Stat. Softw. 40(6), 1–26 (2011). https://www.jstatsoft.org/v40/i06/
Owsiński, J.W.: Data Analysis in Bi-Partial Perspective: Clustering and Beyond. Springer (2020)
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2014). https://www.R-project.org/
Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)
Ríos, S.A., Velásquez, J.D.: Finding representative web pages based on a SOM and a reverse cluster analysis. Int. J. Artif. Intell. Tools 20(1), 93–118 (2011)
Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344(6191), 1492–1496 (2014)
Simovici, D.A., Hua, K.X.: Data ultrametricity and clusterability. CoRR abs/1908.10833 (2019)
Stańczak, J.: Biologically inspired methods for control of evolutionary algorithms. Control Cybern. 32(2), 411–433 (2003)
Steinhaus, H.: Sur la division des corps matériels en parties. Bulletin de l'Académie Polonaise des Sciences IV(C1.III), 801–804 (1956)
Storn, R., Price, K.: Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 11(4), 341–359 (1997)
Tremolières, R.: The percolation method for an efficient grouping of data. Pattern Recognit. 11 (1979)
Tremolières, R.: Introduction aux fonctions de densité d'inertie. WP, IAE, Université Aix-Marseille, p. 234 (1981)
Wierzchoń, S.T., Kłopotek, M.A.: Modern Algorithms of Cluster Analysis. Studies in Big Data, vol. 34. Springer (2018)
Yager, R.R., Filev, D.P.: Approximate clustering via the mountain method. IEEE Trans. Syst. Man Cybern. 24, 1279–1284 (1994)

Chapter 2

Reverse Clustering—The Essence and The Interpretations

2.1 The Background and the Broad Context

In general, the very essence of clustering is to group a set of data into subsets (groups) such that the elements assigned to a particular cluster are more similar to each other than to the elements assigned to different clusters. This simple, maybe even primitive, definition is very powerful, as it reflects an extremely wide class of problems we face both in everyday life and in virtually all of the more sophisticated acts we undertake. Cluster analysis is the field of knowledge, at the crossroads of applied mathematics and computer science, that is concerned with this class of problems.

The most straightforward and basic justification for carrying out a cluster analysis of a data set is to gain insight into the ("geometrical" or "model-wise") structure of that data set, primarily in terms of the possibility of dividing it into plausible subsets (including, potentially, singletons), or in terms of the very existence of such a division. If this is so, i.e. the division is sound and the subsets are well conditioned, one may go further in inferring the nature and meaning of the subsets, their origins, and the mechanisms of their appearance ("models", "processes",…); see, e.g., Kaufman and Rousseeuw (1990, 37–50), Gan et al. (2007, 6–10), or Xu and Wunsch (2009, 263 ff.).

The essentially primeval character of the task of clustering ("putting together the similar and distinguishing the dissimilar") clearly represents, as we have already mentioned, a multitude of various tasks and acts. This can be illustrated by the fact that it directly corresponds to the way in which human language has developed in various populations and cultures (with clusters corresponding to notions, words, and expressions).

In this context, related to the eternal human quest for attaining some best solution, or choosing a best, or at least good enough, option, it is clear that there is an optimization aspect to the clustering task (the effectiveness and efficiency of the division obtained, its "veracity" put apart, and then its capacity and suitability in practical use).

These obvious reasons have meant that cluster analysis has for decades been one of the most rudimentary data analysis techniques, but, at the same time,

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
J. W. Owsiński et al., Reverse Clustering, Studies in Computational Intelligence 957, https://doi.org/10.1007/978-3-030-69359-6_2


it is one of the most powerful tools of data analysis; see, for instance, Ducimetière (1970), Kaufman and Rousseeuw (1990), Arabie et al. (1996), Mirkin (1996), Bramer (2007), Böhm et al. (2006), Miyamoto et al. (2008). It is most often used when no extra information is available and, thus, there is no clue as to the structure of the data set at hand.

In order to actually execute a cluster analysis exercise, one has to make a number of choices. First of all, the data set must be characterized in terms of attributes and their values for particular data points,[1] which may be task dependent. Then, a clustering technique, some measures of distance/similarity, and other parameters, either related to the chosen clustering technique or of a more general character, have to be assumed. The choice of parameters may be guided by the experience of the user, the availability of data (values of the attributes to be used), some available metadata, or it may be based on the results of some preliminary data analysis; see, e.g., Torra et al. (2011). Having made these initial choices, one can run the selected clustering algorithm and obtain some groups of data points, i.e., a partition of the data set under consideration. Usually, such an exercise is repeated several times for different configurations of the above-mentioned parameters in order to find an "optimal"/"correct" partition. The whole process may be seen as a kind of transformation, which turns a data set of individual data points into groups of such data points.

In this book we consider a problem which, in one of its potential interpretations, may be treated as a kind of reverse engineering, applied to the (hypothetical) results of the previously described (potential) clustering process. As we have already mentioned in Chap. 1, the very essence of reverse engineering, sometimes also termed back engineering, boils down to a procedure that aims at finding out, for some object or process, what the underlying design, architecture, or implementation procedures or processes were that resulted in the formation of the object in question. In other words, reverse engineering is about the deduction of the design features and goals of the object in question without a deep or additional knowledge of which procedures and/or processes have implied those features; see Eilam (2005), Chikofsky and Cross (1990), Raja and Fernandes (2008).

As a side remark, let us notice that some work proposing the use of clustering for solving various reverse engineering problems has already been done; see, for instance, Govin et al. (2016), Kuhn et al. (2005), Raffo (2019), Raffo et al. (2020), Laube (2020), Quigley et al. (2000), Shim et al. (2020), or Travkin et al. (2011). Yet, in spite of the very general similarity of the domains involving both reverse engineering and clustering, these studies are, in fact, oriented at different, quite specific objectives than what we propose in the present volume.

The general philosophy of reverse engineering is in a certain manner employed for the clustering problem considered in this book. Namely, we assume that a partition

[1] It is possible to start with a data similarity/distance matrix, if available, without an explicit characterization of the data in terms of some attribute values, and such a setting also seems to provide a reasonable context for the paradigm proposed here, but we will leave this case for a possible further study.


of the data set is given, and we want to discover the parameters of the process (transformation) that (may) have resulted in the given partition. Ultimately, though, we may have to settle, as is usually the case in analyses of complex problems, for an approximation, concerning both the shape of the division and the parameters of the clustering procedure (as we may not be able to obtain exactly the division given together with the data set).[2]

[2] Actually, a perfect reconstruction of the given partition PA may be quite possible (as also illustrated by some of the cases in this book), in the sense that PB = PA, but this is by no means equivalent to finding the rationale behind this initial partition.

It is very common to consider data sets that are divided into subsets (clusters, classes, groups,…) in a definite manner, which is more or less "certain" or "justified", and to attempt the reconstruction of the given division using some "other" data than just the respective group labels. This is most often done in order to validate or check the quality of classification, clustering, machine learning, etc. schemes. More advanced purposes in this respect may involve model building and checking, as well as, quite to the contrary, a verification of the original, prior division of the data set. There may also be other objectives, some of them quite classical, like the detection of outliers, and also very specific ones, like the assessment of the adequacy of the classification data to the labels, or forming descriptions of known groups of data in terms of the values of some of the attributes characterizing them.

It is easy to notice that the results of the above reverse engineering type clustering process can provide the analyst and user with very useful information. In general, for the information obtained from a data analysis study to be implementable and usable in practice, that is, also for novice users and domain specialists, very often with a limited command of mathematics, numerical algorithms, data analysis, clustering, etc., who are now presumably the largest target group of users in virtually all real-world applications, some obvious requirements and limitations should be observed. These requirements and limitations concern both the procedures and ways of attaining results and reaching conclusions, and the form of providing data and information and obtaining the results. Basically, these should respect the limited cognitive capabilities of the human being (see Rashidi et al. 2011; Perera et al. 2014; Tsai et al. 2014; Tervonen et al. 2014; these references address the topic in the very relevant and modern context of the Internet of Things). For our purposes, accounting for the limited and specific human cognitive capabilities boils down to the necessity of involving:

• a broadly perceived granulation of data and information (see Bargiela and Pedrycz 2002; Pedrycz 2013a, b; Pedrycz et al. 2008; Lin et al. 2002);
• a summarisation, of both numerical and linguistic character (see Kacprzyk et al. 2000; Kacprzyk and Yager 2001; Kacprzyk and Zadrożny 2005, 2009, 2010, 2016; Kacprzyk et al. 2008, 2010; Reiter and Dale 2000; Reiter et al. 2006, as well as Sripada et al. 2003; Yu et al. 2007).

In this context, the problem of the comprehensibility of data analysis, data mining, machine learning, etc. results (patterns) has been known for some time, and it was presumably Michalski who, already in the early 1980s, devised the so-called postulate of comprehensibility, whose essence can be summarized as follows (Michalski 1983): "…The results of computer induction should be symbolic descriptions of given entities, semantically and structurally similar to those a human expert might produce observing the same entities. Components of these descriptions should be comprehensible as single 'chunks' of information, directly interpretable in natural language, and should relate quantitative and qualitative concepts in an integrated fashion…".

Michalski's vision has had a great impact on machine learning, data mining, data analysis, etc. research, and has been further developed by many authors; see, e.g., Craven and Shavlik (1995), Zhou (2005), Pryke and Beale (2004), Fisch et al. (2011), to name just a few. A recent study on the comprehensibility of linguistic summaries by Kacprzyk and Zadrożny (2013) combines many ideas from those works and recasts them in the context of a very natural representation of results via natural language.

Most of the above-mentioned works on the comprehensibility of data analysis/mining results emphasize, as the main reasons for the importance of comprehensibility (see, for instance, Kacprzyk and Zadrożny 2013), in particular, the following:

1. To be confident in the performance and usefulness of the algorithms, and to be willing to use them, the users have to understand how the result is obtained and what it means;
2. The results obtained should be novel and unexpected, in one sense or another, and these results can only be accessible to the human if they are understandable;
3. Usually, the results obtained may imply some action to be taken, and hence their comprehensibility is clearly crucial;
4. The results obtained may provide much insight into a potentially better feature representation which, to be meaningful, should be comprehensible; and
5. The results obtained can be employed for refining knowledge about the domain or field in question, and the more comprehensible, the better.

The postulate of comprehensibility has recently been strongly emphasized in the framework of the tools and techniques of artificial intelligence. Namely, there is much evidence that, for a wide use of powerful artificial intelligence tools and techniques, the essence of which is that they provide solutions to complex problems, their use and the use of the solutions obtained are essentially subject to human acceptance. However, for human beings the goodness of the results obtained is, in itself, as a rule not enough: they want to understand how and why these results have been obtained, why they can be trusted, and then accepted and implemented. Unfortunately, most powerful artificial intelligence tools and techniques do not have such capabilities. For instance, deep neural networks and almost all data analysis, data mining, machine learning, etc. algorithms show very good and increasing performance, but are generally not "transparent" to the analysts or users with respect to how and why they yield results. This dangerous phenomenon can have a detrimental effect on the use of those powerful, effective and efficient tools and techniques, and hence on the proliferation of


artificial intelligence. As a way out, many ideas have been proposed, and among them the recently highly advocated concept of so-called "Explainable AI" has enjoyed extremely high popularity. The motivation behind Explainable AI (XAI) is simple: it concerns methods and techniques, to be used in the broadly perceived area of artificial intelligence technology, which would yield results that can be understood by the human being. Notice that it deeply departs from the concept of the "black box" in, for instance, machine learning, where even the developers and analysts cannot really explain why a particular tool or technique arrived at a specific decision. Needless to say, a lack of such knowledge can prohibit people from applying such a tool.

The concept of XAI was quickly considered of utmost importance, and, first of all, some manifesto-type statements have been issued, for instance by military agencies and institutions, as exemplified by DARPA (Defense Advanced Research Projects Agency), which has explicitly stated that "…Explainable AI—especially explainable machine learning—will be essential if future warfighters are to understand, appropriately trust, and effectively manage an emerging generation of artificially intelligent machine partners…" (Gunning and Aha 2019). Many non-military think tanks and policy-making institutions have issued similar statements. These statements have been strongly supported by an extremely active and intensive research effort all over the world, at the majority of research, R&D and scholarly institutions, and among hundreds of relevant publications one can quote the papers and books, including some state-of-the-art and position papers, by Adadi and Berrada (2018), Doran et al. (2017), Hall and Gill (2018), Miller (2017), Molnar (2020), Murdoch et al. (2019), Ribeiro et al. (2016), Rudin (2019), Zhang and Chen (2018), or Biecek and Burzykowski (2020), to quote just a few. One of the studies devoted to the connection between interpretability and the summarization of data and results of analysis is that by Lesot et al. (2016).

Essentially, virtually all of the authors mentioned above stress that the results obtained using modern AI tools should be explainable/interpretable/transparent, which goes hand in hand with the notion of comprehensibility. Thus, our approach may be seen as belonging to this line of reasoning, even if "Explainable AI" is mostly focused on, and concerned more explicitly with, black box type, notably deep learning, methodology. (Actually, one of the interpretations, of which we speak in Sect. 2.3 further on in this chapter, is exactly that the given partition PA is a kind of black box product, of which we know very little. This is also why the reverse engineering interpretation is quite to the point here.) It is worth clearly stating, however, that the AI community, from the very beginning, in the framework of the traditional "symbolic AI", inherently much more amenable to such comprehensibility, was deeply concerned with providing the users of AI tools and techniques with means to convince them of the correctness of the results provided, securing their high interpretability and transparent provenance. This may be exemplified by the concept and actual implementations of an explanation module in an expert system (see, for instance, Castillo and Alvarez 1991; Baker and O'Conor 1989; Berka et al. 2004, as well as, e.g., http://www.esbuilder/com/).


It is quite clear, and will also be seen further in greater detail, that the reverse engineering type approach to clustering can be viewed as following, at the conceptual level, the above-mentioned philosophy of attaining comprehensibility. First, the clustering process itself implies an increase in the comprehensibility of the data, as it produces per se representations of results that are closer to human perception. Then, and this is the essence of the approach proposed in this volume, we go further and try to find those parameters of the clustering algorithm (potentially) employed that have led to the results obtained. That is, extremely important additional knowledge is derived via our approach about the algorithms, parameters, types of distance/similarity functions, etc. All this is clearly useful and can greatly help a potential user in understanding (comprehending) the intrinsic relations between many aspects of the data set analysed and, more generally, the respective analysis process. As a result, the possibility of acceptance, and hence implementation, of the results obtained can be greatly enhanced.

Notwithstanding the enhanced comprehension of the data themselves and of the essence of the process, the approach can be useful for quite pragmatic purposes in a variety of manners. This fact is amply illustrated in the successive chapters of the book, in which a variety of examples is treated, for diverse data sets and concrete problem formulations. It is exactly with respect to the latter, i.e. the various problem formulations, that the reverse clustering paradigm offers quite broad possibilities. This is closely associated with the interpretation of the generic problem and hence of the results obtained. The next section is devoted exactly to an ampler discussion of this issue.

2.2 Some More Specific Related Work

The idea of a reverse engineering approach relative to clustering as presented in Chap. 1 of the book is obviously novel. However, the procedure formally presented here may be seen as referring to many facets which are individually addressed in the literature by multiple authors in various contexts and settings. We shall start with some approaches or problem formulations which reveal an overall pattern apparently similar to the reverse clustering considered here. Then, we shall pass to the more specific questions which are somehow addressed in the reverse clustering paradigm, but very often appear as separate issues, important for data analysis procedures and hence treated through appropriate techniques.

Among the few significant references which can be indicated as proposing an overall perspective similar to the approach presented here, we should mention the LCE (Learning from Cluster Examples) of Kamishima and associates (see Kamishima et al. 1995, and, first of all, Kamishima and Motoyoshi 2003). The LCE approach starts with a number, M, of initial partitions, PA1,…,PAM, of the object sets X1,…,XM, and, having these partitions of sets of "analogous" objects (i.e. objects situated in the same kind of space EX), proposes an approach to derive a procedure for partitioning other sets of "analogous" objects, Y. Thus, the obvious similarity with the


reverse clustering paradigm lies in having some initial partition (in LCE: a number of partitions) and intending to use the knowledge therefrom to partition some other data sets.[3] Similarly to reverse clustering, this problem and approach cannot be considered to constitute, nor provide, a classifier in the standard sense. Yet, there are essential differences with respect to the reverse clustering paradigm, namely:

(1) the assumption of M > 1 is fundamental;
(2) the partitions PA1,…,PAM are all considered to be "objectively certain";
(3) the procedure does, in fact, not involve clustering as such: based primarily on probabilistic precepts, related to the co-appearance of objects at definite locations in space in the same clusters, it assigns the new objects to the same or different groups.

The procedure itself is quite complicated and, even though it refers to the probabilistic framework, involves quite important arbitrary assumptions, meant largely to tone down the overall complexity of the procedure.

There is also another domain, which has been very dynamically explored and developed over the last two decades, in a certain manner associated with the LCE, but of a definitely broader character, namely that of "domain adaptation" and "transfer learning". In short, the issue consists in devising the principles and methods of using knowledge (in the form of rules, clusters, models, etc.) acquired on the basis of a certain data set in a situation when one deals with data whose characteristics have either somehow changed or are known to be different to an extent. This concerns primarily the suspicion ("hypothesis" or "assumption") that the distribution behind the data analysed is different (but not "totally different") from the one on which our knowledge is based. While the problem is obvious and very general, indeed, its actual significance stems from some very specific application areas (again, as in the case of the LCE, image processing, but also, to a large extent, document analysis and information retrieval, associated with sentiment analysis). A very good survey of this domain is provided by Kouw and Loog (2019) (evidence that research in this field indeed has quite some history is also provided by the valuable surveys of Pan and Yang 2010, or Gopalan et al. 2012). The essential difference with respect to the reverse clustering paradigm consists in the fact that the investigation domains mentioned above stem from a definite problem, which is being solved in a variety of manners, depending upon the concrete circumstances (e.g. assumptions as to the probabilistic or statistical nature of the problem). In this context, kernel-based methodologies are very popular, but they also do not provide sufficient coverage for many of the more specialised problems in domain adaptation. Further, similarly as in the LCE, in the language of clustering it is assumed that: (a) PA is definitely based on the respective X, and (b) it is "correct" (or we dispose of a concrete measure of its "correctness"). Under these assumptions we wish to obtain a new partition, PB, that would also be (similarly) "correct" for a data set Y that is somewhat different from X. Thus, obviously, "reverse clustering" might find some application in the areas of domain adaptation and transfer learning, under certain (mild) conditions and for a definite class of

[3] A typical, and, indeed, very adequate application consists in the pattern segmentation partitions PA1,…,PAM being provided by humans, and the automatic generation of pattern segmentations for other patterns.


problems, but, in fact, we deal here not only with a different kind of general task, but also with a different methodological perspective.

Finally, as characterised here, the latter research area borders upon that of "incremental" or "adaptive" clustering or learning, which, in the case of clustering, refers to a situation in which a partition PA, obtained on the basis of some data set X = {x1,…,xn}, has to be adapted to a change in this data set, resulting in some X'. The difference with respect to domain adaptation/transfer learning is that we do not assume any essential change in the "nature" of X when it turns into X', but rather a kind of "parametric shift", or, as it is often referred to, a "drift". A natural choice of technique in this case is the k-means family of clustering algorithms; see, e.g., the seminal paper of Pedrycz and Waletzky (1997) (although on a more general level one might wish to consult some of the earlier works, e.g. that of Fisher 1987), this approach still being applied under the conditions of massive data flows, as, e.g., in Casalino et al. (2019). The closeness of incremental or adaptive clustering to domain adaptation and transfer learning is perhaps best expressed through the manner in which X' may differ from X. In the simplest case, the difference boils down to the addition of some new observations, so that, in fact, X' is just a superset of X, X ⊆ X', ultimately just the addition of a single observation (xn+1). This, indeed, is the "classification" situation, for which the k-means methodology appears to be very well fitted: starting from partition PA for X, we run the procedure for X' and easily find the solution assigning xn+1 to the cluster with the closest/most similar centroid (if not simply assigning xn+1 to it). In many cases, though, X' simply "preserves the character" of X, but, generally, contains other observations (e.g. X∩X' = ∅).
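The single-observation case just described, in which the new point xn+1 is simply attached to the cluster of PA with the nearest centroid, can be sketched as follows (illustrative names of our own, Euclidean distance assumed):

```python
from math import dist  # Euclidean distance, Python 3.8+

def assign_to_nearest(x_new, centroids):
    """Incremental 'classification' step: place a new observation into the
    existing partition by choosing the cluster with the closest centroid."""
    return min(range(len(centroids)), key=lambda c: dist(x_new, centroids[c]))

# Centroids of a partition PA obtained earlier, e.g. by k-means on X:
centroids = [(0.0, 0.0), (5.0, 5.0)]
assign_to_nearest((4.2, 4.9), centroids)  # -> 1, i.e. the cluster at (5, 5)
```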
Here, simply starting from PA may not be sufficient, even though k-means are still applicable. Going farther towards domain adaptation and transfer learning we encounter the cases in which, e.g., the very description of the observations (the set of variables/attributes) changes. Such divergences may go in various directions. Examples of studies devoted to such diverse cases of adaptive clustering are Bagherjeiran et al. (2005); Câmpan and Şerban (2006); Ntoutsi et al. (2009); Rokach et al. (2009); or Shi et al. (2018).

Yet another domain worth noting, which has a definite technical association with the reverse clustering paradigm, and, indeed, with the domains mentioned before, is learning with partial supervision, to an important degree linked, in particular, with incremental learning, and hence also with some of the problem formulations and methodologies referred to above (see, e.g., Bouchachia and Pedrycz 2006). With respect to the areas commented upon here, it can be stated that although reverse clustering may clearly be used as a tool in this kind of problem, and that in a variety of manners, its formulation and procedure are fundamentally different.

Turning now to the more specific questions, let us note that the choice of the attributes has been thoroughly studied, in particular, in the context of classification (Hastie et al. 2009), but also in the area of more broadly meant data analysis. Many different approaches have been proposed which are applicable for our purposes. Some of them take into account the information on the classes to which elements of X are assigned, some do not. In our case, both modes are possible, as we start with a

2.2 Some More Specific Related Work


partition, which may be interpreted as the classification. The choice of an appropriate family of techniques may be based on the aspects discussed at length later on in this chapter. Namely, if the partition PA is to be seen as the valid one, then taking into account the information on class assignments is more justified than in the other cases. In our experiments, reported in the consecutive chapters, we associate weights with the attributes, and the values of the weights are optimized during the evolutionary procedure. This may effectively lead to completely ignoring some of the attributes characterizing the data set X. Notice that the choice of attributes has also been discussed in the literature on the comprehensibility of the results of data analysis, data mining, machine learning, etc., and Craven and Shavlik (1995) may be a good source of information here.

Another important decision concerns the choice of the distances/similarities from among the plethora of those proposed in the literature (Cross and Sudkamp 2002). This choice has, of course, to take into account the scale with which a given attribute is endowed, i.e., nominal, ordinal, interval or ratio. For the latter type of attributes it may be convenient to assume a parametrized family of distances, e.g., the Minkowski distance, which makes the representation simpler for the purposes of an evolutionary optimization, and this is actually done in virtually all of the experiments reported in the book. One can even go further, using a fuller survey of (binary, in this case) similarity/dissimilarity measures, like the one presented in Choi et al. (2010), where those measures are categorised into classes, and a similar reverse type of analysis is performed. This will not, however, be considered in this book and will be left as a potential direction of further studies.

The essence of the problem of reverse clustering as meant in this book is the formulation and solution of the optimization problem described in Chap. 1. Its important component is the performance criterion, denoted Q, which is identified here with a measure of the fit between two partitions. In particular, we shall, as a rule, interpret Q in such a way that it should measure how well the partition PB, produced using Z*, matches the originally given partition PA. We have already indicated that in our experiments we refer, in general, to the Rand index. Let us, however, mention that such measures belong to a broader, and very deeply discussed, family of cluster validity measures, which are meant to evaluate the quality of the partition produced by a clustering algorithm; see, e.g., Wagner and Wagner (2006); Desgraupes (2013); Vendramin et al. (2010); Rendón et al. (2011); Halkidi et al. (2001); or Arbelaitz et al. (2013).

According to Brun et al. (2007), three broad classes of such measures may be distinguished, which do not necessarily refer to a golden standard (or otherwise a reference) partition, in our case denoted PA. The first class comprises the internal measures, which are based on the postulated properties of the clusters produced (such as, for instance, the classical Calinski-Harabasz index, see Calinski and Harabasz 1974; for a more general treatment see Liu et al. 2010). This class is often treated as the one of the "proper" cluster validity measures, and quite a lot of attention is devoted to this class in the literature, based on various prerequisites (see, e.g., Zhao et al. 2009;



Zhao and Fränti 2014; Xie and Beni 1991; Van Craenendonck and Blockeel 2015; or Meila 2005).4 The second class, of the relative measures, "is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data" (Brun et al. 2007). And finally, the third class comprises the external measures, referring to the comparison of the partitions produced by the clustering algorithms and a partition known a priori, usually assumed to be a valid one.

As our primary goal is the reconstruction of a cluster operator which could have produced a given partition PA for a given data set X, we are first of all interested in the usage of the external validity measures. However, it should be stressed that in the different scenarios, discussed later on in this chapter, other types of measures could also be of use. In particular, if our belief in the validity of a given partition PA is not full, then we can define the quality criterion as a combination of an external and an internal one, for instance, favoring a PB which provides a balance between matching PA and having a high quality (e.g., internal consistency) in terms of one or more internal measures.

Another parameter of the clustering procedure whose choice has attracted a lot of attention in the literature is the fixed number of clusters assumed, e.g., for the k-means family of clustering algorithms (actually, Bock 1994, proposed the issue of the number of clusters as one of the essential ones to be resolved in the domain). The choice of the value of this parameter evidently has a far-reaching influence on the obtained partition, while it may seem rather arbitrary. Thus, a number of approaches have been proposed, usually based on the earlier mentioned validity measures.
Namely, the optimal number of clusters is recognized as the one for which a given validity measure attains its extremum or satisfies some specific formula (see Milligan and Cooper 1985; Libert 1986; Sugar and James 2003; Wagner et al. 2005, or, for the more recent publications, Charrad et al. 2014, and Patil and Baidari 2019).5 In our general approach, the actual number of clusters present in the partition PA is a natural candidate for the value of this parameter. However, such a choice may be questioned when taking into account the various assumptions as to PA, or considering the reverse engineering of PA to obtain Z* as a first step towards partitioning other, possibly much larger, datasets using Z*.6

4 It should be noted that the so-called bi-partial approach, developed by one of the present authors (see, e.g., Owsiński 2020), allows for obtaining a natural solution to the clustering problem in general, without the need of referring to any (additional) clustering quality measure. It also provides the answer to the problem of the cluster number, discussed further on, without solving it explicitly, simply as a part of the global clustering solution.
5 There is, of course, the other side to the cluster number issue, namely that of the basic question "what is a cluster?". If we knew the answer to this question, we would not have to determine the cluster number at all; see, e.g., Davies and Bouldin (1979), Chiu (1994) or Hennig (2015). For the number of partitions of a set, see Rota (1964).
6 Here, another related problem arises, namely that of the preservation of validity and stability of the structure obtained for a smaller set X, including the number of clusters, when applied to a much bigger one. This issue was perceived a long time ago, as witnessed, e.g., by the work of Guadagnoli and Velicer (1988).
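The selection rule described above, taking as the number of clusters the value for which a validity index attains its extremum, can be sketched as follows. The toy one-dimensional data, the bare-bones k-means and the hand-coded Calinski-Harabasz index are illustrative simplifications, not the book's actual procedure:

```python
# Sketch: choosing k as the argmax of a validity index (Calinski-Harabasz).
# Data, initialization and clustering are deliberately minimal.

def kmeans_1d(xs, k, iters=20):
    xs = sorted(xs)
    # deterministic init: centroids at evenly spaced positions in the sorted data
    cents = [xs[round(i * (len(xs) - 1) / (k - 1))] for i in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(x - cents[j])) for x in xs]
        for j in range(k):
            members = [x for x, l in zip(xs, labels) if l == j]
            if members:
                cents[j] = sum(members) / len(members)
    return labels, xs

def calinski_harabasz(xs, labels, k):
    # (between-group SS / (k-1)) / (within-group SS / (n-k))
    n = len(xs)
    grand = sum(xs) / n
    w = b = 0.0
    for j in range(k):
        members = [x for x, l in zip(xs, labels) if l == j]
        if not members:
            continue
        m = sum(members) / len(members)
        w += sum((x - m) ** 2 for x in members)
        b += len(members) * (m - grand) ** 2
    return (b / (k - 1)) / (w / (n - k)) if w > 0 else float("inf")

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]  # three clear groups
scores = {}
for k in range(2, 6):
    labels, xs = kmeans_1d(data, k)
    scores[k] = calinski_harabasz(xs, labels, k)
best_k = max(scores, key=scores.get)
print(best_k)  # the index peaks at the "true" number of groups
```

On such well-separated data the index is maximal at k = 3; on real data the curve is rarely this clear-cut, which is precisely why whole packages (see NbClust below) combine many indices.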



An example of a software package which combines all the above mentioned main aspects and, thus, is very relevant for our idea of the reverse engineering type of clustering, is the NbClust package (Charrad et al. 2014), available on the R platform. It is primarily oriented towards supporting the choice of the number of clusters. However, this package actually implements a number of:

• clustering algorithms,
• cluster validity indexes (measures), and
• distance measures (dissimilarity measures)

and makes it possible to use them in various configurations, together with a varying number of clusters, where appropriate, to search for the "best" one for a given data set. A configuration is pointed out as the best when the majority of the validity measures involved confirm its superiority. Thus, the NbClust package may be seen as a valuable and extremely relevant tool to carry out the endeavor laid out in this book. However, our proposal provides a broader framework for the emerging type of data analysis and makes it possible to envision some interesting directions for further research. Moreover, it adds to the analysis the important aspect of the comprehensibility of the results obtained.
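The external criterion Q referred to throughout, the Rand index, counts the object pairs on which two partitions agree. A direct, unoptimized implementation, with illustrative label vectors:

```python
# Sketch of the external criterion Q: the Rand index between two
# partitions, given as label lists over the same data set.

from itertools import combinations

def rand_index(pa, pb):
    # fraction of object pairs on which the two partitions agree
    # (both put the pair together, or both keep it apart)
    agree = sum(
        (pa[i] == pa[j]) == (pb[i] == pb[j])
        for i, j in combinations(range(len(pa)), 2)
    )
    return agree / (len(pa) * (len(pa) - 1) / 2)

pa = [0, 0, 0, 1, 1, 2]  # prior partition PA
pb = [1, 1, 0, 0, 0, 2]  # partition PB produced by some candidate Z
print(rand_index(pa, pb))  # 11 of the 15 pairs agree here
```

Note that only the grouping matters, not the cluster labels themselves, so PB may have a different number of clusters than PA and still be compared.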

2.3 The Interpretations

As this was already mentioned in the preceding section, an essential question, which arises in connection with the new perspective or problem formulation here forwarded, is its interpretation, closely linked with the potential use of its results. We shall show now that there exists quite a variety of ways in which this formulation can be treated and used.

[Fig. 2.1 The scheme of the reverse clustering problem formulation. Elements: the data set analysed, X; the clustering algorithms and the data processing parameters, Ω = {Zi}, from which a candidate Z is drawn; the prior partition of X, i.e. PA; the resulting partition of X, PB; the criterion Q(PA, PB), i.e. the similarity of the two partitions; and the search (optimisation) procedure maximising Q(PA, PB), which either loops back with a new Z (No) or stops (Yes), returning Z*.]

Figure 2.1 puts together in a visual manner the essential components of the

paradigm and their interrelations. At this point of our considerations we would like to indicate two aspects which are of importance for the content of the present section:

1. The initial data, X and PA, are not connected in any way in the diagram: this emphasizes the possible a priori lack of knowledge of any association between the two;
2. The ultimate result, Z*, depends upon the definition of the search space, Ω (with a particular role of the specific clustering algorithms), and upon the characteristics of the optimization procedure.
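The interplay of these components can be illustrated by a deliberately simplified sketch of the loop of Fig. 2.1: here the search space Ω is reduced to a small grid over the number of clusters and one attribute weight, a bare-bones k-means with a weighted Euclidean (Minkowski, p = 2) distance stands in for the clustering algorithms, and exhaustive search replaces the evolutionary procedure used in the book; the data, PA and all parameter values are purely illustrative:

```python
# A much simplified sketch of the reverse clustering loop: try candidate
# vectors Z = (k, attribute weights), cluster X with each, and keep the Z
# maximising the Rand index Q(PA, PB). Exhaustive search stands in for
# the evolutionary procedure; all names and values are illustrative.

from itertools import combinations, product

def wdist2(a, b, w):
    # squared weighted Euclidean distance (Minkowski with p = 2)
    return sum(wi * (ai - bi) ** 2 for ai, bi, wi in zip(a, b, w))

def kmeans(X, k, w, iters=20):
    # deterministic init: centroids at evenly spaced positions in the list
    cents = [X[round(i * (len(X) - 1) / (k - 1))] for i in range(k)]
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: wdist2(x, cents[j], w)) for x in X]
        for j in range(k):
            members = [x for x, l in zip(X, labels) if l == j]
            if members:
                cents[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def rand_index(pa, pb):
    pairs = list(combinations(range(len(pa)), 2))
    agree = sum((pa[i] == pa[j]) == (pb[i] == pb[j]) for i, j in pairs)
    return agree / len(pairs)

# attribute 0 carries the structure behind PA; attribute 1 is noise
X = [(0.0, 9.0), (0.2, 0.0), (0.1, 5.0), (5.0, 9.1), (5.2, 0.2), (5.1, 5.1)]
PA = [0, 0, 0, 1, 1, 1]

best = None
for k, w1 in product([2, 3], [0.0, 0.5, 1.0]):
    pb = kmeans(X, k, (1.0, w1))
    q = rand_index(PA, pb)
    if best is None or q > best[0]:
        best = (q, k, w1)
print(best)  # the best Z effectively discards the noisy attribute
```

Even this toy run shows the characteristic behaviour: the optimization drives the weight of the uninformative attribute towards zero, which is exactly how, in the book's experiments, some attributes end up being effectively ignored.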

Relation to classification

Thus, in a way, the formulation forwarded is reminiscent of identifying the best classifier, expressed in this case through Z*. Definitely, the setting described may be perceived as typical for supervised learning, in that some a priori known grouping PA constitutes the starting point (known labels of particular objects). By identifying Z* we seem to obtain the best possible tool, in the class defined by Ω, for classifying the instances belonging to X and, possibly, also other instances later on. Still, the differences with respect to the standard scheme of obtaining and using classifiers are quite evident, as commented upon below.

Thus, first, for quite obvious reasons, we are not interested here in devising a scheme for classifying further, sequentially incoming data points, even though Z* may certainly be interpreted as the essential part of some classification scheme. Not being, in principle, meant for this purpose, it can, of course, ultimately be used with this aim. Under this kind of circumstances, classification could be carried out for the subsequent observations xi, i = n + 1, …, in such a manner that the partition PB′ = Z*(X ∪ {xi}) would be computed, resulting in the placement of xi in one of the clusters of PB′ (not necessarily identical with those of PB = Z*(X)).

Further, it must be emphasized that the partition PB, generated by Z*, would be, in general, different from PA. This difference does not apply just to the content of clusters ("classification errors"), but may also concern the essential features of the partitions: first, the number of clusters, and second, the potential indication of outliers. Regarding the functioning of the paradigm in the classification mode, the above two distinctions lead to the conclusion that it is, in fact, not meant for it. This will be especially true when we consider batch classification, i.e., when {xi} is replaced with a whole new set of data points to be classified simultaneously.
Yet, such a classification procedure would usually be prohibitively expensive. The excessive cost may be avoided, for some special situations only, through the use of the methods called by Miyamoto et al. (2008), Chap. 6, the inductive clustering techniques. Namely, if the Z* obtained for PA refers to an inductive clustering technique (e.g., belonging to the k-means family of algorithms), then new data points can be directly classified to one of the clusters of PA (using, in this case, the 1-nn classification technique with respect to the centroids of the clusters of PA) with the same effect which would result from going through the complete classification scheme. The use of some other incremental clustering techniques available for a clustering algorithm being a part



of Z* may also be possible, even if only approximately equivalent to carrying out the whole clustering procedure from scratch. It should also be stressed that the approach we propose, due to the adopted problem setting, will not, in general, take into account the criterion of generalization, which is so important for the design of any effective classifier. Furthermore, in the classification task there is a fixed number of classes, whereas the number of clusters for a given parameter set Z* may vary. For instance, if a new batch of data contains an outlier, dissimilar from the whole training dataset, the progressive-merger or density-based algorithms would generate an additional, isolated cluster. On the other hand, classifiers would try to assign it to one of the predefined categories. Hence, a standard classification procedure might be one of the interpretations, but, definitely, only a marginal one.

Another interpretation of our approach may be considered, which is also related to the task of classification, albeit of a rather specific type. First of all let us notice that the results of the application of our proposed approach may be understood either more narrowly, i.e. as a given Z*, obtained for the assumed space Ω, or more broadly, i.e. as the entire procedure, leading from some PA through the choices related to Ω and Z, down to PB = Z*(X). In both cases, such results may serve for a non-standard kind of classification: when we intend to partition different, usually much bigger, sets of data than the original set X. Under such circumstances, we

(a) do not expect an absolute or ideal accuracy of results over large data sets, but we wish to preserve the essential character of the original partition PA, and
(b) would like to check (and preserve) the general validity of the Z* (up to a certain fine tuning) for various, though definitely in some way similar, PA.
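The inductive shortcut mentioned above, assigning a new observation by 1-nn with respect to the centroids of the clusters of PA instead of re-running the whole procedure, may be sketched as follows; the data and clusters are illustrative, and a k-means-family Z* is assumed:

```python
# Sketch of the "inductive" shortcut: classify a new observation by 1-nn
# against the centroids of the clusters of PA, rather than re-clustering
# X ∪ {x} from scratch. Data and clusters are illustrative.

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def assign(x, centroids):
    # 1-nn with squared Euclidean distance to the cluster centroids
    d2 = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(centroids)), key=lambda j: d2(x, centroids[j]))

# clusters of PA over the original data set X
clusters = [
    [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)],
    [(5.0, 5.1), (5.2, 5.0), (5.1, 5.2)],
]
cents = [centroid(c) for c in clusters]
print(assign((4.8, 5.3), cents))  # the new point lands in the second cluster
```

For k-means-family results this matches the full re-clustering exactly in the typical case; for other algorithms such incremental assignment is, as noted above, only an approximation.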

The prerequisites for the scope of interpretations

In order to consider the potential entire scope of interpretations of the problem and the approach, let us take a look at an important aspect, related to the status of the prior partition PA. Two issues are essential here:

1. where does this partition come from (what is its relation to the data set X and what do we know about it)? and
2. what is the degree of certainty of this prior partition (degree of our belief in its validity)?

Although these two aspects are often tightly associated in practice, they are, in general, quite independent. So, there is a whole range of situations, arising in connection with the (definitely qualitative) answers to these questions (schematically illustrated in Fig. 2.2). The situations arising can be "ordered" from some sort of extreme, "absolute" case, which takes place when (see the limiting lines in Fig. 2.2)

(a) the partitioning PA has been imposed on the data set X fully irrespectively of the values of the particular attributes, except possibly for an identifier attribute which makes it possible to distinguish particular elements of X, i.e., PA has (at least apparently) nothing to do with the data set X (i.e. either it is simply given, and we do not know anything about the relation between PA and the attributes k = 1,…,m, or we know that the division was performed on the basis of the attribute(s) not accounted for in X), and
(b) the partition PA is fully certain and thus we are fully convinced of its validity: it is certainly a correct partition of X in a given context.

[Fig. 2.2 The scheme of potential cases of interpreting the paradigm of reverse clustering. The axes are the degree of independence of PA from X and the degree of credibility of PA, with limiting lines a (max) and b (max); the annotations read: "In this direction PA is increasingly based on information from X", "In this direction PA becomes increasingly 'just a hypothesis' to be verified or replaced", "PA can be treated as some sort of random partition, generated on the basis of attributes of X", "PA is well founded on the content of X", and "Standard classification task".]

It is quite obvious why these two form the "absolute" extreme: the certainty as to the validity of the partition PA comes from a source that is outside of the data set X and its specifications. In a way, then, the finding of Z* (and PB) would correspond to finding a kind of model, in the space Ex to which the xi belong, or of the "criterion" or "rule" that produced PA. Under such circumstances it may happen that the partition PA has nothing to do with the set X (i.e., its characterization with the values of the available attributes), and so there would be no hope of obtaining any good match between PA and PB. If, however, we are certain of PA, and it is based on the characteristics of X, then we deal with a true reconstruction of some ZA that produced PA (assuming this partition arose from a procedure that can be cast in the clustering framework).7 The latter case is, indeed, very much the one of classical, standard classification tasks.

The extreme case of (a) and (b) (see Fig. 2.2), i.e. when we are certain that PA is valid and "true", but we do not know (and will not know) anything about its connection with X (or even know that it is not related to X), is softened towards the partitions PA which, for instance, may be produced by experts, who,

(c) take into account, even if implicitly, the actual attributes characterizing data from X, and
(d) the opinions, appearing in the form of PA, can be put to doubt, or at least discussed;

thereby, we come to the situation in which PB may give rise to a feedback, enriching our knowledge that had led to PA.8

7 We put aside here the considerations involving the direct study of relations between the labels of xi, associated with PA, and the characteristics of xi, and we assume that the task at hand replaces such a study.

The scenarios that can be formulated for the thus outlined range of situations lead to problem statements of quite varying interpretations and (potential) utility. Thus, say, if we have little faith in PA, why bother referring to it? The reason may lie in the hypothetical character of PA, which is then subject to verification and/or modifications (hence, we would be looking for the potential support for such a hypothesis or its negation, provided the other elements of the rationale of the entire reverse clustering paradigm hold).

On the other hand, in the scenario arising when (a) and (d) are true, i.e., the given partition PA is somehow transcendent with respect to the set of attributes characterizing X, and, at the same time, we are not fully convinced as to its validity, we would be interested in whether it is possible to recover the partition PA using some Z*, but we should expect that the best PB obtained may be quite different from PA; moreover, we can think of PB as a legitimate replacement for PA, which may therefore be treated just as the starting point for getting to a "real" partition of X.

In yet another scenario, arising when we know or assume that (b) and (c) are true, i.e., when we treat the partition PA as a valid one and at the same time we know that it has been established with reference to the actual values of the attributes characterizing the data set X, we will be more concerned with recovering exactly PA, and the benefits from carrying out the reverse engineering procedure would be primarily related to getting a better insight into the meaning of the particular attributes and their role in justifying the real/valid partition PA. This case, then, as noted also in Fig. 2.2, is quite akin to the standard case of classification and the search for the best classifier.
Yet, what we obtain, in addition, in a way, to the original partition PA, which we may treat as "correct", is the mechanism for the relatively easy partitioning of other data sets of a similar character (sets of objects located in an analogous space and originating from a similar kind of process or phenomenon). On top of this, the mechanism obtained can be parameterised in a simple manner, quite in line with the logic of the vector Z.

At the end of this section a couple of remarks are perhaps due on the very orientation at clustering within the paradigm proposed. As already mentioned in Chap. 1, once we deal with some data set X and some partition of it, PA, and we wish to reconstruct, in some formal, procedural manner, the way this partition has been arrived at, this procedural manner being reflected in the parameters forming the vector Z, the choice of clustering, together with the relevant procedure, seems natural, for clustering is exactly meant to produce partitions of data sets. However, this partition, in the case of clustering, is directed by the primeval task of clustering, i.e. "to put together the similar (the affine) and to separate the dissimilar (the distant)", this formulation having definite formal and pragmatic (technical) consequences; see, e.g., Fortier and Solomon (1966), Kemeny and Snell (1960), Marcotorchino and Michaud (1979, 1982), Mulvey and Beck (1984), Rubin (1967), Wallace and Boulton (1968), or Owsiński (2020) for various perspectives on what this formulation leads to.

8 Actually, even in the "absolute" case, doubts may arise, if the situation resembles the one of multiple (and multiply) overlapping distributions.

Thus, it should be emphasized very strongly that although PA may be known to be directly related to X, and even result from it through a certain procedure, this is by no means to say that reverse clustering, as outlined here, would in principle be the way to reconstruct its provenience. Although broadly understood clustering appears to be a highly rational prerequisite to the partitioning of a set of objects (observations), it is not unique. The partition PA may, for instance, and contrary to the rationale of reverse clustering, have resulted from an opposite approach, namely "to put together the dissimilar and to separate the similar", as exemplified in Fig. 2.3, which is not just a spiteful example, but a procedure which is sometimes applied for definite reasons. Another important and relevant example is provided in Chap. 4 of the book, where PA is based on a simple categorization that is not contained in the data themselves, although it is supposed to be associated with them. This categorization is very simple and is based on just a few potential nominal (or, at best, ordinal) categories. It might have happened, though, that such a categorization leads to a structure that can hardly be reconstructed through clustering.

[Fig. 2.3 An illustration of the division of a set of objects according to the rule of "putting together the dissimilar and separating the similar": colours indicate the belongingness to three groups: blue, red and green]
Yet, the use of reverse clustering may also be helpful in situations such as the ones depicted above, since it is obvious that this procedure would be applied only in case the analyst were not aware of the source of PA, the results from reverse clustering showing, under such circumstances, the wide discrepancy between PA and the attainable PB.

Finally, let us add that each of the algorithms of cluster analysis is based on a somewhat different understanding of the basic task of clustering (including, in particular, the application of somewhat different criteria), and hence the search for PB



consists, in a way, in looking for the basic principle that is as close as possible to the one that stands behind PA, irrespective of whether it was applied directly or indirectly (the simplest example being that of k-means tending towards spherical or ellipsoidal clusters, while some of the hierarchical aggregation algorithms may tend towards the formation of complex chaining clusters). This interplay, depicted in Fig. 2.1, between PA, X and the principles and limitations of the individual clustering procedures ultimately yields the partition PB, which ought, therefore, to be interpreted in this quite complex perspective.

References

Adadi, A., Berrada, M.: Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 6, 52138–52160 (2018). https://doi.org/10.1109/ACCESS.2018.2870052
Arabie, P., Hubert, L.J., De Soete, G.: Clustering and Classification. World Scientific (1996)
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M., Perona, I.: An extensive comparative study of cluster validity indices. Pattern Recogn. 46(1), 243–256 (2013)
Bagherjeiran, A., Eick, C.F., Chen, C.S., Vilalta, R.: Adaptive clustering: obtaining better clusters using feedback and past experience. In: Fifth IEEE International Conference on Data Mining (ICDM’05), Houston, TX (2005). https://doi.org/10.1109/icdm.2005.17
Baker, V.E., O’Conor, D.E.: Expert system for configuration at Digital: XCON and beyond. Commun. ACM 32(3) (1989)
Bargiela, A., Pedrycz, W.: Granular Computing: An Introduction. Kluwer Academic Publishers, Boston (2002)
Berka, P., Laš, V., Svátek, V.: NEST: Re-engineering the compositional approach to rule-based inference. Neural Netw. World 5(04), 367–379 (2004)
Biecek, P., Burzykowski, T.: Explanatory Model Analysis. Explore, Explain and Examine Predictive Models. With Examples in R and Python. Chapman & Hall/CRC, New York (2021)
Bock, H.-H.: Classification and clustering: problems for the future. In: Diday, E., et al. (eds.) New Approaches in Classification and Data Analysis, pp. 3–24. Springer, Berlin (1994)
Böhm, C., Faloutsos, C., Pan, J.Y., Plant, C.: Robust information-theoretic clustering. In: KDD’06, Philadelphia, Pennsylvania, USA, 20–23 Aug 2006. ACM Press
Bouchachia, A., Pedrycz, W.: Data clustering with partial supervision. Data Min. Knowl. Discov. 12, 47–78 (2006)
Bramer, M.: Principles of Data Mining. Springer, New York (2007)
Brun, M., Sima, Ch., Hua, J.-P., Lowey, J., Carroll, B., Suh, E., Dougherty, E.R.: Model-based evaluation of clustering validation measures. Pattern Recogn. 40(3), 807–824 (2007)
Calinski, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat. 3, 1–27 (1974)
Câmpan, A., Şerban, G.: Adaptive clustering algorithms. In: Lamontagne, L., Marchand, M. (eds.) Canadian AI 2006. LNAI, vol. 4013, pp. 407–418. Springer, Berlin-Heidelberg (2006)
Casalino, G., Castellano, G., Mencar, C.: Data stream classification by dynamic incremental semi-supervised fuzzy clustering. Int. J. Artif. Intell. Tools (2019)
Castillo, E., Alvarez, E.: Introduction to Expert Systems: Uncertainty and Learning. Elsevier Science Publishers, Essex (1991)
Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A.: NbClust: an R package for determining the relevant number of clusters in a data set. J. Stat. Softw. 61(6), 1–36 (2014)
Chikofsky, E.J., Cross, J.H.: Reverse engineering and design recovery: a taxonomy. IEEE Softw. 7(1), 13–17 (1990). https://doi.org/10.1109/52.43044



Chiu, S.L.: Fuzzy model identification based on cluster estimation. J. Intell. Fuzzy Syst. 2, 267–278 (1994)
Choi, S.S., Cha, S.H., Tappert, Ch.C.: A survey of binary similarity and distance measures. Syst. Cybern. Inform. 8(1), 43–48 (2010)
Craven, M.W., Shavlik, J.W.: Extracting comprehensible concept representations from trained neural networks. In: Working Notes of the IJCAI’95 Workshop on Comprehensibility in Machine Learning, Montreal, Canada, pp. 61–75 (1995)
Cross, V.V., Sudkamp, Th.A.: Similarity and Compatibility in Fuzzy Set Theory: Assessment and Applications. Physica-Verlag, Heidelberg (2002)
Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1(2), 224–227 (1979)
Desgraupes, B.: Clustering Indices. CRAN-R-Project (2013). https://cran.r-project.org/web/packages/clusterCrit/…/clusterCrit.pdf
Doran, D., Schulz, S., Besold, T.R.: What does explainable AI really mean? A new conceptualization of perspectives (2017). arXiv:1710.00794
Ducimetière, P.: Les méthodes de la classification numérique. Rev. Stat. Appl. 18(4), 5–25 (1970)
Eilam, E.: Reversing: Secrets of Reverse Engineering. Wiley (2005)
Fisch, D., Gruber, T., Sick, B.: SwiftRule: mining comprehensible classification rules for time series analysis. IEEE Trans. Knowl. Data Eng. 23(5), 774–787 (2011)
Fisher, D.: Knowledge acquisition via incremental conceptual clustering. Mach. Learn. 2, 139–172 (1987)
Fortier, J.J., Solomon, H.: Clustering procedures. In: Krishnaiah, P. (ed.) Multivariate Analysis I, pp. 493–506. Academic Press, London (1966)
Gan, G., Ma, Ch., Wu, J.: Data Clustering: Theory, Algorithms and Applications. SIAM & ASA, Philadelphia (2007)
Gopalan, R., Li, R., Patel, V.M., Chellappa, R.: Domain adaptation for visual recognition. Found. Trends Comput. Graph. Vision 8(4) (2012). http://dx.doi.org/10.1561/0600000057
Govin, B., du Sorbier, A.M., Anquetil, N., Ducasse, S.: Clustering technique for conceptual clusters. In: Proceedings of the IWST’16 International Workshop on Smalltalk Technologies, Prague, Czech Republic, August (2016). https://doi.org/10.1145/2991041.2991052
Guadagnoli, E., Velicer, W.: Relation of sample size to the stability of component patterns. Psychol. Bull. 103, 265–275 (1988)
Gunning, D., Aha, D.: DARPA’s explainable artificial intelligence (XAI) program. AI Mag. 40(2), 44–58 (2019)
Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. J. Intell. Inform. Syst. 17(2–3), 107–145 (2001)
Hall, P., Gill, N.: An Introduction to Machine Learning Interpretability: An Applied Perspective on Fairness, Accountability, Transparency, and Explainable AI. O’Reilly Media, Inc. (2018)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York (2009)
Hennig, C.: What are the true clusters? Pattern Recogn. Lett. 64, 53–62 (2015)
Kacprzyk, J., Yager, R.R.: Linguistic summaries of data using fuzzy logic. Int. J. Gen. Syst. 30(2), 133–154 (2001)
Kacprzyk, J., Zadrożny, S.: Linguistic database summaries and their protoforms: towards natural language based knowledge discovery tools. Inf. Sci. 173(4), 281–304 (2005)
Kacprzyk, J., Zadrożny, S.: Protoforms of linguistic database summaries as a human consistent tool for using natural language in data mining. Int. J. Softw. Sci. Comput. Intell. 1(1), 100–111 (2009)
Kacprzyk, J., Zadrożny, S.: Computing with words is an implementable paradigm: fuzzy queries, linguistic data summaries and natural language generation. IEEE Trans. Fuzzy Syst. 18(3), 461–472 (2010)
Kacprzyk, J., Zadrożny, S.: Comprehensiveness of linguistic data summaries: a crucial role of protoforms. In: Moewes, Ch., Nürnberger, A. (eds.) Computational Intelligence in Intelligent Data Analysis, pp. 207–221. Springer, Berlin-Heidelberg (2013)


Kacprzyk, J., Zadrożny, S.: Fuzzy logic-based linguistic summaries of time series: a powerful tool for discovering knowledge on time varying processes and systems under imprecision. Wiley Interdisc. Rev. Data Min. Knowl. Discovery 6(1), 37–46 (2016)
Kacprzyk, J., Yager, R.R., Zadrożny, S.: A fuzzy logic based approach to linguistic summaries of databases. Int. J. Appl. Math. Comput. Sci. 10(4), 813–834 (2000)
Kacprzyk, J., Wilbik, A., Zadrożny, S.: Linguistic summarization of time series using a fuzzy quantifier driven aggregation. Fuzzy Sets Syst. 159(12), 1485–1499 (2008)
Kacprzyk, J., Wilbik, A., Zadrożny, S.: An approach to the linguistic summarization of time series using a fuzzy quantifier driven aggregation. Int. J. Intell. Syst. 25(5), 411–439 (2010)
Kamishima, T., Motoyoshi, F.: Learning from cluster examples. Mach. Learn. 53, 199–233 (2003)
Kamishima, T., Minoh, M., Ikeda, K.: Rule formulation based on inductive learning for extraction and classification of diagram symbols. Trans. Inform. Process. Soc. Japan 36(3), 614–626 (1995) (in Japanese)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
Kemeny, J., Snell, L.: Mathematical Models in the Social Sciences. Ginn, Boston (1960)
Kouw, W.M., Loog, M.: An introduction to domain adaptation and transfer learning. Technical report (2019). arXiv:1812.11806v2. Accessed 14 Jan 2019
Kuhn, A., Ducasse, S., Girba, T.: Enriching reverse engineering with semantic clustering. In: Proceedings of the 12th Working Conference on Reverse Engineering (WCRE'05), Pittsburgh, PA, pp. 1–14. IEEE Xplore (2005). https://doi.org/10.1109/wcre.2005.16
Laube, P.: Machine Learning Methods for Reverse Engineering of Defective Structured Surfaces. Springer (2020)
Lesot, M.-J., Moyse, G., Bouchon-Meunier, B.: Interpretability of fuzzy linguistic summaries. Fuzzy Sets Syst. 292, 307–317 (2016)
Libert, G.: Compactness and number of clusters. Control Cybern. 15(2), 205–212 (1986) (special issue on Optimization approaches in clustering, edited by J.W. Owsiński)
Lin, T.Y., Yao, Y.Y., Zadeh, L.A.: Data Mining, Rough Sets and Granular Computing. Springer (Physica) (2002)
Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J.: Understanding of internal clustering validation measures. In: 2010 IEEE International Conference on Data Mining, 911–916. IEEE (2010). https://doi.org/10.1109/icdm.2010.35
Marcotorchino, F., Michaud, P.: Optimisation en Analyse Ordinale des Données. Masson, Paris (1979)
Marcotorchino, F., Michaud, P.: Agrégation de similarités en classification automatique. Rev. Stat. Appl. 30(2) (1982)
Meila, M.: Comparing clusterings—an axiomatic view. In: Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany (2005)
Michalski, R.: A theory and methodology of inductive learning. Artif. Intell. 20(2), 111–161 (1983)
Miller, T.: Explanation in artificial intelligence: insights from the social sciences. Artif. Intell. 267, 1–38 (2017)
Milligan, G.W., Cooper, M.C.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2), 159–179 (1985)
Mirkin, B.: Mathematical Classification and Clustering. Springer, Berlin (1996)
Miyamoto, S., Ichihashi, H., Honda, K.: Algorithms for Fuzzy Clustering: Methods in c-Means Clustering with Applications. Studies in Fuzziness and Soft Computing, vol. 229. Springer, Berlin (2008)
Molnar, C.: Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Lulu Publisher (2020), eBook (GitHub, 2020-04-27). ISBN-13: 978-0244768522
Mulvey, J.M., Beck, M.P.: Solving capacitated clustering problems. Eur. J. Oper. Res. 18, 339–348 (1984)
Murdoch, W.J., Singh, C., Kumbier, K., Abbasi-Asl, R., Yu, B.: Interpretable machine learning: definitions, methods, and applications. Proc. Nat. Acad. Sci. USA 116(44), 22071–22080 (2019)


2 Reverse Clustering—The Essence and The Interpretations

Ntoutsi, I., Spiliopoulou, M., Theodoridis, Y.: Tracking cluster transitions for different cluster types. Control Cybern. 38(1), 239–260 (2009)
Owsiński, J.W.: Data Analysis in Bi-Partial Perspective: Clustering and Beyond. Springer (2020)
Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010)
Patil, C., Baidari, I.: Estimating the optimal number of clusters k in a dataset using data depth. Data Sci. Eng. 4, 132–140 (2019)
Pedrycz, W., Waletzky, J.: Fuzzy clustering with partial supervision. IEEE Trans. Syst. Man Cybern. B Cybern. 27(5) (1997)
Pedrycz, W.: Granular Computing: Analysis and Design of Intelligent Systems. Taylor and Francis (2013a)
Pedrycz, W.: Granular Computing and Intelligent Systems Design with Information Granules of Higher Order and Higher Type. Springer (2013b)
Pedrycz, W., Skowron, A., Kreinovich, V.Y. (eds.): Handbook of Granular Computing. Wiley (2008)
Perera, B., Zaslavsky, A., Christen, P., Georgakopoulos, D.: Context aware computing for the Internet of Things: a survey. IEEE Commun. Surveys Tutorials 16(1), 414–454 (2014). https://doi.org/10.1109/surv.2013.042313.00197
Pryke, A., Beale, R.: Interactive comprehensible data mining. In: Cai, Y. (ed.) Ambient Intelligence for Scientific Discovery. LNCS 3345, pp. 48–65. Springer (2004)
Quigley, J., Postema, M., Schmidt, H.: ReVis: reverse engineering by clustering and visual object classification. In: Proceedings 2000 Australian Software Engineering Conference, pp. 119–125, Canberra, ACT, Australia (2000). https://doi.org/10.1109/aswec.2000.844569
Raffo, A.: CAD reverse engineering based on clustering and approximate implicitization. erga.di.uoa.gr/meetings/RAFFOpresentation.pdf (2019)
Raffo, A., Barrowclough, O.J.D., Muntingh, G.: Reverse engineering of CAD models via clustering and approximate implicitization. Computer Aided Geometric Design 80, 101876 (2020)
Raja, V., Fernandes, K.J.: Reverse Engineering: An Industrial Perspective. Springer (2008). ISBN 978-1-84628-856-2
Rashidi, P., Cook, D.J., Holder, L.B., Schmitter-Edgecombe, M.: Discovering activities to recognize and track in a smart environment. IEEE Trans. Knowl. Data Eng. 23, 527–539 (2011)
Reiter, E., Dale, R.: Building Natural Language Generation Systems. Cambridge University Press (2000)
Reiter, E., Hunter, J., Mellish, C.: Generating English summaries of time series data using the Gricean maxims. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 187–196. ACM (2006)
Rendón, E., Abundez, I., Arizmendi, A., Quiroz, E.M.: Internal versus external cluster validation indexes. Int. J. Comput. Commun. 5(1), 27–34 (2011)
Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
Rokach, L., Naamani, L., Shmilovici, A.: Active learning using pessimistic expectation estimators. Control Cybern. 38(1), 261–280 (2009)
Rota, G.C.: The number of partitions of a set. Am. Math. Mon. 71(5), 498–504 (1964)
Rubin, J.: Optimal classification into groups: an approach for solving the taxonomy problem. J. Theoret. Biol. 15(1), 103–144 (1967)
Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1(5), 206–215 (2019)
Shi, B., Han, L.X., Yan, H.: Adaptive clustering algorithm based on kNN and density. Pattern Recogn. Lett. 104, 37–44 (2018). https://doi.org/10.1016/j.patrec.2018.01.020
Shim, K.S., Goo, Y.H., Lee, M.S., Kim, M.S.: Clustering method in protocol reverse engineering for industrial protocols. Int. J. Network Manage. 30(6), 1–15 (2020)


Sripada, S.G., Reiter, E., Hunter, J., Yu, J.: Segmenting time series for weather forecasting. In: MacIntosh, A., Ellis, R., Coenen, F. (eds.) Applications and Innovations in Intelligent Systems X, pp. 193–206. Springer (2003)
Sugar, C.A., James, G.M.: Finding the number of clusters in a data set: an information-theoretic approach. J. Am. Stat. Assoc. 98, 750–763 (2003). https://doi.org/10.1198/016214503000000666
Tervonen, J., Mikhaylov, K., Pieskä, S., Jamsä, J., Heikkilä, M.: Cognitive Internet-of-Things solutions enabled by wireless sensor and actuator networks. In: 5th IEEE International Conference on Cognitive Infocommunications (CogInfoCom 2014), 97–102. IEEE (2014)
Torra, V., Endo, Y., Miyamoto, S.: Computationally intensive parameter selection for clustering algorithms: the case of fuzzy c-means with tolerance. Int. J. Intell. Syst. 26(4), 313–322 (2011)
Travkin, O., von Detten, M., Becker, S.: Towards the combination of clustering-based and pattern-based reverse engineering approaches. In: Reussner, R.H., Pretschner, A., Jähnichen, S. (eds.) Software Engineering 2011 Workshopband (inkl. Doktorandensymposium), Fachtagung des GI-Fachbereichs Softwaretechnik, vol. LNI 184, 21–25 Feb 2011, Karlsruhe, Germany, 23–28. Springer (2011)
Tsai, C., Lai, C., Chiang, M., Yang, L.T.: Data mining for Internet of Things: a survey. IEEE Commun. Surveys Tutorials 16, 77–97 (2014)
Van Craenendonck, T., Blockeel, H.: Using internal validity measures to compare clustering algorithms. In: Poster from Benelearn Conference (2015). https://lirias.kuleuven.be/handle/123456789/504705
Vendramin, L., Campello, R.J.G.B., Hruschka, E.R.: Relative clustering validity criteria: a comparative overview. Wiley InterScience (2010). https://doi.org/10.1002/sam.10080
Wagner, S., Wagner, D.: Comparing clusterings—an overview. Technical Report 2006-04, Faculty of Informatics, University of Karlsruhe (TH) (2006)
Wagner, R., Scholz, S.W., Decker, R.: The number of clusters in market segmentation. In: Baier, D., Decker, R., Schmidt-Thieme, L. (eds.) Data Analysis and Decision Support, pp. 157–176. Springer (2005)
Wallace, C.S., Boulton, D.M.: An information measure for classification. Comput. J. 11(2), 185–194 (1968)
Xie, X.L., Beni, G.: A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 13(8), 841–847 (1991)
Xu, R., Wunsch, D.C. II: Clustering. Wiley/IEEE Press, Hoboken (2009)
Yu, J., Reiter, E., Hunter, J., Mellish, C.: Choosing the content of textual summaries of large time-series data sets. Nat. Lang. Eng. 13, 25–49 (2007). Cambridge University Press. https://doi.org/10.1017/s135132490500403
Zhang, Y., Chen, X.: Explainable recommendation: a survey and new perspectives (2018). arXiv:1804.11192
Zhao, Q., Fränti, P.: WB-index: a sum-of-squares based index for cluster validity. Data Knowl. Eng. 92, 77–89 (2014)
Zhao, Q., Xu, M., Fränti, P.: Sum-of-squares based cluster validity index and significance analysis. In: Kolehmainen, M., et al. (eds.) ICANNGA 2009, LNCS vol. 5495, 313–322. Springer (2009)
Zhou, Z.H.: Comprehensibility of data mining algorithms. In: Wang, J. (ed.) Encyclopedia of Data Warehousing and Mining, 190–195. IGI Global, Hershey (2005)

Chapter 3

Case Studies: An Introduction

3.1 A Short Characterisation of the Cases Studied

This short chapter is devoted to an introduction to and an overview of the cases studied with the reverse clustering approach and illustrated in the present book: their short characterisation and the indication of the respective interpretations of the problems formulated, and hence also of the results obtained, in the vein of the framework introduced in the preceding chapters. It should be emphasised that almost all of the cases presented here are based on concrete, real-life data. On the other hand, even though the problems in themselves are intuitively appealing and comprehensible, the individual interpretations of the problems and their solutions are not always absolutely obvious, for two kinds of reasons:
– first, those related to the potential formulation of the purpose or objective of the reverse clustering exercise, but also of the original problem, corresponding to the data set X and, especially, the partition PA; and
– second, those related to the actual status of both PA and X (validity, certainty) and to the knowledge of this status.

Yet, it is definitely possible to draw constructive conclusions from the results obtained, conditional, of course, on the circumstances mentioned above, relative to the "precision of interpretation". It should also be emphasised that these constructive conclusions relate both to the paradigm proposed here, reverse clustering, and to the substantive essence of the problems considered.

1. The motorway traffic case

In this case, described in Chap. 4 of the book, a set of actual data on car traffic (numbers of cars passing per hour) at a point on a state road has been analysed. The data have been collected for a variety of practical purposes, including long-term maintenance planning and safety evaluation, by an enterprise responsible for road servicing.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. J. W. Owsiński et al., Reverse Clustering, Studies in Computational Intelligence 957, https://doi.org/10.1007/978-3-030-69359-6_3

One of the purposes has been to appropriately model the traffic movement for the potential case in which a sensor registering the traffic at a given point is


not functional. Knowing the "hourly traffic profile model" would help in mitigating the loss of information related to a lack of current data from this particular sensor. (Of course, having data from other, neighbouring sensors might have constituted a basis for an appropriate adjustment, but, first, the general shapes of the hourly profiles would anyway have to be identified to make a meaningful analysis possible.) Thus, X was constituted by the data on hourly traffic collected during a definite period of time, here equivalent to a year: the numbers of cars passing per hour for the 24 h of the day, meaning 24 variable values plus the date. The initial partition, PA, was based on the classification according to the days of the week, the founding assumption being that the day of the week is the most important factor shaping the hourly traffic profile. Traffic analysis, modelling and design constitute a very popular and important subject of research, and readers more deeply interested in this subject are advised to consult, out of the truly vast literature, such representative works as Salter (1989), May (1990), Gazis (2002), Kerner (2004), Treiber and Kesting (2013) or Kessels (2019).

2. Chemicals in the natural environment

The second of the cases presented here (in Chap. 5) was based on data originating from Germany. Namely, the data set X was constituted by the averages of measurements of the contents of definite chemical elements in the herb layer of the counties (Kreise) of a province (Land) in Germany. The data concerned four different elements (i.e. four variables characterising the objects, the counties, in X) and, actually, no explicit initial partition PA was given.
Upon inspection of the data it turned out that the distributions of the four elements among the counties differ significantly in character, separating very clearly two pairs of elements: one pair, for which the distributions among the counties display a characteristic step-like character, and the other pair, for which one cannot perceive any such quasi-regularity. Hence, the initial partition PA of the set of counties into subsets was based on the values of the two step-like distributions. Then, attempts were made to reconstruct the way this partition may be obtained on the basis of X, either taken in its entirety or without the data on the elements which served to establish the partition PA. These attempts differed, therefore, as to their significance and interpretation, and the observed differences were highly telling for the potential ultimate use of the results. This use of the results is potentially important insofar as contamination of the natural environment, here the herb layer, is one of the most significant environmental phenomena of our age, bringing enormous consequences in terms of farming, human and animal health, ecosystem dynamics and resistance, extending to leisure and recreation and, more generally, ecosystem survival. At the same time, the treatment of the related problems is much more complicated than in, say, the case of atmospheric pollution. That is why any tangible and verifiable results, like those obtained here, may be of very high value.
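As a purely hypothetical illustration of how such a step-like distribution could be turned into a partition (the book does not specify the exact procedure, and the function name and gap threshold below are assumptions), one may sort the county averages of one element and cut wherever a jump exceeds a chosen gap:

```python
def step_groups(values, gap):
    # Hypothetical sketch: when a variable's sorted values form clear
    # "steps", objects can be grouped by cutting at jumps larger than
    # `gap`; a PA could then be based on the resulting group labels.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    group = 0
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] > gap:
            group += 1  # a jump in the sorted sequence starts a new group
        labels[cur] = group
    return labels
```

For a clearly stepped sequence such as county averages 1.0, 1.1, 5.0, 5.2 with gap = 1.0, the sketch yields two groups, mirroring the two "steps".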


3. Administrative units in Poland

Clustering-type analyses of administrative units abound. There are literally thousands of studies, conducted over decades, involving some sort of grouping of administrative units for various purposes and reasons, and in quite a variety of substantive and theoretical perspectives. These studies may concern very general issues of planning and socio-economic development (see, e.g., Senetra and Szarek-Iwaniuk 2020; or Brauksa 2013, among a multitude of instances), or very particular questions, like (to point out just a few selected examples): desertification (Zolfaghari et al. 2019), soil classification, very much in line with the previous case study (Minasny and McBratney 2007), health care incidence studies (Rodriguez-Villamizar et al. 2020; or Umer et al. 2018), or, say, the connection between crime incidence and economic development (Lombardo and Falcone 2011). See also Boschma and Kloosterman (2005) for a general treatment of the use of clustering in the spatial perspective. Hence, while the application of clustering to sets of administrative or other spatial units is a kind of routine undertaking, the analysis we propose here has, indeed, a very specific character.

So, regarding the investigations related to reverse clustering, quite a number of exercises have been carried out in the framework of the analysis of administrative units in Poland. The primary objects here are the communes or municipalities (some 2 500 in Poland). These objects are characterised in this particular study by a set of socio-economic and spatial features (around 20, the actual number depending on the exercise), taken from the official public statistics, and amounting to the corresponding data sets X.
The respective case analyses are presented in two separate chapters of this book, because of two essential differences between the two groups of analyses: first, they differed as to the spatial scale, the first group being mainly oriented at the provincial scale and the second mainly at the national scale; and second, they definitely differed as to the nature of the initial partition PA. Namely, in the first case (Chap. 6) attention was mainly focused on the "stiff" administrative division of the municipalities into three formal categories, having a rather loose connection with the socio-economic characteristics of the municipalities, i.e. the respective set X, since these categories arose in a historical-political process rather than through an analytically based decision-making procedure. (This, of course, does not mean that they are completely unrelated to the respective X.) In the second case considered (Chap. 7) we analysed the initial partition PA of the whole set of Polish municipalities, elaborated for the purposes of planning procedures, and hence based on explicit features of the municipalities. Although the process leading to this initial partition was described by its authors, and so its connection to actual data on municipalities is known, this process had a character that cannot be directly translated into clustering terms (see the corresponding remarks at the end of Chap. 2, accompanied by an extreme example, shown in Fig. 2.1). So, in this case, it was expected that the results obtained might point out possible biases of the initial partition and potential improvements or alternatives. (A similar, smaller study at the provincial level closes the preceding Chap. 6.)


4. The academic examples

Finally, in order to verify some specific hypotheses concerning the functioning of the reverse clustering approach, a series of academic examples was treated: the first related to the well-known Iris data of Anderson (1935) and Fisher (1936), and the second based on a set of artificial data examples, all of them two-dimensional, consisting of several dozen points in various configurations on the plane. In the majority of configurations of the artificial data it was obvious at first sight what the "correct" partition of the respective data set was. In the case of the Iris data the respective PA was constituted by the known division into the flower varieties. These experiments are briefly reported in Chap. 8 of the book. Hence, in this manner a particular PA could be established for the cases analysed. However, regarding the artificial data, in some cases evidently "nested" clusters (i.e. clusters contained in other clusters) were subject to analysis, forming several distinct "levels of resolution", and in these cases it was supposed that reverse clustering would tend to uncover one of these levels, depending upon the setting of the parameters in Z. The experimental calculations carried out with reverse clustering revealed, for the artificial data, that in the doubtless cases the proper partition PA could be fully reconstructed, while in the cases of "nested" clusters the solutions obtained usually "focused" on one level (usually also the one most visible by visual inspection), while the other levels were either not identified at all, or identified only for narrow intervals of quite extreme values of the controlled parameters.
The results of these experiments confirmed, for both kinds of data, on the one hand, the intuitive "correctness" of the results produced by reverse clustering, but, on the other hand, also indicated some avenues for future research, especially regarding more complex cases, like those with "nested" cluster structures, which are not at all rare in reality (similarly to overlapping clusters).

3.2 The Interpretations of the Cases Treated

As already mentioned in the short introduction of the preceding section, the cases commented upon in this book represent quite a variety of interpretations within the reverse clustering paradigm. These various interpretations are very roughly illustrated in Fig. 3.1, based on Fig. 2.1 from the preceding chapter. The locations corresponding to the particular kinds of exercises, shown in Fig. 3.1, are, definitely, largely subjective. In principle, one would have to propose a measure of credibility ("likelihood") of the partition PA, and another measure of its association with the data set X. However, somewhat paradoxically, we are exactly in the situation in which our knowledge of the two is either limited or none. Were this knowledge closer to allowing the formulation and estimation of the values of such measures, we would definitely not be motivated to resort to the reverse clustering paradigm.


[Fig. 3.1 Rough indication of interpretations of the cases treated against the framework of Fig. 2.1. The plane is spanned by the degree of credibility of PA (horizontal axis, with maximum "a (max)") and the degree of independence of PA from X (vertical axis, with maximum "b (max)"); the cases positioned in it are: Administrative units I, Chemicals in the soil, Administrative units II, Motorway traffic, Academic data, and the standard classification task.]

A very telling example is provided by the second case of administrative data: even if the procedure leading from the properties of the particular units to the proposed partition PA is, in principle, known, it can hardly be assessed, in view of the nature of this procedure, to what extent it can be considered associated with the unified data set on the units, X, to say nothing of the relation of this procedure to any potential clustering. All in all, though, the cases presented here definitely span quite some region of the space of potential interpretations, providing sufficient material for the evaluation of the utility of the reverse clustering approach in other analytic situations that can be cast in the form of the reverse clustering paradigm as considered in this book. Side by side with the issue of interpretation, in almost each of the studies reported some technical or methodological issues arise; these are discussed in the final sections of the respective chapters, along with the, definitely closely related, issue of interpretation.

References

Anderson, E.: The irises of the Gaspé Peninsula. Bull. Am. Iris Soc. 59, 2–5 (1935)
Boschma, R.A., Kloosterman, R.C. (eds.): Learning from Clusters: A Critical Assessment from an Economic-Geographical Perspective. Springer (2005)
Brauksa, I.: Use of cluster analysis in exploring economic indicator differences among regions: the case of Latvia. J. Econ. Bus. Manag. 1(1), 42–45 (2013)
Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugenic. 7(2), 179–188 (1936)
Gazis, D.C.: Traffic Theory. Springer (2002)
Kerner, B.S.: The Physics of Traffic. Springer, Berlin, New York (2004)


Kessels, F.: Traffic Flow Modelling—Introduction to Traffic Flow Theory Through a Genealogy of Models. Springer (2019)
Lombardo, R., Falcone, R.: Crime and economic performance: a cluster analysis of panel data on Italy's NUTS 3 regions. Working Paper no. 12/2011, Department of Economics and Statistics, University of Calabria (2011). https://www.ecostat.unical.it/
May, A.: Traffic Flow Fundamentals. Prentice Hall, Englewood Cliffs, NJ (1990)
Minasny, B., McBratney, A.B.: Incorporating taxonomic distance into spatial prediction and digital mapping of soil classes. Geoderma 142, 285–293 (2007)
Rodriguez-Villamizar, L.A., Rojas Díaz, M.P., Acuña Merchán, L.A., et al.: Space-time clustering of childhood leukemia in Colombia: a nationwide study. BMC Cancer 20, 48 (2020). https://doi.org/10.1186/s12885-020-6531-2
Salter, R.J.: Highway Traffic Analysis and Design. Springer (1989)
Senetra, A., Szarek-Iwaniuk, P.: Socio-economic development of small towns in the Polish Cittaslow Network—a case study. Cities 103, 102758 (2020)
Treiber, M., Kesting, A.: Traffic Flow Dynamics. Springer (2013)
Umer, M.F., Zofeen, Sh., Majeed, A., Hu, W.-B., Qi, X., Zhuang, G.-H.: Spatiotemporal clustering analysis of malaria infection in Pakistan. Int. J. Environ. Res. Public Health 15(6), 1202 (2018). https://doi.org/10.3390/ijerph15061202
Zolfaghari, F., Khosravi, H., Shahriyari, A., Jabbari, M., Abolhasan, A.: Hierarchical cluster analysis to identify homogeneous desertification management units. PLoS ONE (2019). https://doi.org/10.1371/journal.pone.0226355

Chapter 4

The Road Traffic Data

4.1 The Setting

The data set on vehicle traffic came from a measurement station on a state road. The individual objects (observations) xi were the days, characterized by the numbers of vehicles passing through the station every hour. So, the vectors xi were composed of m = 24 values, corresponding to the hours of the day, and xik were the numbers of cars passing during the kth hour on day i. Thus, altogether, the xi were the daily temporal profiles of traffic intensity according to hours. Besides, the days were labeled as to the day of the week, although this label was not included in the characterization of xi in terms of the analysed X.

The data came from a company engaged in servicing the road system, and one of the purposes of the analysis leading to the establishment of the initial partition of traffic intensity profiles, PA, was to obtain possibly well justified daily profiles of traffic intensity, so that in the situation of a lack of signal from the given measurement station, its hypothetical indications could be at least approximately recovered on the basis of "model profiles" (to be potentially adjusted to those from the neighbouring stations). (For an earlier account of these experiments, see Owsiński et al. 2017a.)

In this case, the initial partition PA was based on the classification of the days of the week, holidays etc., and reflected the experts' opinion on how these days ought to be treated exactly in terms of the distinction of daily profiles of traffic intensity. This case is illustrated below in Fig. 4.1 in accordance with the assumed partition PA, coming from the experts in the field, meaning the following partition based on the days of the week: {Monday}, {Tuesday, Wednesday, Thursday}, {Friday}, {Saturday}, {Sunday}, i.e. the number of clusters in PA, denoted pA, equals 5. The experts' opinion, reflected in the median profiles for the clusters forming the initial partition PA, shown in Fig. 4.1, is better justified by the more explicit rendition of Fig. 4.2, where the profiles corresponding to the particular days of the week are also shown on a diagram presenting the entire week, divided into hours, with different colours corresponding to the clusters forming PA. Note that although


Fig. 4.1 Median hourly profiles of traffic for the classes of the days of the week

Fig. 4.2 Hourly profiles of traffic intensity for individual hours of the week. Colours, assigned to successive days, denote the clusters, forming the initial partition PA

Mondays often tend to have distinct traffic patterns due to weekly commuting, this is not apparent in the data from the analyzed station.

In this study, we were simply interested in checking whether the daily traffic profiles themselves, when treated through reverse clustering, would yield the partition suggested by the experts (and, if so, under what parameters, forming Z, and, if not, what the differences are and what their potential justification is). In this case it can be assumed that the credibility of PA, even if by no means low, cannot be treated as very high either, and similarly for its association with the data in X. Thus, in a way, we aimed at checking the initial partition as a kind of "working hypothesis", and, if a different one were obtained, at checking whether it might perhaps be better justified.
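The expert partition PA described above amounts to a simple day-of-week labelling of the daily profiles; a minimal sketch (the mapping and function names are illustrative, not part of the book's software) is:

```python
# Sketch of the expert partition PA from Sect. 4.1: five classes
# {Mon}, {Tue, Wed, Thu}, {Fri}, {Sat}, {Sun}, so pA = 5.
DAY_TO_CLUSTER = {
    "Mon": 0,
    "Tue": 1, "Wed": 1, "Thu": 1,
    "Fri": 2,
    "Sat": 3,
    "Sun": 4,
}

def expert_partition(day_labels):
    """Return PA as a list of cluster ids for a sequence of weekday names."""
    return [DAY_TO_CLUSTER[day] for day in day_labels]
```

Each object xi (a 24-value daily profile) then carries the cluster id of its day, and the resulting labelling is what the reverse clustering procedure attempts to reconstruct from X alone.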


4.2 The Experiments

We performed two series of experiments, differing primarily in the selection of the search algorithms. The following two subsections summarize the results from these two series.

Experiment series 1. In this series the following assumptions were made: the algorithms used came from the k-means and hierarchical agglomeration families, in the implementation provided by the R package "cluster" of Maechler et al. (2015), based on Kaufman and Rousseeuw (1990). The "pam" (partitioning around medoids) method is a variant of k-means with the number of clusters, "k" (p in our notation), as the sole parameter. The algorithm "agnes" (agglomerative nesting) is a hierarchical clustering method, parameterized by the number of clusters, p, and a scalar a. The latter is used to obtain the coefficients of the Lance-Williams formula (Kaufman and Rousseeuw 1990) in a highly simplified manner, as a1 = a2 = a, b = 1 − 2a, and c = 0, where the coefficients a1, a2, b and c appear in the original Lance-Williams formula, quoted in this book in Sect. 1.3. The distances between the data points xi, xj ∈ X have been defined as weighted Minkowski distances, i.e.

d(xi, xj) = (Σk wk |xik − xjk|^h)^(1/h),    (4.1)

where wk are the weights assigned to the variables, which can assume values from 0 to 1. In this manner, both the vector {wk}k and the exponent h could be treated as parameters defining the search space of the values of Z, along with the parameters proper of the clustering algorithms.

To solve the optimization problem we used the Differential Evolution (DE) metaheuristic (Storn and Price 1997), an evolutionary global optimization algorithm, in the implementation of the "DEoptim" R package by Mullen et al. (2011). Variants of DE are among the state-of-the-art real-parameter global optimizers; their various modifications and applications are surveyed in Das and Suganthan (2011) and, more recently, in Das, Mullick and Suganthan (2016).

Chromosomes are represented as vectors of real numbers Z = (π, a, v, w1,…,wn). After rounding, the first parameter represents the number of clusters, p = round(π); the second is the parameter a, related to the Lance-Williams formula (and not used in the k-means algorithm); the third, v, is the exponent h of the Minkowski distance; and the subsequent ones are the weights w1,…,wn of the variables, cf. (4.1). The search space was defined by constraining the possible values of the individual elements of Z to p ∈ [1, 10], a ∈ [0, 1], h ∈ [0.1, 4], and wi ∈ [0, 1], and by limiting the choice of clustering algorithms to "pam" and "agnes", as previously mentioned.
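The weighted Minkowski distance of Eq. (4.1) can be rendered compactly as follows (a minimal Python illustration, not the R code actually used in the experiments):

```python
def weighted_minkowski(x, y, w, h):
    # d(x, y) = (sum_k w_k * |x_k - y_k|**h) ** (1/h), cf. Eq. (4.1);
    # w holds per-variable weights in [0, 1], h is the exponent.
    return sum(wk * abs(xk - yk) ** h
               for wk, xk, yk in zip(w, x, y)) ** (1.0 / h)
```

With all weights equal to 1 and h = 2 this reduces to the Euclidean distance, while a zero weight removes the corresponding variable from the comparison, which is how the L1-regularized weight search can effectively discard attributes.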

4 The Road Traffic Data

Table 4.1 Summary of results for the first series of experiments with traffic data

Algorithm   Optimized parameters (Z)   Adjusted Rand index   Rand index
pam         p, h, w1,…,w24             0.600                 0.821
pam         p                          0.581                 0.823
agnes       p, a, w1,…,w24             0.654                 0.850
agnes       p, a                       0.625                 0.837
The distinctive feature of classical DE is the differential mutation operator, which boils down to the addition of a scaled difference between two randomly chosen individuals, Z2 and Z3, to a third randomly picked vector Z1:

Z1' = Z1 + F(Z2 − Z3),    (4.2)

where F is a scalar parameter, known as the scaling factor, with typical values F ∈ [0.4, 1). Since the creation of new individuals is based on differences between population members, the search direction adapts to the local shape of the objective function.

In all experiments the fitness function was identified with the Rand index for the partition PA and a partition PB, resulting from carrying out clustering on the basis of a given vector of parameters Z. For each dataset and algorithm we have run the DE algorithm for 1,000 iterations. In Table 4.1 we provide the best values of the Rand index and of its adjusted variant (see Chap. 1, Sect. 1.4) obtained. For this setting we performed two series of calculations: in the first one, only the optimal values of the basic parameters of the algorithms (i.e. p and a) were sought; in the second, we have also optimized the exponent h of the Minkowski distance measure and the vector of weights of the attributes, w.

In all cases treated, the optimization of the whole configuration Z of the clustering procedure (i.e. the second series of calculations mentioned above) leads to better results, sometimes by a significant margin. We have also observed that the use of hierarchical clustering allows for a more accurate reconstruction of the reference partitions in our datasets than the use of k-means (in the form of "pam"). The optimal value of the exponent h varied in the solutions obtained over the range from 0.8 to 2.3, which confirms that the whole family of distance measures is useful, despite the loss of formal properties for values of h below 1. This is in agreement with the results of De Amorim (2015), in which feature rescaling with the use of the Lp norm proves useful for hierarchical clustering. The regularization with the L1 norm resulted in a selection of attribute weights in the traffic dataset which implied some of them to be zero.
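The DE generation step just described, i.e. the mutation of Eq. (4.2) followed by crossover and greedy selection, can be sketched as follows. This is a minimal DE/rand/1/bin illustration of ours, not the DEoptim implementation; here `fitness` plays the role of the Rand index and is to be maximized.

```python
import random

def de_step(population, fitness, F=0.7, CR=0.9):
    """One generation of classical DE (DE/rand/1/bin) maximizing `fitness`."""
    new_population = []
    for i, target in enumerate(population):
        # Three distinct individuals, different from the current target.
        z1, z2, z3 = random.sample([p for j, p in enumerate(population) if j != i], 3)
        # Differential mutation, Eq. (4.2): add the scaled difference to Z1.
        mutant = [a + F * (b - c) for a, b, c in zip(z1, z2, z3)]
        # Binomial crossover; jrand guarantees at least one mutant coordinate.
        jrand = random.randrange(len(target))
        trial = [m if (random.random() < CR or j == jrand) else t
                 for j, (t, m) in enumerate(zip(target, mutant))]
        # Greedy selection: keep the better of target and trial.
        new_population.append(trial if fitness(trial) >= fitness(target) else target)
    return new_population
```

With the greedy selection step, the best fitness value in the population never decreases from one generation to the next.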
Table 4.2 provides highly interesting results for the traffic data, dealt with using the full Z and choosing the "agnes" algorithm, showing, in particular, the correspondence between the clusters in PA, i.e. Aq, and the clusters in PB, i.e. Bq.

Table 4.2 Results for traffic data for the entire vector of parameters Z, with the use of hierarchical aggregation (values of Rand index = 0.850, of adjusted Rand = 0.654). The upper part of the table shows the coincidence of patterns in particular Aq, based on the days of the week, and obtained Bq

Days of the week, Aq:   Clusters obtained, Bq:                 Totals   "Errors"
                           1      2      3      4      5
Friday                     1      2     42      0      3       48        6
Monday                    45      2      0      0      2       49       47
Saturday                   0      1      0     46      1       48        2
Sunday                     0      1      0      1     47       49        2
Tu-We-Th                 140      3      0      0      4      147        7
Totals                   186      9     42     47     57      341       64

Parameters:
p   5
a   0.78
h   0.91

w1 through w24: weights of variables, corresponding to the consecutive hours of the day:
w1−w6    0.47   0.45   0.79   0.84   0.33   0.96
w7−w12   1.00   0      0.62   0.90   0.48   0.53
w13−w18  0.07   0.09   0.17   0.58   0      0.25
w19−w24  0.30   0.43   0.48   0.83   0      0.90

Thus, this result, to a large extent induced by the conditions of computation rather than by a "natural" tendency of the method, shows that perhaps the expert's opinion as to the original classes ought to be verified (Monday traffic intensity profiles being classified along with those for Tuesday, Wednesday and Thursday). This result, not only interesting, but also telling in practical terms, is strongly corroborated by the slightly "worse-off", but still close-to-optimal, results obtained with the "pam" algorithm, shown in Table 4.3. What is also highly interesting in the case of Table 4.3 is that the "pam" algorithm "got rid" of quite a proportion of the variables (seven variable weights being equal to zero) and still obtained very valuable results.

The result from Table 4.2 is illustrated graphically in Fig. 4.3, where, indeed, it can be clearly seen that the hourly traffic distribution on Mondays is not at all very different from the distributions for the other weekdays except Friday. This is exactly the instance of the potential feedback from the reverse clustering that we mentioned before.

Cluster 2 is visualized in Fig. 4.3 as subplot (e). In this case, apart from the typical morning and afternoon peaks, we observe a very high traffic intensity late in the night. The days for which such an unusual phenomenon is observed are scattered throughout the year and represent different days of the week. This type of traffic intensity curve could constitute an effect of measurement errors and has therefore been subject to additional plausibility checks. However, this can also be, given its relatively systematic character, a true-to-life case of feast

Table 4.3 Results for the traffic data obtained with the "pam" algorithm

Days of the week, Aq:   Clusters obtained, Bq:                 Totals   "Errors"
                           1      2      3      4      5
Friday                     3      2     42      0      1       48        6
Monday                    47      2      0      0      0       49       47
Saturday                   0      1      0     46      1       48        2
Sunday                     7      0      0      2     40       49        9
Tu-We-Th                 142      3      0      0      2      147        5
Totals                   199      8     42     48     44      341       69

Parameters:
k   5
h   1.29

w1 through w24: weights of variables, corresponding to the consecutive hours of the day:
w1−w6    0.76   0.00   1.00   0.72   0.18   0.74
w7−w12   0.61   0.85   0.85   0.50   0.53   0.19
w13−w18  0.84   0.05   0.73   0.28   0.21   0.00
w19−w24  0.00   0.00   0.00   0.00   0.00   0.75

or special event days, when people tend to, say, come back home late at night. If it were so (and, definitely, such an explanation appears quite plausible), then the identification of such a cluster brings, indeed, valuable additional knowledge, since the traffic on such occasions can, potentially, also be characterized by other, "special" features (e.g. those associated with the degree of safety).

A separate remark can be made here, concerning the treatment of this case in terms of "anomalies". In the case of the traffic intensity data, anomalies are typically detected through the assessment of specially trained operators. The ability to approximate the partitions of the daily profiles by means of an appropriately tuned clustering algorithm makes it possible to automate this procedure. Anomaly detection differs from classification in that the training samples consist nearly entirely of typical observations, and a new observation can be anomalous in a great variety of ways. This case thus represents a one-class learning problem, consisting in identifying whether a given observation is typical, rather than in distinguishing between different classes.

Experiment series 2. In this series three kinds of clustering algorithms have been used: DBSCAN, the classical k-means, and the general progressive merger, as defined by the entire Lance-Williams formula. The evolutionary algorithm that served to find the partition PB, composed of clusters Bq, has been developed by one of the present authors (Stańczak 2003). The use of specialized genetic operators requires the application of a selection method to execute them in all iterations of the algorithm. The traditional method, with a small probability of mutation and a high probability of crossover, is not applicable

Fig. 4.3 Visual interpretation of clusters described in Table 4.2

in this case, because the number of operators in use is greater than two and their properties cannot be easily described as either exploration or exploitation (often deemed to be realized by, respectively, the mutation and crossover operators). In the approach used here, following Stańczak (2003), it is assumed that an operator that has so far generated good results should have a higher probability of execution and affect the population more frequently. But it is very likely that an operator which is proper for one individual would give worse effects for another one, for instance because of the latter's location in the domain of possible solutions. Thus, each individual may have its own preferences.

So, each individual carries, in addition to the encoded solution, a vector of floating-point numbers, one number per genetic operation, constituting a measure of quality of the respective genetic operator (a quality factor). The higher the factor, the higher the probability of using the operator. The ranking of the qualities becomes the basis for computing the probabilities of appearance and execution of the particular genetic operators: a simple normalization of the vector of quality coefficients turns it into a vector of the operators' execution probabilities. This set of probabilities can be treated as the base of experience of each individual, and according to it an operator is chosen in each epoch of the algorithm. Due to the experience thus gathered, an individual can maximize the chances of survival of its offspring.

The essential criterion used was simply the number of "misclassified" objects for a given data set, i.e. with respect to the prior PA (this number being, of course, minimized), in practical terms equivalent to the original Rand index (due to the correspondence of clusters and categories).
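The normalization of quality factors into execution probabilities, described above, amounts to a roulette-wheel choice of an operator. A minimal sketch of ours follows (not the implementation of Stańczak 2003):

```python
import random

def pick_operator(qualities):
    """Roulette-wheel choice of a genetic operator for one individual.
    Normalizing the vector of quality factors turns it into execution probabilities."""
    total = sum(qualities)
    r, acc = random.random(), 0.0
    for op_index, q in enumerate(qualities):
        acc += q / total
        if r < acc:
            return op_index
    return len(qualities) - 1   # guard against floating-point round-off
```

After an operator has been applied, its quality factor would then be raised or lowered according to the result achieved, so that each individual accumulates its own base of experience.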
Because of the differences related to the implementations of the evolutionary procedures, to the content of the parameter vectors Z, as well as to the form of the criterion, somewhat different results have been obtained than in series 1 of the experiments (although, as in series 1, all variables have been explicitly weighted, and the distance exponent has varied as well). For brevity, we characterize these results below in more general terms.

The results obtained with DBSCAN, parameterized with the number of neighbors and the maximum distance, have been the poorest among the clustering algorithms tried out: in the "best" partitions obtained with DBSCAN, no less than 88 objects were misclassified, out of the total of 341. For the classical k-means the results, with respect to the criterion assumed, but also quite intuitively, have been much better: 57 misclassified objects, with the "optimum" number of clusters equal to 5, resulting from the operations related to the treatment of particular clusters (the empty and close-to-empty ones). One can compare these results with the 64 misclassified objects, shown in Table 4.2 for series 1, where the "proper" number of clusters, i.e. 5, was actually forced for better comparability with the initial partition PA. In the case of the hierarchical merger procedures, the difference with series 1 has additionally involved the complete parameterization of the procedure, according to the Lance-Williams formula (five coefficients with varying values, subject only to some quite "liberal" constraints). The results obtained are comparable with those for the k-means in terms of their "quality": 60 misclassified objects (with 6 clusters as the "optimum" number).
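The "misclassified objects" criterion, given the correspondence of clusters and categories, amounts to finding the one-to-one cluster-to-category matching that maximizes the number of correctly placed objects. A small brute-force sketch of ours (feasible for the table sizes in this chapter); the matrix below restates Table 4.2:

```python
from itertools import permutations

def misclassified(contingency):
    """Minimum number of misclassified objects over all one-to-one assignments
    of clusters (columns) to categories (rows). Brute force over permutations."""
    n = sum(sum(row) for row in contingency)
    n_cols = len(contingency[0])
    best = 0
    for perm in permutations(range(n_cols), len(contingency)):
        best = max(best, sum(row[col] for row, col in zip(contingency, perm)))
    return n - best

# Rows: Friday, Monday, Saturday, Sunday, Tu-We-Th; columns: clusters B1..B5.
table_4_2 = [[1, 2, 42, 0, 3],
             [45, 2, 0, 0, 2],
             [0, 1, 0, 46, 1],
             [0, 1, 0, 1, 47],
             [140, 3, 0, 0, 4]]
print(misclassified(table_4_2))   # → 64
```

The count of 64 reproduces the "Errors" total of Table 4.2: the best matching assigns cluster 1 to Tu-We-Th, which is exactly why almost all Monday profiles count as errors.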


4.3 Conclusions

Substantive interpretation. First, let us note that the experiments with the road traffic data ended with a success from the substantive point of view. It was shown that the expert-based categorization of the daily traffic intensity profiles can be improved for the analysed measurement station, while keeping, to quite a high extent, to the same basic classification principle, i.e. the reference to the days of the week. These experiments proved that the expert-defined partition of the daily traffic profiles into 5 clusters, according to the days of the week, cannot be assumed to be the result of a disciplined application of a clustering algorithm belonging to an assumed class. The "identified" new class of profiles constituted an interesting subject of additional analysis, meant to verify whether these are "anomalous" profiles, resulting from an error or being quite incidental, or whether they truly represent a small, "exceptional" class of days. In this perspective, the best result for k-means in the second series of experiments, showing 5 clusters, ought also to be considered as indicative. All in all, while we certainly have not been dealing with a "classification" case, the suggested broader interpretation of the reverse clustering approach turned out to be quite adequate here, with PA treated as a "leading hypothesis", having a partial relation to the content of the data set X (the initial partition being devised with the data from X in mind, but along the lines, a definite division of the days of the week, that are not explicitly associated with the traffic intensity profiles themselves).

Technical aspects. It turned out that the most important influence on the quality and character of the results obtained was exerted by the choice of the clustering algorithm, and then by its parameters. In this particular case, the hierarchical aggregation and the k-means type algorithms turned out to bring results of similar quality.
The local density algorithm DBSCAN yielded worse results, which is not surprising, as this algorithm is meant for large sets of data, for which mainly dense local groups are identified, rather than for smaller sets, for which more refined distinctions may be significant. On the other hand, even though optimization with respect to the variable weights and the distance exponent proved to be rational in the sense of quite clear choices made (e.g. zero weights of some variables, exponent values well beyond the range of [1, 2]), their influence on the ultimate results appears to be rather limited. It can, therefore, be supposed that the criterion of similarity of the two partitions (in whatever of its forms) might be quite "flat" with respect to the latter variables. This could be expected, since usually some variables are to a definite extent redundant, but the selection of the "leading" ones and the "echoing" ones is not performed uniquely by the different combinations of evolutionary and clustering algorithms. Similarly, in many tasks a limited change of the Minkowski distance exponent has very little influence on the results obtained.


As this case was the first one treated through an extended series of experiments, the above conclusions, although appearing quite plausible, were to be verified in the other cases, reported in the subsequent chapters of the book.

References

Das, S., Suganthan, P.N.: Differential evolution: a survey of the state-of-the-art. IEEE Trans. Evol. Comput. 15(1), 4–31 (2011)

Das, S., Mullick, S.S., Suganthan, P.N.: Recent advances in differential evolution – an updated survey. Swarm Evol. Comput. 27, 1–30 (2016)

De Amorim, R.: Feature relevance in Ward's hierarchical clustering using the Lp norm. J. Classif. 32, 46–62 (2015). https://doi.org/10.1007/s00357-015-9167-1

Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)

Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K., Studer, M., Roudier, P., Gonzalez, et al.: cluster: Cluster Analysis Basics and Extensions. R package, version 2 (2015)

Mullen, K.M., Ardia, D., Gil, D.L., Windover, D., Cline, J.: DEoptim: an R package for global optimization by differential evolution. J. Stat. Softw. 40(6), 1–26 (2011). https://www.jstatsoft.org/v40/i06/

Owsiński, J.W., Kacprzyk, J., Opara, K., Stańczak, J., Zadrożny, S.: Using a reverse engineering type paradigm in clustering: an evolutionary programming based approach. In: Torra, V., Dalbom, A., Narukawa, Y. (eds.) Fuzzy Sets, Rough Sets, Multisets and Clustering, pp. 137–155. Springer, Heidelberg (2017a). ISBN 978-3-319-47556-1; https://doi.org/10.1007/978-3-319-47557-8

Stańczak, J.: Biologically inspired methods for control of evolutionary algorithms. Control Cybern. 32(2), 411–433 (2003)

Storn, R., Price, K.: Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 11(4), 341–359 (1997)

Chapter 5

The Chemicals in the Natural Environment

5.1 The Data and the Background

The subsequent case, treated in the framework of the reverse clustering paradigm, concerned the content of definite chemicals in the herb layer of a set of administrative units in Germany. A broader description of this particular study is provided in Owsiński et al. (2017b). This illustrative case study was based on the data provided by the courtesy of Dr Rainer Brüggemann (see Brüggemann et al. 1998). The particular data set obtained by us has the convenience of having been the subject of several analytical and comparative studies, which adds to its value as a testbed (see, for instance, De Loof et al. 2008, or Brüggemann, Mucha and Bartel 2012), as well as to the comparative value of the present example.

The data reflect the selected chemical characterizations, namely the herb layer pollution levels in terms of the total concentrations of four chemical elements: Pb, Cd, Zn and S (corresponding to the variables), in mg/kg of dry weight, for n = 59 areas in Baden-Württemberg in Germany, deemed uniform in this respect. The entirety of the data used in the illustrative calculations is provided in Table 5.1.

In this study we treat the "areas" inside the land of Baden-Württemberg as the objects, described by the data given in Table 5.1. We are looking for a categorisation of the areas according to the levels of pollution (chemical content) shown in Table 5.1. Yet these data, by themselves, neither provide nor contain any initial partition or classification of the areas, just the variable values. Thus, for the purposes of this exercise, a simple but straightforward hypothesis was formulated: that it is possible to classify the areas on the basis of perhaps just one or two indicators (chemical element concentrations) out of the four available. The feasibility of such a hypothesis was verified by plotting the values from the table in an increasing order along the "areas" for each of the elements separately.

The results are shown here in the four consecutive figures (i.e. Figs. 5.1 through 5.4).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
J. W. Owsiński et al., Reverse Clustering, Studies in Computational Intelligence 957, https://doi.org/10.1007/978-3-030-69359-6_5


Table 5.1 Pollution data for Baden-Württemberg (Germany), used in the exemplary calculations: total concentrations, in mg/kg of dry weight (Pb: lead, Cd: cadmium, Zn: zinc, S: sulphur), for the 59 areas; columns: Areas, Pb, Cd, Zn, S

Source: LfU Baden-Württemberg, after Brüggemann et al. (1998), courtesy of Dr Rainer Brüggemann

When looking at Figs. 5.1 through 5.4 one should remember that in each of these illustrations the objects are ordered according to the value of the concentration of the given element, so in each case the horizontal axis corresponds to a different sequence of objects (i.e. "areas").

Fig. 5.1 Concentration levels for Pb: areas in the order of increasing Pb concentrations

Fig. 5.2 Concentration levels for Cd: areas in the order of increasing Cd concentrations

The character of the respective distributions is also illustrated in the two following figures, Fig. 5.5a and b, which show the respective histograms and the pairwise distributions. It is clear that there are essential differences between the distributions of, on the one hand, Zn and S, and, on the other hand, Pb and Cd. Already this observation provides important information, which might be of high value in terms of environmental policy, but we intend to go in our study well beyond this straightforward conclusion.

Fig. 5.3 Concentration levels for Zn: areas in the order of increasing Zn concentrations

Fig. 5.4 Concentration levels for S: areas in the order of increasing S concentrations

5.2 The Procedure: Determining the Partition PA

On the basis of the illustrations provided, and especially Figs. 5.1 through 5.4, it can be easily seen that, indeed, the levels for Zn and S appear to indicate clearly distinct "categories" (of "areas"): respectively, two and three such hypothetical categories for these two elements (this is also indicated by the different colours appearing in Fig. 5.5). One might, therefore, expect that there exist (at most) two times three = six mixed categories, based on the values corresponding to these two elements, say, defined by the conditions:

Fig. 5.5 The distribution of points ("areas") in the space of concentrations for (a) the particular elements and pairwise; (b) enlarged for Zn and S (upper box) and for Pb and Cd (lower box); see the text further on for the interpretation of colours

Table 5.2 Numbers of areas in the classes, defined for the Zn and S contents in the herb layer

Elements and their concentrations:   S < 1000   S 1000–2500   S > 2500
Zn < 100                                 0           48            4
Zn > 100                                 5            2            0

for Zn: 1: < 100 and 2: > 100, and for S: 1: < 1000, 2: between 1000 and 2500, and 3: > 2500.

After applying these conditions to the data from Table 5.1 one obtains, actually, only four categories, since two out of the potential six are empty (for the numbers of areas in the particular classes thus defined, see Table 5.2). Thus, having the 59 areas partitioned among 4 categories (namely, 5 areas in category 1, 48 in category 2, 2 in category 3, and 4 in category 4, according to Table 5.2), that is, having the thus determined partition PA, we could perform the exercise consisting in the attempt to obtain clusters that would be as close to these categories as possible. If we obtained such clusters, well approximating the categories, especially if we did this on the basis of data other than those for the two elements used to define the initial partition, it would mean that, on the one hand, the categories are sound, and, on the other hand, that they can be reconstructed effectively by the reverse clustering approach, providing the basis for a hypothetical broader mechanism of categorization. Figure 5.5, however, suggests that this might, indeed, be difficult to achieve.
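The thresholds above can be written down as a simple rule. The sketch below is our own illustration; in particular, the mapping of the four non-empty (Zn class, S class) combinations to the category numbers 1–4 is our inference from the class sizes reported in the text, not something stated explicitly in the book.

```python
def pa_category(zn, s):
    """Assign an area to a PA category from its Zn and S concentrations (mg/kg).
    Thresholds from the text; category numbering inferred from the reported class sizes."""
    zn_class = 1 if zn < 100 else 2
    s_class = 1 if s < 1000 else (2 if s <= 2500 else 3)
    # The four non-empty combinations; the two remaining ones do not occur in the data.
    mapping = {(2, 1): 1, (1, 2): 2, (2, 2): 3, (1, 3): 4}
    return mapping.get((zn_class, s_class))

# E.g. a typical area with Zn around 30 and S around 1700 falls into the large category 2.
```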

5.3 The Procedure: Reverse Clustering

For the purposes of this book we shall only cursorily illustrate the exercises and their results, mainly in order to show the functioning and effectiveness of the approach, as well as to indicate the possibility of drawing substantive conclusions.

Series 1 of calculations. The first series of calculations was performed with the evolutionary algorithm developed by one of the authors (Stańczak 2003). The calculations were performed with the k-means, hierarchical agglomerative and DBSCAN clustering algorithms. The parameter values sought for each of these were (besides the algorithmic parameters themselves) the weights of the variables, the exponent of the Minkowski distance, and the number of clusters.

The k-means algorithm. For the k-means method, "optimized" for the data set X concerning all four chemical elements, the weights of the respective variables were:

Table 5.3 Contingency table for the partition PA assumed and the one obtained in Series 1 of calculations, PB, with the k-means algorithm and data only for Pb and Cd

Initial, PA:   Categories obtained, PB:
                  1      2
1                 5      0
2                 0     48
3                 2      0
4                 1      3

Pb: 0.28, Cd: 0.06, Zn: 0.40, S: 0.26. This result comes, indeed, to some extent as a surprise, since the weights of Pb and Cd could have been close or equal to zero. Evidently, however, the perfect fit of PA and PB could be achieved without zeroing the weights of the two "additional" variables. The Minkowski exponent obtained was 0.49, and the number of clusters was, indeed, 4, with no misclassifications, meaning that the assumed partition PA was reconstructed perfectly.

However, if we gave up the two variables underlying the initial partition and optimized only with respect to the limited data set X, based on the data for Pb and Cd, the results got, naturally, worse. The weights of the two variables were, respectively, 0.45 and 0.55, and the value of the Minkowski exponent was 3.54. Only two clusters were obtained, with 6 misclassified areas. The respective contingency matrix is provided in Table 5.3. Thus, even if 6 objects (10%) were misclassified, and two clusters were obtained instead of four, the result is striking in that not only is the partition obtained for a different data set than the one used to determine PA so close to the original, but also only one of the original clusters (no. 4) got split between the two clusters forming the partition PB.

The hierarchical aggregation algorithm. The results obtained with the use of the general agglomerative clustering algorithm in the case when all the variables were accounted for were also perfect in terms of the lack of misclassifications, and the number of clusters was four. However, since a different type of algorithm, with different parameters, was used, the details of the solution obtained were also different. Thus, for instance, the weights of the variables were: Pb: 0.000, Cd: 0.333, Zn: 0.333, and S: 0.333 (or, alternatively, Pb: 0.333, Cd: 0.000, Zn: 0.333, and S: 0.333, also without misclassifications).
Here, as before, it appears that it was not necessary to assign non-zero weights only to the variables associated with Zn and S in order to obtain the perfect reconstruction of the assumed partition PA. (In addition, of course, the coefficients of the Lance-Williams formula were also obtained.) Now, for the case when only the data for Pb and Cd were used, the results again contained definite misclassifications (altogether five of them), and the number of clusters this time was 5, as shown in Table 5.4.

DBSCAN. The third clustering method tried out was DBSCAN. This method led to the worst results for the case of all four variables considered, although two misclassifications still constitute quite a plausible result. Three clusters were obtained, and the respective contingency matrix is given in Table 5.5.

Table 5.4 Contingency table for the partition PA assumed and the one obtained in Series 1 of calculations, PB, with the hierarchical aggregation algorithm and data only for Pb and Cd

Initial, PA:   Categories obtained, PB:
                  1      2      3      4      5
1                 5      0      0      0      0
2                 0      0     47      1      0
3                 0      2      0      0      0
4                 0      1      0      2      1

Table 5.5 Contingency table for the partition PA assumed and the one obtained in Series 1 of calculations, PB, with the DBSCAN algorithm and data for all four elements

Initial, PA:   Categories obtained, PB:
                  1      2      3
1                 5      0      0
2                 0     48      0
3                 2      0      0
4                 0      0      4

It is interesting to note that one of the original categories, namely A3, was not identified by this method in the case considered, being, in fact, aggregated with the original category A1. The results for the second case, when the two variables not considered in the establishment of PA, i.e. Pb and Cd, were accounted for, were identical in terms of clusters to those characterized in Table 5.5.

Series 2 of calculations. This series of experiments was performed with the use of the differential evolution algorithm from the R library. We shall report here only the results for the hierarchical merger algorithm, which fared the best in Series 1, but performed quite differently in Series 2. Namely, as many as 13 clusters were obtained! Actually, side by side with one big cluster, corresponding to the initial category 2, all the remaining clusters contained just one object each, according to the contingency table, Table 5.6. This would confirm (and extend) the treatment, suggested by DBSCAN in Series 1, of the objects not included in the big, dominating cluster as the "outliers".

Table 5.6 Contingency table for the partition PA assumed and the one obtained in Series 2 of calculations, PB, with the hierarchical merger algorithm and data for all four elements


Definitely, upon the visual inspection of Figs. 5.1, 5.2, 5.3 through 5.4, as well as Fig. 5.5, one might have the impression that there is, indeed, rather little premise for any regularity as to potential groups outside of the dominating one.

5.4 Discussion and Conclusions

The case presented here is quite specific, first of all in that the partition PA was not obtained from an "external" source, but resulted from the analysis of the data, in the form of an explicit hypothesis as to what the respective partition might look like (based on two out of the four variables for which the data were available). The results obtained in terms of PB were, actually, beyond expectations. They were obtained for two quite distinct sub-cases, namely:

(I) when reverse clustering was run for the complete set of data, i.e. for all four variables; in this case we can postulate that the partition PA is quite closely associated with the data set X, being, in fact, a direct reflection of a part of it; and

(II) when the reverse clustering was run for the two variables other than those which served to produce the partition PA; in this situation we may rightly postulate that this partition has "nothing to do" with the analysed data set X.

It ought to be very strongly emphasized that in both cases very promising results were obtained, i.e. promising in terms of the ultimate goal of the whole exercise: the determination of the categories of "areas" for the actions related to the contamination of their ecosystems.

It appears that this exercise is not quite "in line" with the inner logic of reverse clustering, since it is quite far from the "classification paradigm". In fact, this case appears to be closer to correlation analysis, factor or principal component analysis, but this, exactly, demonstrates that the reverse clustering paradigm can indeed be applied to a very wide scope of analytic situations.

Another comment, motivated by this particular study, is associated with the interpretation of the results against the background of the full use of the vector Z, especially its components related to the variable choice or weights, and the Minkowski exponent. It can be justly pointed out that such an intervention changes to a very high extent the initial geometry of the given data set. This is definitely true, and ought to be accounted for when a substantive interpretation of the results is formulated. However, given that we do not know, in principle, the origin of PA, we are justified in trying to figure out the various shapes and subspaces in which we might be getting closer to this partition. In any case, we are not introducing any artificial divisions and twists in space (like locally valid distances, e.g. changing from cluster to cluster).


References

Brüggemann, R., Voigt, K., Kaune, A., Pudenz, S., Komoßa, D., Friedrich, J.: Vergleichende ökologische Bewertung von Regionen in Baden-Württemberg. GSF-Bericht 20/98. GSF, Neuherberg (1998)
Brüggemann, R., Mucha, H.J., Bartel, H.G.: Ranking of polluted regions in South West Germany based on a multi-indicator system. MATCH Commun. Math. Comput. Chem. 69, 433–462 (2012)
De Loof, K., De Baets, B., De Meyer, H., Brüggemann, R.: A hitchhiker's guide to poset ranking. Comb. Chem. High Throughput Screening 11, 734–744 (2008)
Owsiński, J.W., Opara, K., Stańczak, J., Kacprzyk, J., Zadrożny, S.: Reverse clustering: an outline for a concept and its use. Toxicol. Environ. Chem. (2017). https://doi.org/10.1080/02772248.2017.1333614
Stańczak, J.: Biologically inspired methods for control of evolutionary algorithms. Control Cybern. 32(2), 411–433 (2003)

Chapter 6

Administrative Units, Part I

6.1 The Background: Polish Administrative Division and the Province of Masovia

This chapter and the next one are devoted to experiments related to data concerning Polish administrative units (some of the early results reported in these two chapters have already been provided in Owsiński et al. 2018). The administrative breakdown of Poland is, in its essence, a three-tier one:
– the whole country is divided into 16 provinces (voivodships), classified, basically for purposes of EU statistics, as NUTS 2 regions,
– the provinces, in turn, are divided into counties (poviats)—altogether 380 such units, formally constituting the local administrative units (LAU) of level 1, which, for purposes of EU statistics, are grouped into NUTS 3 regions, and, finally,
– the counties are divided up into municipalities (or communes, gminas)—altogether some 2,500 of them, referred to in EU statistics as LAU 2.
There are formally also yet smaller units, having some administrative functions, corresponding roughly to villages in the countryside and to quarters in towns, as well as units bigger than provinces, formed entirely "virtually" for statistical purposes only, but neither of these is usually accounted for in socio-economic or other kinds of analyses (although the smallest units, the distinct parts of municipalities, do play an administrative role). Out of the 380 counties, 66 are constituted by single towns—the county-towns—and in this case a municipality is in a way identical with the county. For these towns there also exist the so-called landed counties, i.e. the surrounding area outside the town, divided into corresponding municipalities. Given that Poland is a country of some 38.5 million inhabitants and its surface is 312,000 km², an average municipality has several thousand residents and some 150 km² of surface, meaning that it is equivalent to a circle of roughly 7 km radius.
Yet, of course, municipalities are very strongly differentiated—mainly along the urban–rural axis. They are formally, administratively categorised into three categories: urban, rural, and urban–rural. The last category is composed of two-part municipalities, namely the ones consisting of an urban and a rural part, governed to some extent together. Such municipalities are usually formed when the urban part is constituted by a really very small town.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. J. W. Owsiński et al., Reverse Clustering, Studies in Computational Intelligence 957, https://doi.org/10.1007/978-3-030-69359-6_6

There are—as of the time of this writing—1,533 rural municipalities, 302 urban municipalities and 642 urban–rural ones. The fact that a locality is formally a town is the effect of a political and historical process that is, of course, correlated with the character of the locality, but at the edges of the categories (smallest towns and biggest rural municipalities) one can easily encounter definite inconsistencies. Thus, just to give an example: the population-wise biggest rural municipality has some 25,000 inhabitants, while the smallest formal town has only 330 of them! (Of course, this exceptionally small town is part of an urban–rural commune; the fact that it currently has the status of a town is entirely due to historical reasons.) Hence, we can already see a motivation for a study of the reverse clustering type: the initial partition PA being the one resulting from the formal categorisation of communes into the three categories mentioned, and the data set X appropriately describing the character of the communes analysed. It would certainly be interesting to see how far the potential reconstruction of this PA may go, what the differences between PA and PB are, and what their potential reasons may be. We shall be dealing with this problem in the present chapter. First, however, we shall deal with it at the provincial level, i.e. for the set of municipalities of a single province; second, we shall also deal, at the same level, with a different problem, in which the initial partition is determined by experts in the field. The province for which the experiments reported here were performed is the biggest of the Polish provinces, Masovia, whose capital, Warsaw, is at the same time the capital of Poland.
For purposes of characterisation of the set of objects we shall be dealing with here (the municipalities of Masovia), we bring in Table 6.1, containing the data on these municipalities, presented in the setting established by the partition into functional types of municipalities mentioned above, determined by the experts from the Institute of Geography and Spatial Organization of the Polish Academy of Sciences, mainly for reasons related to planning. Thus, in the light of Table 6.1, we can see that the task related to the reconstruction of the administrative categorisation of municipalities means trying to classify the total of 314 municipal units into three categories, while the other task—to reconstruct the categorisation presented in Table 6.1—would amount to trying to classify these 314 units into 9 categories.

6.2 The Data

Regarding the first of the tasks mentioned, it is obvious that there are no a priori given corresponding data forming the set X, and so it is up to the analyst to conceive the most appropriate characterisation of the municipalities that would possibly allow for their categorisation in the administrative framework, and at the same time constitute


Table 6.1 Functional typology of municipalities of the province of Masovia (data as of 2009)

| Type of municipalities | Name | Number of units | Population (thousand) | Population in towns (%) | Area (km²) | Population density (persons per km²) |
| 1. Core of the national and provincial capital (Warsaw) | MS | 1 | 1,714.4 | 100.0 | 517 | 3,315 |
| 2. Suburban zone of Warsaw | PSI | 27 | 725.1 | 72.9 | 1,297 | 559 |
| 3. Outer suburban zone of Warsaw | PSE | 31 | 393.0 | 33.6 | 2,897 | 136 |
| 4. Cores of the urban areas of subregional centres | MG | 5 | 526.4 | 100.0 | 293 | 1,797 |
| 5. Suburban zones of subregional centres | PG | 20 | 182.8 | 4.5 | 2,236 | 82 |
| 6. County seats | MP | 22 | 433.9 | 82.7 | 1,871 | 232 |
| 7. Intensive development of nonagricultural functions | O | 29 | 241.1 | 24.7 | 3,529 | 68 |
| 8. Intensive development of farming | R | 112 | 615.1 | 4.9 | 13,912 | 44 |
| 9. Extensive development, mainly farming | E | 67 | 390.3 | 4.2 | 9,006 | 43 |
| Totals | | 314 | 5,222.1 | 64.6 | 35,558 | 147 |

Source: Courtesy of Śleszyński and Komornicki (2009), Institute of Geography and Spatial Organization of the Polish Academy of Sciences

a possibly wholesome, even though "shorthand", description of these municipalities. With this in mind, we used the set of features given in Table 6.2. It must be noted that an obvious effort was made to reflect in the set of selected variables such essential characteristics of the communal units as: (a) the degree and the nature of their urban/rural character, (b) their demographic features, and (c) their socio-economic characteristics. In addition, no absolute quantities, like, say, total population, area of a municipality, or sales value, are contained in the vectors x_i, i = 1,…,314, as specified in Table 6.2.


Table 6.2 Variables describing municipalities, accounted for in the study

| No. | Variable | No. | Variable |
| 1 | Population density, persons per km² | 10 | Share of registered employed persons, % |
| 2 | Share of agricultural land, % | 11 | Number of businesses per 1,000 inhabitants |
| 3 | Share of overbuilt areas, % | 12 | Average employment in a business indicator |
| 4 | Share of forests, % | 13 | Share of businesses from manufacturing and construction, % |
| 5 | Share of population in excess of 60 years, % | 14 | Number of pupils and students per 1,000 inhabitants |
| 6 | Share of population below 20 years, % | 15 | Number of students in above-primary schools per 1,000 inhabitants |
| 7 | Birthrate for the last 3 years | 16 | Own revenues of the municipality per capita |
| 8 | Migration balance rate for the last 3 years | 17 | Share of revenue from Personal Income Tax in municipal revenues, % |
| 9 | Average farm acreage indicator, hectares | 18 | Share of expenditures on social care in total expenditures from the municipal budget, % |

Source: Own elaboration

The situation is different with respect to the initial partition PA as characterized in Table 6.1, i.e. the expert-provided partition into functional types of municipalities. This partition was performed on the basis of a definite set of data, according to a well-described analytical procedure. Yet, for purposes of our experiments, we adopted the same data set as presented in Table 6.2. This decision was based on two important premises:
1. The procedure which led to the determination of the functional types from Table 6.1 was not a simple "linear" analytic procedure, based on a unified set of data—it involved a number of decision points, depending upon specific, threshold-like, or nominal values;
2. We wished to preserve as much comparability as possible with the other case analysed here, that of the administrative categorization.


6.3 The Analysis Regarding the Administrative Categorization of Municipalities

We shall provide here the results for some selected experiments, with the distinction of the clustering methods and the evolutionary search methods. And so, the contingency table for the best overall result, attained with the evolutionary algorithm of one of the present authors (Stańczak 2003), is shown in Table 6.3. This result comes, indeed, as striking in its clarity: there is no misspecification between the urban and rural municipalities, while the mixed character of the urban–rural ones definitely calls, in many concrete situations, for an adjustment or correction (involving, though, all three categories). For comparison, but also in order to corroborate the above result, we provide in Table 6.4 the one obtained with the use of the hierarchical aggregation algorithm, similarly striking as to its clarity and facility of interpretation. Finally, let us also quote the same contingency matrix for the DBSCAN algorithm (Table 6.5). Concerning the total number of objects assigned to particular classes in the last case (313 instead of 314), it must be remembered that DBSCAN explicitly classifies some of the objects as "outliers", not necessarily "pushing" them into the classes determined.

Table 6.3 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with own evolutionary algorithm using k-means

| Clusters: obtained, PB → / initial, PA ↓ | 1 | 2 | 3 | Totals |
| 1. Urban municipalities | 33 | 0 | 2 | 35 |
| 2. Rural municipalities | 0 | 217 | 11 | 228 |
| 3. Urban–rural municipalities | 1 | 20 | 30 | 51 |
| Totals | 34 | 237 | 43 | 314 |

Table 6.4 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with own evolutionary algorithm using hierarchical aggregation

| Clusters: obtained, PB → / initial, PA ↓ | 1 | 2 | 3 | Totals |
| 1. Urban municipalities | 34 | 0 | 1 | 35 |
| 2. Rural municipalities | 0 | 216 | 12 | 228 |
| 3. Urban–rural municipalities | 0 | 22 | 29 | 51 |
| Totals | 34 | 238 | 42 | 314 |


Table 6.5 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with own evolutionary algorithm using DBSCAN

| Clusters: obtained, PB → / initial, PA ↓ | 1 | 2 | 3 | Totals |
| 1. Urban municipalities | 35 | 0 | 0 | 35 |
| 2. Rural municipalities | 3 | 216 | 8 | 227 |
| 3. Urban–rural municipalities | 4 | 25 | 22 | 51 |
| Totals | 42 | 241 | 30 | 313 |

Thus, although the results obtained for the own evolutionary method with the k-means algorithm proved to be, in this case, on a par with the result from the hierarchical aggregation algorithm, all three algorithms clearly indicated the vagueness of the "urban–rural" category and the need to treat it in a separate perspective. Just for the sake of comparison, let us quote two results obtained with the differential evolution (DE) algorithm: one for the "pam" (partitioning around medoids) algorithm, from the k-means family, provided in Table 6.6, and another for the hierarchical aggregation "agnes" algorithm, provided in Table 6.7. With respect to the results from the DE algorithm, it must be pointed out that the one obtained with "agnes" had a better value of the Rand and adjusted Rand indices. In any case, these results emphasised once more the uncertain status of the "urban–rural" category of municipalities. This status may also be well illustrated by the diagram in Fig. 6.1, showing the locations of the objects, according to their assignment to the three initial categories,

Table 6.6 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with DE algorithm using "pam"

| Clusters: obtained, PB → / initial, PA ↓ | 1 | 2 | Totals |
| 1. Urban municipalities | 32 | 3 | 35 |
| 2. Rural municipalities | 29 | 199 | 228 |
| 3. Urban–rural municipalities | 19 | 32 | 51 |
| Totals | 80 | 234 | 314 |
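The adjusted Rand index mentioned above can be computed directly from such a contingency matrix. A hedged sketch (the standard Hubert–Arabie chance-corrected formula, in pure Python; not the authors' code), applied here to the counts of Table 6.6:

```python
from math import comb

def adjusted_rand_index(table):
    """Adjusted Rand index from a contingency table (list of rows),
    using the Hubert-Arabie correction for chance agreement."""
    n = sum(sum(row) for row in table)
    sum_cells = sum(comb(v, 2) for row in table for v in row)
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

# Counts as in Table 6.6 (DE with "pam"): rows are PA, columns are PB.
print(adjusted_rand_index([[32, 3], [29, 199], [19, 32]]))  # roughly 0.34
```

A value of 1.0 would mean a perfect reconstruction of PA; values near 0 correspond to chance-level agreement.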

Table 6.7 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with DE algorithm using "agnes"

| Clusters: obtained, PB → / initial, PA ↓ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Totals |
| 1. Urban municipalities | 16 | 5 | 5 | 1 | 6 | 1 | 0 | 1 | 35 |
| 2. Rural municipalities | 0 | 206 | 18 | 0 | 0 | 3 | 1 | 0 | 228 |
| 3. Urban–rural municipalities | 1 | 36 | 12 | 0 | 0 | 2 | 0 | 0 | 51 |
| Totals | 17 | 247 | 35 | 1 | 6 | 6 | 1 | 1 | 314 |


Fig. 6.1 Data on municipalities of the province of Masovia with administrative categorisation into three categories on the plane of the first two principal components (colours refer to the results from Table 6.6). Note: see, e.g., Comrey and Lee (1992)

on the plane of the first two principal components. In this figure, numbers correspond to the initial categorisation into the three administrative categories, while colours correspond to the reverse clustering results reported in Table 6.6. It is particularly well visible how the (majority of the) urban municipalities distinguish themselves from the "cloud" to the right, within which it is definitely difficult to discern the two remaining categories. Another illustration is provided by the map of the province of Masovia, shown in Fig. 6.2, corresponding to the best of the results reported here (see Table 6.3). The red units, assigned to the "urban" category 1, form a broader area in the centre, corresponding to Warsaw and some of its neighbouring municipalities, as well as dispersed "urban islands" throughout the entire province. It is highly telling, though, that the yellow units, corresponding to "something which is neither urban (1) nor rural (2)", do also tend to form compact areas, including very distinct ones in the vicinity of Warsaw. This would imply that, indeed, the approach interprets these units as intermediate, but in a different sense from that of the official administrative breakdown, since an important part of them appears to have a suburban character, while some others correspond to highly productive farming areas. It should be added at this point that all of the communes considered were characterised by their


Fig. 6.2 Map of the province of Masovia with the indication of the municipalities classified in three clusters resulting from the reverse clustering according to the data from Table 6.3. The red area in the middle corresponds to Warsaw and its neighbourhood; the bigger red blobs correspond to subregional centres (Radom, Płock, Siedlce and Mińsk Mazowiecki)

overall features, and no distinction was made between the potential urban and rural parts in the cases of the urban–rural municipalities. It may be interesting at this point to also briefly characterise some of the remaining aspects of the composition found for the vector Z, especially regarding the weights of variables and the Minkowski exponent. Thus, concerning the weights of variables in the calculations performed with the own evolutionary algorithm, it can be stated that they displayed high lability, presumably in view of the numerous high correlations among the variables (different variables possibly representing variable groups). Yet, in the majority of runs the variable weight values could be quite clearly classified into three groups of importance, as shown in Table 6.8 (the weights of all variables add up to 1).

Table 6.8 Examples of variable weights for two runs of calculations, presented in Tables 6.3 and 6.4 (variable numbers refer to Table 6.2; the entries give variable numbers and weight values)

| Calculations illustrated in | Most important variables | Important variables | Unimportant variables |
| Table 6.3 (k-means) | No. 15; 0.178 | No. 3; 0.109 | Remaining variables; 0.000–0.099 |
| Table 6.4 (hierarchical aggregation) | Nos. 1, 14; 0.436, 0.319 | – | Remaining variables; 0.001–0.030 |

The values of the Minkowski exponent ranged generally from 2 upwards to not quite 4.
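The joint role of the variable weights and the Minkowski exponent in Z can be illustrated with a hedged sketch (a generic formulation, not the code used in the experiments): the distance underlying the clustering is a weighted Minkowski distance, in which a near-zero weight effectively removes a variable from the search subspace.

```python
def weighted_minkowski(x, y, weights, p):
    """d(x, y) = (sum_k w_k * |x_k - y_k|**p) ** (1/p)."""
    return sum(w * abs(a - b) ** p for w, a, b in zip(weights, x, y)) ** (1.0 / p)

x, y = (1.0, 10.0), (4.0, 2.0)
# With the second weight at zero, only the first variable matters:
print(weighted_minkowski(x, y, (1.0, 0.0), 2))  # 3.0
# p = 1 gives the (weighted) city-block distance:
print(weighted_minkowski(x, y, (1.0, 1.0), 1))  # 11.0
```

Varying p between the values observed in the runs (roughly 2 to 4) changes the shape of the unit ball, and thus the geometry in which the clusters are sought.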

6.4 A Verification

In the context of the study described in the preceding section, a kind of verification was performed. Namely, the vector of clustering parameters, Z, which was found to perform best in the case of the province of Masovia, was applied directly to the data for another province. Out of the 15 remaining Polish provinces, that of Wielkopolska, with its capital in Poznań, was selected. This choice is justified by the fact that Wielkopolska is also a relatively large province, composed of 226 municipalities featuring quite a diversity of characteristics, with a large agglomeration as its capital. It can be said that what we were looking for in this verification exercise was the comparison of PA(Wielkopolska) and P(X_Wielkopolska, Z_Masovia). The respective result for Z_Masovia, which was established for the k-means algorithm, is shown in Table 6.9. Because both k-means and hierarchical agglomeration fared virtually equally well in the study of Masovia, the results for the clustering of the municipalities of Wielkopolska with the vector Z determined for Masovia for the hierarchical agglomeration algorithm are also shown here, in Table 6.10.

Table 6.9 Contingency matrix for the administrative breakdown of municipalities of the province of Wielkopolska in Poland and clustering performed with the Z vector obtained for Masovia in the case shown in Table 6.3 (k-means algorithm)

| Clusters: obtained, PB → / initial, PA ↓ | 1 | 2 | 3 | Totals |
| 1. Urban municipalities | 18 | 1 | 0 | 19 |
| 2. Rural municipalities | 0 | 95 | 20 | 115 |
| 3. Urban–rural municipalities | 4 | 43 | 45 | 92 |
| Totals | 22 | 139 | 65 | 226 |


Table 6.10 Contingency matrix for the administrative breakdown of municipalities of the province of Wielkopolska in Poland and clustering performed with the Z vector obtained for Masovia in the case shown in Table 6.4 (hierarchical aggregation algorithm)

| Clusters: obtained, PB → / initial, PA ↓ | 1 | 2 | 3 | Totals |
| 1. Urban municipalities | 18 | 1 | 0 | 19 |
| 2. Rural municipalities | 0 | 109 | 6 | 115 |
| 3. Urban–rural municipalities | 0 | 74 | 18 | 92 |
| Totals | 18 | 184 | 24 | 226 |

It can easily be seen that the results are almost as good as for Masovia – the misclassifications between “urban” and “rural” municipalities being indeed very rare. Thus, the clustering procedure, determined through the evolutionary algorithm for Masovia turned out to be well applicable to another, though in general terms similar, data set. Hence, the verification procedure confirmed the potential of the reverse clustering paradigm.
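The verification step, i.e. applying a vector Z found for one data set directly to another one, can be sketched as follows. This is a hedged, purely generic illustration: Z is reduced here to variable weights plus fixed initial centroids, and plain Lloyd's k-means stands in for the clustering algorithms actually used.

```python
def kmeans(points, centroids, iters=50):
    """Plain Lloyd's k-means with given initial centroids; returns labels."""
    for _ in range(iters):
        labels = [min(range(len(centroids)),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        for c in range(len(centroids)):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def apply_Z(points, weights, centroids):
    """Re-scale the variables by weights found elsewhere, then cluster."""
    scaled = [[w * v for w, v in zip(weights, pt)] for pt in points]
    scaled_centroids = [[w * v for w, v in zip(weights, c)] for c in centroids]
    return kmeans(scaled, scaled_centroids)

# A toy "new province": two obvious groups in the first variable, noise in
# the second; the weights (1.0, 0.0), "found" on the other data set, ignore it.
data = [(0.0, 9.0), (0.2, 1.0), (5.0, 3.0), (5.2, 8.0)]
print(apply_Z(data, (1.0, 0.0), [[0.0, 0.0], [5.0, 0.0]]))  # [0, 0, 1, 1]
```

The point of the exercise is that no new search over Z is run for the second data set: the weights (and the rest of Z) are frozen, and only the cluster assignment is recomputed.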

6.5 The Analysis Regarding the Functional Categorization of Municipalities

In this section we shall show the results for the second experiment, in which the functional categories of municipalities, characterised in Table 6.1, were treated as forming PA, and reverse clustering was applied to reconstruct them. Somewhat surprisingly, these results were not as strikingly clear as in the preceding case, even though it could rightly be said that the partition PA was definitely related to the data characterising the municipalities, even if: (a) the data used in designing the typology were somewhat different, (b) the procedure, as mentioned, was not quite straightforward, involving decision points and (actually) nominal variables, and (c) the number of categories to reconstruct was definitely higher. Now, then, Table 6.11 shows the contingency matrix for this case, as obtained for the own evolutionary method and the k-means algorithm, while Table 6.12 contains the same for the hierarchical aggregation algorithm. In this case, again, the k-means algorithm gave better results than the other two algorithms accounted for, also including the identification of the "correct" number of clusters. The comparison with the hierarchical clustering is provided by Table 6.12, where the somewhat more complicated correspondence between the clusters Bq and the categories Aq is also indicated. Notwithstanding the number of "errors" (close to 1/3 of all objects are "misclassified" when hierarchical aggregation is applied), we can speak here of a qualitative reconstruction of the functional typology of municipalities. Yet, even this debatable result—its reasons having been mentioned above—provides quite some light on the issue of "functional typology".


Table 6.11 The contingency matrix for the functional typology of municipalities of Masovia from Table 6.1 and reverse clustering with own evolutionary method using the k-means algorithm

| Categories: obtained, PB → / initial, PA ↓ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Totals |
| 1 (MS) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 (PSI) | 1 | 19 | 1 | 1 | 5 | 0 | 0 | 0 | 0 | 27 |
| 3 (MP) | 0 | 0 | 16 | 2 | 1 | 0 | 3 | 0 | 0 | 22 |
| 4 (MG) | 0 | 0 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 5 |
| 5 (PSE) | 0 | 1 | 2 | 2 | 13 | 11 | 2 | 0 | 0 | 31 |
| 6 (PG) | 0 | 1 | 0 | 0 | 0 | 16 | 1 | 1 | 1 | 20 |
| 7 (O) | 0 | 0 | 1 | 1 | 1 | 8 | 11 | 3 | 4 | 29 |
| 8 (E) | 0 | 0 | 0 | 0 | 0 | 8 | 3 | 49 | 7 | 67 |
| 9 (R) | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 8 | 95 | 112 |
| Totals | 2 | 21 | 20 | 10 | 21 | 52 | 20 | 61 | 107 | 314 |

Table 6.12 The contingency matrix for the functional typology of municipalities of Masovia from Table 6.1 and reverse clustering with own evolutionary method using hierarchical aggregation algorithm

| Categories: obtained, PB → / initial, PA ↓ | 1 (1+2) | 2 (3+4) | 3 (5) | 4 (6) | 5 (7+8) | 6 (9) | Totals |
| 1 (MS) | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 (PSI) | 24 | 1 | 2 | 0 | 0 | 0 | 27 |
| 3 (MP) | 0 | 18 | 1 | 0 | 3 | 0 | 22 |
| 4 (MG) | 0 | 5 | 0 | 0 | 0 | 0 | 5 |
| 5 (PSE) | 5 | 0 | 21 | 1 | 3 | 1 | 31 |
| 6 (PG) | 0 | 0 | 6 | 8 | 4 | 2 | 20 |
| 7 (O) | 0 | 4 | 1 | 0 | 17 | 7 | 29 |
| 8 (E) | 0 | 0 | 0 | 4 | 51 | 12 | 67 |
| 9 (R) | 0 | 0 | 1 | 0 | 16 | 95 | 112 |
| Totals | 30 | 28 | 32 | 13 | 94 | 117 | 314 |

Thus (see Table 6.11), first, some of the initial categories are reconstructed with no, or only quite limited, doubt. These are:
• the core of the provincial (and national) capital, MS, and
• the cores of the subregional centres, MG,
followed by
• the suburban zones of the subregional centres, PG, and, finally,
• the county seats, MP.
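How well each initial category is recovered can be read off a contingency matrix as a per-class "recall": the share of a category's objects falling into its single best-matching cluster. A hedged, illustrative sketch (not the book's code, and using toy counts rather than the full matrix):

```python
def best_match_recall(table, row_labels):
    """For each PA category (row of a contingency matrix), the share of its
    objects captured by its single best-matching PB cluster."""
    return {label: max(row) / sum(row) for label, row in zip(row_labels, table)}

# Toy counts, for illustration only:
toy = [[1, 0, 0],     # a one-unit category, fully recovered
       [0, 4, 1],     # 4 of 5 units kept together
       [0, 16, 6]]    # 16 of 22 units kept together
print(best_match_recall(toy, ["A", "B", "C"]))
```

A recall near 1.0 for a row corresponds to the "no or only quite limited doubt" reading of the categories listed above.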


Table 6.13 The contingency matrix for the functional typology of municipalities of Masovia from Table 6.1 and reverse clustering with DE using "pam" algorithm

| Categories: obtained, PB → / initial, PA ↓ | 1 | 2 | 3 | 4 | Totals |
| 1 (MS) | 1 | 0 | 0 | 0 | 1 |
| 2 (PSI) | 15 | 11 | 1 | 0 | 27 |
| 3 (MP) | 2 | 8 | 11 | 1 | 22 |
| 4 (MG) | 1 | 4 | 0 | 0 | 5 |
| 5 (PSE) | 15 | 2 | 12 | 2 | 31 |
| 6 (PG) | 2 | 0 | 9 | 9 | 20 |
| 7 (O) | 0 | 1 | 21 | 7 | 29 |
| 8 (E) | 2 | 0 | 17 | 48 | 67 |
| 9 (R) | 1 | 0 | 28 | 83 | 112 |
| Totals | 39 | 26 | 99 | 150 | 314 |

With respect to the county seats, MP, which have been "recognised" at a not bad rate, let us mention that if we do not apply any nominal distinctions, these urban centres tend, in an obvious manner, to be intermingled with other urban, urban-like and even the exceptional rural units. Against this background, even the seemingly heavily biased distinctions among the rural units (other than the suburban ones) come out as relatively well-founded, this being particularly true for the ones characterised as featuring intensive farming (R). Regarding the suburban zone of Warsaw, there is, definitely, a doubt as to its actual reach (note that the data set used here does not include any variable describing the actual connections between the units, such as job and school commuting, shopping, etc.). Such doubt is a natural phenomenon, since various criteria can be applied in order to determine the reach of suburban zones. For purposes of comparison, we shall also quote the analogous contingency tables obtained with the DE method for both the k-means-like ("pam") and hierarchical agglomeration ("agnes") algorithms, provided in Tables 6.13 and 6.14. Of particular interest is the result shown in Table 6.14, suggesting an actual return to a three-partite categorisation, namely into urban-like, intermediate and rural units. Yet, altogether, the results obtained with DE were definitely farther away from the imposed PA than those obtained with the own evolutionary algorithm. However, if we refer to the image from Fig. 6.1, the doubts illustrated by Tables 6.11 and 6.12, concerning the potential division among as many as nine groups of municipalities, become quite justified. It definitely appears that the urban–suburban–rural axis¹ is so dominating that any additional division appears to be quite superficial and to a high extent subjective.

¹ This remains valid in spite of the obvious existence of the rural communes featuring high intensity of farming production and mixed types of economy, since these units are mostly neighbouring urban or suburban areas.


Table 6.14 The contingency matrix for the functional typology of municipalities of Masovia from Table 6.1 and reverse clustering with DE using "agnes" algorithm

| Categories: obtained, PB → / initial, PA ↓ | 1 | 2 | 3 | Totals |
| 1 (MS) | 1 | 0 | 0 | 1 |
| 2 (PSI) | 26 | 1 | 0 | 27 |
| 3 (MP) | 11 | 5 | 6 | 22 |
| 4 (MG) | 5 | 0 | 0 | 5 |
| 5 (PSE) | 17 | 10 | 4 | 31 |
| 6 (PG) | 2 | 7 | 11 | 20 |
| 7 (O) | 1 | 11 | 17 | 29 |
| 8 (E) | 2 | 11 | 54 | 67 |
| 9 (R) | 1 | 13 | 98 | 112 |
| Totals | 66 | 58 | 190 | 314 |

Concerning the variable weights and the Minkowski exponent, similar observations can be put forward for this series of experiments as for the administrative division of the set of municipalities, namely: the variable weights distinctly took three kinds of values (the dominating one; the few, mostly 2–3, important ones; and the rest, including those of no significance at all), while the Minkowski distance exponent varied over a broader interval, from roughly 0.5 to well above 2. This, apparently, did not exert any substantial influence on the quality of results, conforming, anyway, to the previously forwarded comments on this subject. We shall end the presentation of the results for this case, i.e. the functional typology of municipalities for the province of Masovia, with two maps of the province, one corresponding to the results characterised in Table 6.11 (own evolutionary method and the k-means algorithm) and the other corresponding to Table 6.12 (own evolutionary method and the hierarchical aggregation algorithm). These two maps constitute Figs. 6.3 and 6.4. In both these figures several highly characteristic features can be observed. The first one is the very pronounced area of influence of Warsaw, reaching well beyond the agglomeration and perhaps even well beyond the functional area (as, for instance, defined in the report constituting the basis for Table 6.1). The second, quite similar, is the distinct appearance of the areas of the subregional cities and their zones of influence (see, especially, the one for the city of Radom in the south of the province). The third one is the emergence, against the background of the rural units (in the map of Fig. 6.3 clearly split into two or three sub-categories), of the compact belts or sequences of municipalities, in some cases associated with transport routes, evidently displaying definite intermediate characteristics (some of the darker green belts in Fig. 6.3 and, in a very pronounced manner, some of the darker blue ones in Fig. 6.4).


Fig. 6.3 Map of Masovia province with the partition PB from Table 6.11

6.6 Conclusions and Discussion

The two separate cases treated in this chapter, but concerning the very same data set X, gave quite different results, in both quantitative and qualitative terms, and, at the same time, interesting ones. First, the attempt at reconstructing the official administrative breakdown into three categories of municipalities for the province of Masovia, the capital province of Poland, indicated a very high degree of agreement with the classification into the "urban" and "rural" categories, while suggesting the necessity of distinguishing, and perhaps classifying differently, numerous cases from the hybrid "urban–rural" category. These results were achieved in spite of the fact that the partition imposed, namely the formal one, had, in principle, formally nothing to do with the socio-economic and spatial data on the municipalities concerned.


Fig. 6.4 Map of Masovia province with the partition PB from Table 6.12

On the other hand, the second case, pertaining to the functional typology of municipalities of the same province, Masovia, in which the initial partition was definitely based on the data for the respective municipalities, turned out to produce, through reverse clustering, partitions which were similar to the initial one merely in a qualitative sense. Yet, this fact could be explained by both the data-related factors and the quite restrictive requirement of dividing the set of objects into as many as nine definite clusters. Still, even in this case, substantive conclusions of importance for the subject matter could be formulated. Regarding the technical aspect, it turned out again that for data sets of limited dimensions and quite clear interpretations of content, the local density algorithm DBSCAN fared much worse than the other two kinds, i.e. k-means and hierarchical agglomeration. Likewise, it also turned out that the selection, or weighting, of variables has an influence on the results obtained, but only in a general sense, that is—a


very clear choice of variable weights was performed on each occasion, but the indication of variables changed very much. In most cases, the variables were weighted in such a manner that three classes of them could be distinguished: (i) (ii)

(iii)

the dominant variable (defining the “axis’ of categorisation); the important variables: two–three variables, less important than the dominant variable, but still exerting quite an influence on the result (modifying variables), and the remaining ones, most of them having, actually, no influence on the results obtained.
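The three weight classes listed above can be read off mechanically from any weight vector produced by a reverse clustering run. A minimal sketch follows; the two numeric cut-off values are purely illustrative assumptions, as the study itself does not fix such thresholds:

```python
def weight_tiers(weights, dominant=0.25, modifying=0.05):
    """Split variable weights into the three classes described above.
    The two cut-off values are illustrative assumptions only."""
    tiers = {"dominant": [], "modifying": [], "negligible": []}
    for name, w in weights.items():
        if w >= dominant:
            tiers["dominant"].append(name)
        elif w >= modifying:
            tiers["modifying"].append(name)
        else:
            tiers["negligible"].append(name)
    return tiers

# A hypothetical weight vector of the kind discussed in the text:
example = {"population": 0.36, "employment": 0.16, "businesses": 0.10,
           "forest share": 0.01, "birthrate": 0.01}
print(weight_tiers(example))
```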

The reverse clustering approach demonstrated, therefore, also in this case its usefulness, yielding telling results with quite interesting substantive implications for the problems considered.

References

Owsiński, J.W., Stańczak, J., Zadrożny, Sł.: Designing the municipality typology for planning purposes: the use of reverse clustering and evolutionary algorithms. In: Daniele, P., Scrimali, L. (eds.) New Trends in Emerging Complex Real Life Problems. ODS, Taormina, Italy, 10–13 Sept 2018. AIRO Springer Series, vol. 1. Springer, Cham (2018)

Śleszyński, P., Komornicki, T.: Courtesy of data and internal report on functional typology of municipalities of the province of Masovia (2009)

Stańczak, J.: Biologically inspired methods for control of evolutionary algorithms. Control Cybern. 32(2), 411–433 (2003)

Comrey, A.L., Lee, H.B.: A First Course in Factor Analysis, 2nd edn. Lawrence Erlbaum Associates, Hillsdale, NJ (1992)

Chapter 7

Administrative Units, Part II

7.1 The Background

In this chapter, we shall be analysing data similar to those considered in the preceding chapter, that is, data on the administrative units in Poland. Yet, our analysis shall focus here on the national level, even though we shall be dealing, as before, with the communes (municipalities). Let us recall that there are some 2,500 municipalities in Poland, these units constituting the elements of the higher administrative level of counties, which, in turn, compose the provinces. In the preceding chapter we analysed the data sets on municipalities at the provincial level; in this chapter we shall be analysing them for the entire country, i.e. some 2,500 units.

Similarly as in one of the two cases considered in the preceding chapter, we shall be looking at an initial partition, PA, that was prepared by the experts from the Institute of Geography and Spatial Organisation of the Polish Academy of Sciences, see Śleszyński and Komornicki (2016). It was actually meant for a purpose similar to that of the typology used in the preceding chapter for the Masovian province. The similarity extends also to the procedure which led to the establishment of this typology. Of foremost importance is the fact that both procedures were based on a definite data set as well as on a sort of branching procedure. This sort of procedure is sometimes the source of data used for quite pragmatic purposes, as schematically exemplified in Fig. 7.1 for the specific case of social care/unemployment benefits. The resulting data sets are, therefore, not so easily amenable to reverse clustering analysis, since the prior partition PA is a reflection not only of the data themselves, but also of some specific decisions referring to definite thresholds and/or nominal values. The result of the procedure leading to the typology mentioned, described in detail in Śleszyński and Komornicki (2016), is shown in Table 7.1.
It can easily be noticed that the categorisation in question closely resembles that of Table 6.1 from the preceding chapter, mainly because of the similarities mentioned above. The primary difference consists in the appearance of some special cases, like those of commune type no. 6 (pronounced transport functions of a commune), but also no. 7 (this type contains municipalities in which activities of significant scale are located, such as opencast mining, on the one hand, and recreation and leisure activities, on the other). The total number of categories is larger than in the previous case by only one, while the categories now have to be allocated over the entire territory of the country.

Fig. 7.1 Two examples of the procedures, leading to the potential prior categorization of the sort similar to the one of interest here

For the purposes of this analysis, the set of variables presented in Table 6.2, all of which were in a way "relative" (such as shares in the area of a municipality or in its population number), was extended with some additional variables: the first two of an absolute character, namely the population number and the overbuilt area of a municipality, and a third one, referring to category no. 6, the share of transport surfaces in the municipality (see Table 7.3 further on). The addition of the two absolute variables turned out to be of essential importance for the results, as we shall see further on. Thus, altogether, in this study, in the majority of calculations, we used 21 instead of 18 variables.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. J. W. Owsiński et al., Reverse Clustering, Studies in Computational Intelligence 957, https://doi.org/10.1007/978-3-030-69359-6_7
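The kind of branching procedure sketched in Fig. 7.1 can be pictured as a cascade of threshold tests; note that such a rule produces a prior partition PA without any clustering of the data, which is exactly what makes such partitions hard to "reconstruct" by clustering. All variable names and cut-off values below are invented for the illustration:

```python
def categorise(commune):
    """A hypothetical threshold cascade of the kind sketched in Fig. 7.1.
    Categories are assigned by successive, hand-set cut-offs, not by any
    clustering of the data; every threshold here is invented."""
    if commune["population"] > 100_000:
        return "urban core"
    if commune["share_transport_area"] > 0.05:
        return "transport function"
    if commune["share_agricultural_land"] > 0.60:
        return "farming function"
    return "other"

print(categorise({"population": 3_000,
                  "share_transport_area": 0.01,
                  "share_agricultural_land": 0.75}))  # farming function
```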

7.2 The Computational Experiments

The calculations followed the path of those presented in the preceding chapter. We shall report on some of these calculations, involving the evolutionary method from Stańczak (2003) and the k-means as well as hierarchical aggregation algorithms. We start the presentation of these results with Table 7.2. It can easily be seen from this contingency matrix that the initial partition, outlined in Table 7.1,1 may only be considered very vaguely, qualitatively reconstructed. Indeed, for most

1 The quantitative content of Table 7.2 in terms of PA does not fully follow that of Table 7.1, a part of the communes being placed in different categories (which is also why we use somewhat different wording for the description of the categories). The difference results from the dating of the respective typological source studies.
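The reverse clustering loop behind these calculations can be sketched in miniature: for each candidate vector Z (here reduced to the variable weights only), cluster the weighted data and score the agreement of the resulting PB with the prior PA. The sketch below is a simplified stand-in (plain Lloyd k-means and the plain Rand index, in pure Python), not the evolutionary method itself:

```python
import random
from math import dist

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd k-means; a stand-in for the clustering algorithms
    actually used in the study."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centres[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

def agreement(p_a, p_b):
    """Plain Rand index: the share of object pairs on which the two
    partitions agree."""
    n = len(p_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    same = sum((p_a[i] == p_a[j]) == (p_b[i] == p_b[j]) for i, j in pairs)
    return same / len(pairs)

def score_candidate(X, p_a, weights, k):
    """One step of the search: apply the candidate variable weights (one
    component of the vector Z), cluster, and compare P_B with the prior P_A."""
    Xw = [[w * x for w, x in zip(weights, row)] for row in X]
    return agreement(p_a, kmeans(Xw, k))

# Toy data: two groups separated on variable 0, pure noise on variable 1.
rng = random.Random(1)
X = [[(i // 10) * 5 + rng.random(), 10 * rng.random()] for i in range(20)]
p_a = [i // 10 for i in range(20)]
# A candidate that weights the informative variable scores at least as
# high as one that only looks at the noise variable:
print(score_candidate(X, p_a, [1.0, 0.0], 2),
      score_candidate(X, p_a, [0.0, 1.0], 2))
```

The search procedure (evolutionary or otherwise) then simply iterates `score_candidate` over many candidate Z vectors and keeps the best-scoring one.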


Table 7.1 Functional typology of Polish municipalities

| Functional types | Number of units | | Population | | Area | | Population density |
|---|---|---|---|---|---|---|---|
| | No. | % | in '000 | % | '000 km2 | % | Persons/km2 |
| 1. Urban functional cores of provincial capitals | 33 | 1.3 | 9 557 | 24.8 | 4.72 | 1.5 | 2 025 |
| 2. Outer zones of urban functional areas of provincial capitals | 266 | 10.7 | 4 625 | 12.0 | 27.87 | 8.9 | 166 |
| 3. Cores of the urban areas of subregional centres | 55 | 2.2 | 4 446 | 11.6 | 3.39 | 1.1 | 1 312 |
| 4. Outer zones of urban areas of subregional centres | 201 | 8.1 | 2 409 | 6.3 | 21.38 | 6.8 | 113 |
| 5. Multifunctional urban centres | 147 | 5.9 | 3 938 | 10.2 | 10.39 | 3.3 | 379 |
| 6. Communes having pronounced transport function | 138 | 5.6 | 1 448 | 3.8 | 20.06 | 6.4 | 72 |
| 7. Communes having pronounced non-agricultural functions | 222 | 9.0 | 1 840 | 4.8 | 33.75 | 10.8 | 55 |
| 8. Communes with intensive farming function | 411 | 16.6 | 2 665 | 6.9 | 55.59 | 17.8 | 48 |
| 9. Communes with moderate farming function | 749 | 30.2 | 5 688 | 14.8 | 93.83 | 30.0 | 61 |
| 10. Communes featuring extensive development | 257 | 10.4 | 1 878 | 4.9 | 41.59 | 13.3 | 45 |
| Totals for Poland | 2 479 | 100 | 38 495 | 100 | 312.59 | 100 | 123 |

Source: Śleszyński and Komornicki (2016)

of the original categories, most of their objects are placed in different clusters than they "should" be. Altogether, only about half of the objects, i.e. municipalities, are placed "correctly". Evidently, the errors are the smallest for the urban units (especially the original categories 3 and 5), but, quite surprisingly, the error is also relatively modest for the rural category 8.

Table 7.2 Contingency table for the proposed functional typology of Polish municipalities and the reverse clustering partition obtained with the own evolutionary method using the k-means algorithm. The rows are the ten types of communes of partition PA: the urban functional areas of provincial capitals (1), the outer zones of fua'sa of provincial capitals (2), the functional urban areas of subregional centres (3), the external zones of fua's of subregional centres (4), the multifunctional urban centres (5), the communes with developed transport functions (6), the communes with other developed non-farming functions, including tourism and large-scale functions such as mining (7), the communes with intensive (8) and moderate (9) farming functions, and the extensively developed communes, with forests or nature protection areas (10). The columns are the twelve clusters B1 to B12 forming partition PB, followed by the row sums and the numbers of misclassified communes per row. Totals: 2,478 communes, 1,272 of them misclassified.
a Fua: functional urban area


The phenomena which can be observed in Table 7.2, and which are worth noting, are:

• the disappearance of category 10, "distributed" mainly among categories 7 and 9 (no surprise, indeed);
• the clear distinction of a singleton category (B11 in Table 7.2), composed only of the capital of the country (no wonder, one might say);
• the appearance of category B12, composed of the functional urban areas (fuas) of the biggest metropolises besides the national capital;
• further, the very specific initial category 6 got almost entirely "washed away", despite the introduction of the relevant variable.

Table 7.3, on the other hand, shows the weights of variables obtained for the same solution. It now becomes obvious why it was important to include the two (here listed first) absolute variables: their joint weight accounts for 45% of the total weight of all 21 variables! While only one variable got a weight explicitly equal to zero, in fact many others became only marginally important (there are altogether ten variables with weights below 0.020). The 12 variables with the lowest weights jointly account for a mere 13% of the total weight. On the other hand, there appears another "factor" of high importance, namely the variables "Registered employment indicator" and "Registered businesses per 1,000 inhabitants", accounting together for more than 25% of the total weight. We shall return to the issue of variable weights and the related implications.

Table 7.3 Variable weights in the solution illustrated in Table 7.2

| Variable | Weight |
|---|---|
| Population | 0.358 |
| Overbuilt area | 0.091 |
| Share of transport-related areas | 0.001 |
| Population density | 0.013 |
| Share of agricultural land | 0.018 |
| Share of overbuilt areas | 0.024 |
| Share of forest areas | 0.009 |
| Share of population over 60 years | 0.000 |
| Share of population below 20 years | 0.019 |
| Birthrate for last 3 years | 0.010 |
| Migration balance for last 3 years | 0.037 |
| Average farm acreage indicator | 0.023 |
| Registered employment indicator | 0.160 |
| Registered businesses per 1 000 inhabitants | 0.096 |
| Employment-based average business scale indicator | 0.018 |
| Share of businesses from manufacturing and construction | 0.024 |
| Number of pupils per 1 000 inhabitants | 0.004 |
| Number of over-primary pupils per 1 000 inhabitants | 0.039 |
| Own revenues of municipality per inhabitant | 0.009 |
| Share of revenues from personal income tax in own communal revenues | 0.037 |
| Share of social care expenses in communal budget | 0.009 |
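One plausible way in which the weights of Table 7.3 enter the computations is through a weighted Minkowski distance between municipality profiles; the exact form used in the study may differ in detail. A minimal sketch, with invented feature values:

```python
def weighted_minkowski(x, y, w, p):
    """d(x, y) = (sum_i w_i * |x_i - y_i|**p) ** (1/p); one plausible form
    in which variable weights enter the clustering distance."""
    return sum(wi * abs(xi - yi) ** p
               for wi, xi, yi in zip(w, x, y)) ** (1.0 / p)

# With the dominant weights from Table 7.3, two communes that differ only in
# a low-weight variable (here: share of forest areas, weight 0.009) end up
# very close to each other; the feature values themselves are made up.
a = [0.9, 0.2, 0.1]            # population, employment, forest share (scaled)
b = [0.9, 0.2, 0.8]
w = [0.358, 0.160, 0.009]      # the corresponding weights from Table 7.3
print(weighted_minkowski(a, b, w, p=2))  # ≈ 0.066
```

With such a distance, the weights effectively decide which variables can "pull" two objects into the same cluster, which is why the weight profile of Table 7.3 shapes the resulting partition so strongly.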


Figure 7.2, showing the map of Poland with an indication of the municipality boundaries, illustrates the scale of agreement/disagreement of the results obtained (like those from Table 7.2) with the initial partition. It is highly interesting to note that the municipalities placed "correctly" and "incorrectly" form, in their vast majority, compact areas rather than a haphazard mosaic. This, especially against the background of the next map, in Fig. 7.3, indicates that the "errors" concerned not so much individual communes as whole subclasses of them, often forming compact territories in space. Such a phenomenon definitely calls for a more in-depth substantive analysis (e.g. oriented at the number of clusters).

Fig. 7.2 Map of Poland with indication of municipalities, which belonged in the solution of Table 7.2 to the “correct” categories from the initial partition and those that belonged to the other ones (“incorrect”)


Fig. 7.3 Map of Poland, showing the partition of the set of Polish municipalities obtained with the own evolutionary method and the k-means algorithm, composed of 12 clusters, corresponding to Table 7.2

The hierarchical aggregation algorithm gave results similar to those of k-means (k-means: 1272 wrong classifications, hierarchical aggregation: 1240 wrong classifications,2 i.e. hierarchical aggregation fared in this case slightly better than k-means) in terms of the similarity between PA and PB. The results obtained with the generalised hierarchical aggregation algorithm are interesting in themselves and qualitatively quite different from those obtained with k-means. They are characterised in Table 7.4. First and foremost, it must be noted that the best number of clusters determined for this algorithm was 5, i.e. just half of the number in PA. The table very strongly suggests a much simpler categorisation of the totality of Polish communes, namely into:

(A) urban cores (including some of the suburban municipalities);

2 If the wrongly placed objects (communes) are counted with respect to the clusters forming the obtained partition PB, which are in two cases aggregates of the original categories, then the number of these "erroneously" assigned communes dwindles to 908, see Table 7.4.


Table 7.4 Contingency table for the proposed functional typology of Polish municipalities and the reverse clustering partition obtained with own evolutionary method using hierarchical aggregation algorithm

| No. | Types of communes in partition PA | Clusters obtained forming partition PB | | | | | Errorsb |
|---|---|---|---|---|---|---|---|
| | | 1 (1,3,5)a | 2 (4,6,7,9)a | 3 (10)a | 4 (2)a | 5 (8)a | |
| 1 | Urban functional areas of provincial capitals | **26** | 1 | 0 | 6 | 0 | 7 |
| 2 | Outer zones of fua's of provincial capitals | 24 | 68 | 14 | **152** | 7 | 113 |
| 3 | Functional urban areas of subregional centres | **46** | 0 | 3 | 6 | 0 | 9 |
| 4 | External zones of fua's of subregional centres | 11 | **85** | 25 | 60 | 20 | 116 |
| 5 | Multifunctional urban centres (other) | **90** | 25 | 3 | 8 | 16 | 52 |
| 6 | Communes with developed transport functions | 6 | **74** | 10 | 8 | 39 | 63 |
| 7 | Communes with other developed non-farming functions (tourism and large-scale functions, including mining) | 8 | **103** | 69 | 4 | 38 | 119 |
| 8 | Communes with intensive farming functions | 0 | 115 | 1 | 6 | **374** | 122 |
| 9 | Communes with moderate farming functions | 9 | **518** | 6 | 25 | 107 | 147 |
| 10 | Extensively developed communes (with forests or nature protection areas) | 2 | 108 | **104** | 3 | 45 | 158 |
| | Sums | 222 | 1097 | 235 | 278 | 646 | 908 |

Total number of communes: 2478

a Hypothetical corresponding clusters from the initial partition; in bold: numbers of municipalities from the hypothetically corresponding clusters in the initial partition
b Calculated with respect to the newly established categories, i.e. Bq

(B) suburban municipalities, primarily those of the larger agglomerations, along with some other ones, most presumably of similar socio-economic characteristics;
(C) rural municipalities around smaller urban centres, together with those featuring not too intensive farming;
(D) rural municipalities with intensive farming; and
(E) the extensively developed rural municipalities, with forests and nature protection areas.

Some other comments on this result will be offered in the discussion section of this chapter. A good illustration of the characteristics of Polish municipalities across the territory of Poland, supporting some of the observations based on Tables 7.3 and 7.4, is provided in Fig. 7.3, presenting the map of Poland produced in the calculation run performed with the k-means algorithm, in this case resulting in the 12 clusters forming the respective PB and characterised before in Table 7.2. Namely, this image shows very clearly two aspects of the set of Polish communes:

I. the natural gradation from urban to rural to peripheral units and areas; and
II. the very distinct difference in spatial character between the north-western and the south-eastern parts of Poland (visual domination of the blue areas in the former and of the green territory in the latter).

This second aspect strongly disturbs the possibility of a "linear" classification of municipalities across the entire territory of Poland. In a very rough statement: the West of Poland is much more urbanised than the East, but it is by no means more densely populated; its settlement system is simply different. Actually, south-eastern Poland is more densely populated than the north-western part, while, at the same time, the shares of forests are much higher in north-western Poland.

7.3 Discussion and Conclusions

The situation from the second series of experiments, illustrated and commented upon in the preceding chapter, was repeated here, in that the solutions obtained were quite far from the initial partition, at least in purely formal terms (the number of misclassified units being around a half). As there, though, the qualitative character of the initial partition was to a significant extent preserved, with some telling exceptions, which could be used, in particular, for drawing substantive conclusions. Quite technically, it turned out again that k-means and hierarchical aggregation outperformed DBSCAN.

It was highly important to obtain the explicit weights of variables in this particular exercise, for these weights indicated the "main direction" along which the initial categories were defined, namely, quite naturally, the "urban-rural" direction (as in some other experiments before, the dominating variables were identified, followed by some complementary ones, and then the group of really unimportant variables, roughly half of them). No wonder, therefore, that the "diverging" clusters (like category no. 6, associated with transport) could be identified, or "reconstructed", with less certainty, or not at all.

Another conclusion of a similar character concerns the fact that, most probably, the population of municipalities of rural character, some 1,500 units or more, does not feature any distinct divisions into clear subcategories, but rather constitutes a continuum in the socio-economic and spatial dimensions. That is why, apparently, although, in distinction from the "diverging" groups commented upon above, this population is definitely somehow distributed along the very general "urban-rural" axis, any partition of this large group has to bear an arbitrary or subjective character, at least to a significant extent (although, again, Owsiński (2012) may be recalled for a potentially "objective" approach to the division of the thus-distributed set of objects).

References

Owsiński, J.W.: On dividing an empirical distribution into optimal segments. SIS (Italian Statistical Society) Scientific Meeting, Rome, June 2012. http://meetings.sis-statistica.org/index.php/sm/sm2012/paper/viewFile/2368/229

Śleszyński, P., Komornicki, T.: Functional classification of Poland's communes (gminas) for the needs of the monitoring of spatial planning (in Polish with English summary). Przegląd Geograficzny 88, 469–488 (2016)

Stańczak, J.: Biologically inspired methods for control of evolutionary algorithms. Control Cybern. 32(2), 411–433 (2003)

Chapter 8

Academic Examples

8.1 Introduction

This chapter, closing the series of presentations of application cases of the reverse clustering paradigm, is devoted to those experiments whose applied side was, for various reasons, actually void. Naturally, the purpose of such experiments was to test the capacities of the approach and the setting we used for it, as well as the role of the particular parameters composing the vector Z. Not many such experiments were carried out, as the ones based on real-life data, with potentially applicable results and conclusions, provided quite ample material for testing the methodology and its technical details. First, a couple of remarks are made concerning the very first tests, performed with the classical Fisher's Iris dataset (see Fisher 1936 and Anderson 1935). Although the data are empirical, they are treated here merely as an academic testbed, mainly because of the ample knowledge of this data set and the broad comparative material. Then, a somewhat more detailed account is provided of a series of artificial data sets, of similarly small dimensions as the Iris data, but featuring other kinds of potential difficulties. In particular, some of these data sets were clearly composed of "nested clusters", i.e. smaller clusters forming bigger ones. It was interesting to attempt determining the parameters of Z for which the different "levels of nested clusters" could be recovered.

8.2 Fisher's Iris Data

Since this data set, due to E. Anderson and R. A. Fisher, is very well known, we shall only briefly report here on the results obtained from the tests of the reverse clustering approach based on it. We speak here, namely, of n = 150 observations, each characterised by m = 4 variables describing the flowers, and the


initial partition, PA, is the one into the three varieties: Iris setosa, Iris virginica and Iris versicolor. The calculations, performed with the differential evolution method (DE) using the "pam" (partitioning around medoids) and the hierarchical aggregation "agnes" algorithms, yielded the results shown in the series of Tables 8.1, 8.2 and 8.3. These experiments were first performed with a rather narrow composition of the vector Z, and it soon became obvious that, in order to obtain better results, a possibly broad selection of parameters and their values is necessary. Thus, Tables 8.2 and 8.3 contain the results for the extended composition of the vector Z. It can easily be seen that in this case the hierarchical aggregation algorithm provided better results than the one belonging to the k-means family. Actually, in the framework of the experiments with the DE method, also the standard fuzzy clustering algorithm "fanny" (from the R package, again, following Kaufman and Rousseeuw 1990) was tried out, yielding, in this particular case, yet somewhat better results. This algorithm, though, was not used in any of the remaining cases commented upon here. As in several other situations reported in this book, the own evolutionary method (Stańczak 2003) fared altogether rather distinctly better than the DE from the R package, the respective results being summarised in Table 8.4.

Table 8.1 The results obtained for the Iris data with the DE method—comparison of "pam" and "agnes" algorithms and two selections of vector Z components (notation as in Table 4.1)

| Algorithm | Optimized parameters | Adjusted Rand index | Rand index |
|---|---|---|---|
| pam | p, h, w1, …, w4 | 0.758 | 0.892 |
| pam | p | 0.730 | 0.880 |
| agnes | p, a, h, w1, …, w4 | 0.922 | 0.966 |
| agnes | p, a | 0.759 | 0.892 |

Table 8.2 Contingency table for the DE method applied to the Iris data with the "pam" algorithm

| Iris varieties | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Iris setosa | 50 | 0 | 0 |
| Iris versicolor | 0 | 48 | 2 |
| Iris virginica | 0 | 14 | 36 |

Table 8.3 Contingency table for the DE method applied to the Iris data with the "agnes" algorithm

| Iris varieties | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Iris setosa | 50 | 0 | 0 |
| Iris versicolor | 0 | 50 | 0 |
| Iris virginica | 0 | 14 | 36 |
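Both indices reported in Table 8.1 follow the standard pair-counting definitions and can be recomputed directly from such a contingency table; as a cross-check, the matrix of Table 8.2 yields a pair of values that indeed appears among the entries of Table 8.1:

```python
from math import comb

def rand_indices(cont):
    """Rand index and adjusted Rand index computed from a contingency table
    (rows: partition P_A, columns: partition P_B) by pair counting."""
    n = sum(map(sum, cont))
    total = comb(n, 2)
    sum_ij = sum(comb(x, 2) for row in cont for x in row)
    sum_a = sum(comb(sum(row), 2) for row in cont)
    sum_b = sum(comb(s, 2) for s in map(sum, zip(*cont)))
    rand = (total + 2 * sum_ij - sum_a - sum_b) / total
    expected = sum_a * sum_b / total
    ari = (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
    return rand, ari

# The contingency table of Table 8.2 ("pam" on the Iris data):
rand, ari = rand_indices([[50, 0, 0], [0, 48, 2], [0, 14, 36]])
print(round(rand, 3), round(ari, 3))  # 0.88 0.73
```

The same function applied to the matrix of Table 8.3 gives approximately 0.892 and 0.759, again values present in Table 8.1.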


Table 8.4 The reverse clustering results for the Iris data obtained with the own evolutionary method using DBSCAN, k-means and hierarchical merger algorithms

| Clustering method/variant | Variable weights | Minkowski exponent | Number of misclassified objects |
|---|---|---|---|
| DBSCAN—1 | 0.369, 0.028, 0.229, 0.967 | 0.95 | 6 |
| DBSCAN—2 | 0.470, 0.027, 0.217, 0.898 | 1.33 | 6 |
| DBSCAN—3 | 0.040, 0.041, 0.076, 0.908 | 0.53 | 5 |
| k-means | 0.052, 0.051, 0.673, 0.224 | 0.42 | 3 |
| Hierarchical aggregation—1 | 0.158, 0.193, 0.439, 0.210 | 3.27 | 2 |
| Hierarchical aggregation—2 | 0.116, 0.174, 0.560, 0.150 | 2.64 | 3 |

The different variants of the DBSCAN and hierarchical aggregation algorithms reflect the attempts, undertaken for this first-treated data set, at specifying the variants of the algorithms and the conditions on parameter combinations that would be used in the further studies. As in the case of DE, the best results were obtained for hierarchical aggregation (although those for k-means are only slightly worse). This is not really surprising, since the hierarchical aggregation algorithms can overcome the limitation, characteristic of the k-means algorithms, of confining clusters to hyperspherical or hyperellipsoidal shapes. In any case, the comparison of these results with those usually obtained by various clustering and classification methods when tested against the Iris data demonstrated that the approach, equipped with the techniques and parameters assumed, can work quite properly in terms of an adequate reconstruction of the initial partition PA. This was one of the important signals for continuing the study reported in this book.
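The shape limitation just mentioned can be illustrated on a toy example. The sketch below implements a naive single-linkage agglomeration in pure Python (one member of the hierarchical family, written out for the illustration only, not the algorithm actually used in the study); it cleanly separates two elongated parallel "bars" of points, a configuration that a centroid-based method such as k-means tends to cut across the long axis instead:

```python
from math import dist

def single_link(points, k):
    """Naive single-linkage agglomeration: repeatedly merge the two clusters
    whose closest members are nearest, until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

# Two parallel elongated "bars": points 0.1 apart within a bar, the bars
# themselves 1.0 apart, so single linkage chains each bar together first.
bar1 = [(0.1 * x, 0.0) for x in range(40)]
bar2 = [(0.1 * x, 1.0) for x in range(40)]
a, b = single_link(bar1 + bar2, 2)
print(sorted({p[1] for p in a}), sorted({p[1] for p in b}))
```

Each recovered cluster contains points of a single bar (y = 0.0 or y = 1.0 only), which no partition into two compact, centroid-centred groups of this elongated cloud would achieve.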

8.3 Artificial Data Sets

The essential character of the artificial data sets, which were the subject of analysis in this series of calculations, is shown in Figs. 8.1 and 8.2. The fundamental issue, as already indicated, was related to the "nested" nature of the respective clusters. Thus, depending upon the "level of perception", or "resolution", in the case of Fig. 8.1 one could speak of four, eight, or even more (say, 15) clusters. It was, then, of interest which, if any, of the "resolution levels" would be reconstructed through reverse clustering, and, if such a reconstruction turned out to be at


Fig. 8.1 An example of the artificial data set with "nested clusters", subject to experiments with reverse clustering (scatter plot "Nested clusters – 2D – three levels", axes x1 and x2, both ranging from 0 to 8)

Fig. 8.2 An example of the artificial data set with "linear broken structure", subject to experiments with reverse clustering (scatter plot "Linear broken structure – 2D", axes x1 and x2, both ranging from 0 to 8)


all possible, whether appropriate tuning of the parameters in the vector Z would allow for the identification of the different, successive "resolution levels". Concerning the "nested" structure, experiments were performed for a number of its variants, differing in the mutual positioning and separation of the "partial clusters" appearing in Fig. 8.1, implying a more or less distinct division of the bigger clusters into the smaller ones. In the extreme case, there would be no visual distinction inside the four "main" clusters. The calculations were performed using the k-means algorithm, and the conclusions can be summarised as follows:

• It turned out to be trivial to obtain the structures required using the k-means algorithm when the steering parameter was simply the number of clusters: the algorithm identified the "proper" clusters without any errors for partitions into 1, 4, 8, 15, 16 and 60 clusters; in addition, once the number of clusters was established, the remaining parameters played either no role whatsoever or only a truly marginal one;
• On the other hand, obtaining the different levels of granulation ("resolution") for a variable (optimised) number of clusters, through the manipulation of the other parameters, turned out to be quite a difficult task; actually, it proved possible to obtain only the partitions into 1, 4 and 60 clusters; it is not excluded, of course, that a much finer mesh of the parameter values used might still yield the other "resolution levels" (e.g. the partitions into 8 and 15 clusters);
• In an obvious manner, the problem with obtaining the "less distinct" (other than those mentioned above) "resolution levels" is also associated with an inherent characteristic of the k-means algorithm, namely the monotone dependence of its implicit objective function (the sum over clusters of the cluster-wise sums of distances of the objects in a cluster to the cluster representative) on the number of clusters, decreasing as the number of clusters grows; this means that only the extreme or very distinct structures can be identified by the algorithm as the "best" ones.

Another kind of example analysed is shown in Fig. 8.2; in this case the essential issue was the degree of separation of the segments of the supposedly "linear" structure. If one refers to the case analysed in the preceding chapter (administrative units, case II), one might easily see the association between the two: the municipalities being roughly distributed, in the universe of socio-economic and spatial characteristics, along the "urban-rural" axis. There, however, the main problem consisted not so much in the gaps between the subgroups (as in Fig. 8.2), but mainly in the (potential, or even only hypothetical) divergence from this main axis, featured by some specific kinds of municipalities. Another interpretation of this kind of data set is oriented at chronological data series and changes of behaviour in the underlying model over time. The results obtained for the test data exemplified in Fig. 8.2 were very similar, in qualitative terms, to those reported and summarised above for the "nested clusters" case. Only the most distinct clusters (four in the case of Fig. 8.2) were identified, along with the extreme ones (the single all-embracing cluster and the set of singletons). This, again, has to be mostly attributed to the
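The monotone dependence of the k-means objective on the number of clusters, noted in the bullets above, is easy to demonstrate: refining a partition can only decrease the total within-cluster sum of squares. A small sketch on invented "nested" data in the spirit of Fig. 8.1:

```python
import random
from math import dist

def within_ss(points, labels):
    """Total within-cluster sum of squared distances to the cluster
    centroids, i.e. the implicit k-means objective mentioned in the text."""
    total = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centroid = [sum(c) / len(members) for c in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total

# Invented "nested" data: four macro-clusters, each made of two tight
# sub-clusters, as in the Fig. 8.1 kind of structure.
rng = random.Random(0)
points, coarse, fine = [], [], []
for cx, cy in [(0, 0), (0, 6), (6, 0), (6, 6)]:
    for sub, (dx, dy) in enumerate([(-1, 0), (1, 0)]):
        for _ in range(20):
            points.append((cx + dx + rng.gauss(0, 0.2),
                           cy + dy + rng.gauss(0, 0.2)))
            coarse.append((cx, cy))
            fine.append((cx, cy, sub))

# Refining a partition never increases the objective: the 8-cluster labelling
# scores at least as well as the 4-cluster one, which in turn beats the
# single all-embracing cluster.
print(within_ss(points, fine)
      <= within_ss(points, coarse)
      <= within_ss(points, [0] * len(points)))  # True
```

This is exactly why the number of clusters cannot be chosen by the k-means objective alone: the criterion always prefers the finer partition, and only very distinct or extreme structures stand out.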


specific features of the k-means algorithm, which provided the best results also for this dataset.

8.4 Conclusions

The analyses discussed here were primarily oriented at the verification of the basic capabilities of the reverse clustering paradigm. We do not report the detailed results for this group of experiments, since they are not of primary interest: the main issue in all of these experiments was to verify the capacity of the methodology engaged in the paradigm (the search procedures, the clustering algorithms, and the sets of parameters optimised) to reconstruct the basic features of the respective initial partitions. In these terms, the results of the tests were altogether positive, with some reservations that ought to be kept in mind when applying the approach. The latter concerned mainly the usefulness of the particular clustering algorithms and their variants, as well as their parameterisations. The conclusions drawn therefrom agree with the experience derived from the other experiments. Thus, in particular, k-means and hierarchical aggregation come out as definitely better than DBSCAN, while the inner limitations of, for instance, k-means become apparent for appropriately constructed test data, which, actually, may correspond to some specific real-life data sets one might deal with in practice.

References

Anderson, E.: The irises of the Gaspé Peninsula. Bull. Am. Iris Soc. 59, 2–5 (1935)

Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugenic. 7(2), 179–188 (1936)

Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)

Stańczak, J.: Biologically inspired methods for control of evolutionary algorithms. Control Cybern. 32(2), 411–433 (2003)

Chapter 9

Summary and Conclusions

We shall now try to sum up the experiences from the study of the reverse clustering paradigm, whose essential results are presented in this book. Conforming to the purpose of this book, announced at the outset, we shall concentrate on the sense and interpretations of the reverse clustering approach, and hence on the potential significance and use of its results, but will also devote some attention to the methodological aspect of the respective procedure. The purely technical, computational side will at the moment be treated rather marginally.

9.1 Interpretation and Use of Results

Thus, the very first conclusion from the experiments reported here is the wide scope of potential and actual interpretations, and therefore uses, of the paradigm. Contrary to what many might think when first confronted with the idea, reverse clustering is not just another approach to the determination of classifiers. This is well illustrated by the cases in which the initial, reference partition PA had a specific relation to the data set X, e.g. was based on a feature not present in X (the traffic data, the environmental contamination data), and not necessarily easily identifiable through clustering. On the other hand, the approach enabled a reasoned analysis of situations in which PA was apparently just a hypothesis, or resulted from a procedure that could not be brought into the framework of clustering (the administrative units). The variety of situations treated, announced in Fig. 3.1, allowed for a quite thorough verification of the capacities of the approach. Although the clarity and strength of the results, in terms of their interpretation, differed quite widely among the experiments, in each case an additional insight into the data set and its structure was gained, in some cases boiling down to a “better partition than PA” or a “general confirmation of PA, with definite reservations”. So, it can be stated that reverse clustering is a new, versatile tool of data analysis, which can be used for a wide variety of problems in which one of the essential aspects is the partitioning of the set of data objects.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
J. W. Owsiński et al., Reverse Clustering, Studies in Computational Intelligence 957, https://doi.org/10.1007/978-3-030-69359-6_9
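The similarity between PA and PB, which the reverse clustering search maximises, was measured chiefly with the Rand index. A minimal sketch, assuming (as an illustrative convention) that partitions are given as equal-length label lists:

```python
from itertools import combinations

def rand_index(pa, pb):
    """Fraction of object pairs on which two partitions agree,
    i.e. the pair is either together in both or apart in both."""
    pairs = list(combinations(range(len(pa)), 2))
    agree = sum((pa[i] == pa[j]) == (pb[i] == pb[j]) for i, j in pairs)
    return agree / len(pairs)

# two hypothetical partitions of six objects
pa = [0, 0, 0, 1, 1, 1]
pb = [0, 0, 1, 1, 1, 1]
```

The index ranges over [0, 1] and reaches 1 only when the two partitions coincide (up to a renaming of the cluster labels).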


We would like to emphasise that in some cases we obtained truly surprising results, regarding both their nature (the relation of the shape and character of partitions PB and PA) and their quality (i.e. either the similarity of PB and PA or the “internal quality” of the clustering itself). This concerns, first of all, the results for the road traffic data, where not only was a better partition PB found, notwithstanding the fact that PA was based on a feature not contained in the data set, but a new category could also be identified, in this case composed of “outliers” or “anomalous patterns”. Similarly, in the case of the environmental contamination data, where no a priori PA was given, the one determined on the basis of a superficial visualisation of the data for some of the variables in X could be quite appropriately reconstructed on the basis of the remaining ones. Thereby, strong confirmation was provided for the possibility of categorising the units analysed according to the levels of heavy metal content in their ecosystems. Moreover, it turned out that the categorisation thus obtained and confirmed is quite simple and effective: only a few categories, having quite obvious interpretations, adequate from the point of view of environmental management.

Another case of surprisingly good results was that of the administrative categorisation of municipalities in Poland. Not only could the three formal categories of municipalities (“urban”, “rural” and “urban-rural”) be quite well reconstructed on the basis of socio-economic and spatial data on these units, but it also turned out that the vector Z, established for one province of Poland (Masovia), produces, when applied as the definition of a clustering procedure, almost as good clustering results for another province (Wielkopolska). This demonstrated that one of the major intended uses of the approach, i.e. the classification of entire data sets different from the original X, is not only possible, but has been verified on far from simple empirical material.

The feasibility of the concept behind the reverse clustering paradigm, as described in the initial chapters of the book, was confirmed by the calculations related to what we called the “academic data”, including, in particular, the famous Iris data set of Fisher and Anderson, commented upon in the preceding Chap. 8. Not only was it possible to reconstruct the imposed initial partitions with acceptable precision, but this verification-oriented application of reverse clustering also allowed for some preliminary assessment of the more technical details of the procedure, first of all concerning the composition of the vector Z. Moreover, positive results (even if not entirely in agreement with visually-based intuitions) were obtained for the case of “nested clusters”, quite specific as to the character of the hypothetical clusters and the related alternative partitions.

Notwithstanding this confirmation of the very concept, some of the cases treated produced results which, in purely quantitative terms, were highly “unsatisfactory”. Definitely, with only about half of the units getting “correctly” classified in the best attainable PB, the results can hardly be referred to as “adequate”. This was the case with several experiments concerning the functional typologies of municipalities in Poland (Chaps. 6 and 7). Yet, the respective contingency matrices for PA and PB demonstrated that in qualitative terms the results were not as “bad”, and were indeed telling in some respects. First, some of the initial categories were relatively


well reconstructed (the respective errors being at the levels of 10–20% rather than 50%). Second, these relatively well reconstructed categories were easily interpreted in quite obvious intuitive terms (“urban cores” of various degrees, for instance). Third, the biggest errors, and, in fact, even the lack of recognition of a definite category, occurred for the initial categories that “by definition” were highly uncertain, very likely constituting segments of a continuum, from which they were separated by the imposition of some thresholds or the use of external criteria.1 This applied, in particular, to such general categories as:

(i) suburban municipalities, forming various concentric areas around bigger agglomerations, with vague assignment to vaguely defined zones (here the results provided very interesting alternatives to the reaches and shapes of the respective “zones of influence” of particular agglomerations), and
(ii) rural municipalities, with different degrees of farming intensity and also different degrees of importance of farming for the local economy, usually meaning mixed local economies (along with different degrees of advancement of such processes as ageing and depopulation).

Another category, which could hardly fit into the framework of reverse clustering and was poorly identified within the PB’s obtained, was constituted by “special kinds” of units, determined on the basis of selected sectors and branches of the economy or other features characteristic of a group of spatial units, such as, for instance:

(iii) municipalities hosting special kinds of economic activities, such as large-scale strip mining and the associated industries;
(iv) municipalities with a high intensity of recreation and leisure activities; or
(v) municipalities in which the function of transport played an important role (junctions, storage and logistic facilities, etc.).

In addition, in some cases such distinctions (“criteria of classification”) were put together in conjunction (like (iii) and (iv) above), which made their reconstruction in the framework of any clustering-based PB very difficult, if possible at all. In the light of these remarks it becomes obvious that for several of the similarly determined categories, distinguished in the initial partitions, a second thought is necessary concerning their definitions, and here the reverse clustering approach may, and actually does, provide quite significant material for consideration, pertaining to the substantive, and not merely “mechanical”, aspect of the particular categorisations.

1 It is, theoretically, possible to divide in a supposedly “optimal” manner a relatively continuous distribution, as this is shown in Owsiński (2012), but this requires applying a special procedure and a specially devised objective function, and is always based on the use of some divergences from the smoothness and continuity of the relevant distribution, which was in this case not feasible.
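The contingency matrices for PA and PB, on which the qualitative reading above rests, are plain cross-tabulations of two label assignments. A minimal sketch, with hypothetical category labels:

```python
def contingency(pa, pb):
    """Cross-tabulate two partitions given as equal-length label lists;
    returns sorted row/column labels and a dict of pair counts."""
    rows = sorted(set(pa))
    cols = sorted(set(pb))
    table = {(r, c): 0 for r in rows for c in cols}
    for a, b in zip(pa, pb):
        table[(a, b)] += 1
    return rows, cols, table

# hypothetical initial categories (PA) vs. obtained clusters (PB)
pa = ["urban", "urban", "rural", "rural", "mixed", "mixed"]
pb = [1, 1, 2, 2, 1, 2]
rows, cols, table = contingency(pa, pb)
```

Reading such a matrix row by row shows which initial categories are concentrated in a single obtained cluster (well reconstructed) and which are spread across several (poorly reconstructed), independently of any single summary index.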


9.2 Some Final Observations

Concerning the technical side of the reverse clustering procedure, we shall offer here some remarks on the roles of particular elements of the vector Z, as resulting from the experiments described in the book. It is quite obvious that the most important parameter in Z is the clustering algorithm, meaning, in our case, the choice between the k-means-type algorithm, the general hierarchical aggregation procedure and the DBSCAN algorithm, the latter as the representative of the local density algorithms. Quite systematically, for the cases considered here, DBSCAN turned out to give the worst results, as measured by the similarity of PA and PB, even if in some cases these results were of some interest in themselves (and in some cases featured quality similar to that of one of the two other kinds of techniques considered). The two other ones, k-means-type and hierarchical aggregation, often gave results of similar quality, although there were cases in which a distinct difference could also be observed.

Along with the choice of the clustering algorithm, the key parameter was, naturally, the number of clusters, p. In many situations, in order to shorten the calculations, this number was fixed, usually being equal to that of PA. Yet, such a limitation very strongly influenced the final solution, which, when p was subject to optimisation, could have been different, with, surprisingly, the resulting PB being more similar to PA. This is the outcome of the interplay, mentioned at the end of Chap. 2, between the data, i.e. PA and X, on the one hand, and the various principles, constraints and parameters characterising the clustering algorithms, referred to through Z, on the other. The number of clusters is an explicit parameter for the k-means and the hierarchical aggregation algorithms, while it is simply an element of the output for DBSCAN.
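As noted above, for DBSCAN the number of clusters is an output rather than a parameter. A simplified 1-D sketch (not the implementation used in the book) illustrates how the cluster count emerges from the neighbourhood radius eps and the minimum neighbourhood size:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN for 1-D data: labels >= 0 are cluster ids, -1 is
    noise. Unlike in k-means, the number of clusters is an output here,
    driven entirely by eps and min_pts."""
    n = len(points)
    labels = [None] * n
    cluster = -1

    def neighbours(i):
        return [j for j in range(n) if abs(points[i] - points[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbours(i)
        if len(nb) < min_pts:
            labels[i] = -1  # noise (may later be absorbed as a border point)
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(nb)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbours(j)
            if len(nb_j) >= min_pts:
                seeds.extend(nb_j)  # j is a core point: keep expanding
    return labels

# the same seven points, clustered at two different radii
data = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0]
```

With a tight radius the seven points fall into two clusters plus one noise point; with a loose radius they merge into a single cluster, without the number of clusters ever being stated explicitly.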
Yet, other algorithmic parameters were also used, like the Lance-Williams coefficients in the case of the hierarchical aggregation algorithms and the local distance/density coefficients in the case of DBSCAN. This aspect was not treated in depth in this book, for the sake of brevity and clarity, and the only comment we shall offer here with respect to these elements of Z is that, while they were also subject to optimisation in most of the cases treated, their influence was usually limited, except for some special situations (e.g. clusters of very specific shapes, densities, etc.). Through a number of specifically oriented experiments it was demonstrated that the use of the complete vector Z, including all its elements listed in Chap. 1, yields better results than the use of only the most important parameters. Thus, in virtually all cases optimisation was applied to the weights of the variables and the Minkowski exponent of the distance definition, along with the other parameters. This proved to be fully justified in some of the reported cases, in which quite a proportion of the variables were neglected or marginalised, and a distinct structure of importance of the variables could be observed (e.g. one or two leading variables, a few less important ones, and the rest of no or barely visible importance). Yet, a high degree of lability was observed both for the actual values of the weights of the variables (even if the structure of weight values, mentioned above, was preserved), and even more so with respect to the Minkowski exponent, often ranging within the


same experiment and under otherwise similar conditions of calculation, over a wide interval of values (e.g. between 0.6 and 3.7 for one of the experiments). This high degree of changeability is due, first, to the low sensitivity of the results, especially in terms of the Rand index values, to some of the parameter values, and also to the substitutability among variables within their definite groups (the highly correlated ones). Computational limitations, related to computation times and facilities, have not allowed us to carry out a more detailed analysis of the respective phenomena, but, all in all, these phenomena were not so important from the point of view of the interpretation of the results obtained. Still, given the conclusion that the use of the complete vector Z yields better results than when only the algorithmic parameters are optimised, this issue definitely requires further study.

As a kind of closing illustration for the latter statement we provide Fig. 9.1, which is related to the study reported in Chap. 6 of the book: the study of the categories of municipalities within a single province. This figure shows the map of the Polish province of Masovia, with the national capital, Warsaw, as its centre, analysed in the reverse clustering experiments performed with the DBSCAN algorithm, the one that fared the worst in this series of experiments; the respective results are briefly characterised in Table 9.1. It can easily be seen from this table that instead of the initial nine functional types the algorithm was capable of producing only five, and ones not easily attributable to the initial types at that, so that even in qualitative terms one finds it difficult to reconstruct the initial partition. The “error” proportions are, indeed, very significant in the case of this algorithm, and one can hardly find the result acceptable, definitely so in quantitative terms, but also in qualitative ones.
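The weighted Minkowski distance, whose variable weights and exponent are among the optimised elements of Z discussed above, can be sketched as follows (one common placement of the weights; the book's exact parameterisation may differ):

```python
def weighted_minkowski(x, y, weights, p):
    """Weighted Minkowski distance between two feature vectors:
    (sum_i w_i * |x_i - y_i|**p) ** (1/p); with p = 2 and equal unit
    weights this reduces to the plain Euclidean distance."""
    return sum(w * abs(a - b) ** p
               for w, a, b in zip(weights, x, y)) ** (1.0 / p)

x = [1.0, 2.0, 3.0]
y = [2.0, 2.0, 5.0]
d_euclid = weighted_minkowski(x, y, [1.0, 1.0, 1.0], 2.0)  # sqrt(1 + 0 + 4)
d_masked = weighted_minkowski(x, y, [1.0, 1.0, 0.0], 2.0)  # third variable ignored
```

A zero weight simply removes a variable from the distance, which is how the optimisation of Z can marginalise unimportant or substitutable variables.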
Table 9.1 Contingency matrix for the typological categorisation of the municipalities of the province of Masovia in Poland obtained with reverse clustering using own evolutionary algorithm and the DBSCAN algorithm (for explanations see Chap. 6)

Initial functional types ↓ \ Categories obtained →   1 (MS, PSI)   2 (MP, MG)   3 (PSE, PG, O, E, R)   4 (?)   5 (?)
1 (MS)          1     0      0    0    0
2 (PSI)        26     0      1    0    0
3 (MP)          5     9      8    0    0
4 (MG)          2     3      0    0    0
5 (PSE)         8     0     13    0    7
6 (PG)          1     0     15    4    0
7 (O)           2     0     27    0    0
8 (E)           3     0     63    0    0
9 (R)           1     0    111    0    0

Fig. 9.1 Map of the province of Masovia showing the municipality types, obtained from the reverse clustering performed with the DBSCAN algorithm, characterised in Table 9.1

Notwithstanding all these shortcomings and the related criticism, the map of Fig. 9.1 clearly shows a distinct and well-justified spatial structure, with the large agglomeration area of Warsaw in the middle, very distinct urban functional zones of the subregional centres, and some complementary municipality types, which,

contrary to what one might think by just looking at Table 9.1, are by no means “outliers”, but rather much narrower classes of units, to which, perhaps, special attention ought to be devoted. Thus, this result, too, provides a valuable insight and cannot be simply rejected on the basis of a poor quantitative index value. The striking clarity and simplicity of this spatial image is, indeed, very telling. In this context two aspects might, or perhaps ought to, be indicated:

• if this result really has a substantive value in itself (and not just as an element of a technical analysis and debate), then what is the role of the initial partition PA and the reverse clustering procedure in obtaining it, as opposed to (or compared with) the potentially straightforward application of a clustering procedure to the respective data on municipalities (or, in fact, any other procedure leading to a categorisation of municipalities)?


[Fig. 9.2 is a flowchart linking: the data set analysed, X; the prior partition of X, i.e. PA; the set of clustering algorithms and data processing parameters, Ω = {Zi}; a candidate Z; the resulting partition of X, PB; the criterion Q(PA, PB), i.e. the similarity of the two partitions; and the search (optimisation) procedure maximising Q(PA, PB), which loops until the stop condition is met, returning the optimal Z*, followed by the assessment of the relation of PB(Z*) to PA, the purpose of producing it, and the quality of the result.]

Fig. 9.2 The meta-scheme of application of the reverse clustering paradigm

• are we capable of reconstructing the (quantitative and qualitative) rationale for this image? This quite fundamental question (closely related to the one above, though of a more qualitative character) calls for some kind of approach that would actually close the superior-level loop (see Fig. 9.2), beyond the one within which we now work with reverse clustering, and which would (more) explicitly answer the question: is PA correct / appropriate / optimal, and if not, what should it be?

Reference

Owsiński, J.W.: On dividing an empirical distribution into optimal segments. SIS (Italian Statistical Society) Scientific Meeting, Rome, June 2012. http://meetings.sis-statistica.org/index.php/sm/sm2012/paper/viewFile/2368/229