Intelligent Systems and Applications in Business and Finance (Studies in Fuzziness and Soft Computing, 415) 3030936988, 9783030936983

This book presents a selection of current research results in the field of intelligent systems and draws attention to th


English Pages 225 [221] Year 2022


Table of contents:
Preface
Contents
Attitude-Based Multi-expert Evaluation of Design
1 Introduction
2 Semantic Differential
3 Interval-Valued Semantic Differential
4 Evaluation Using the Interval-Valued Semantic Differential, Consensus of Evaluations
5 Conclusion
References
An Investigation of Hidden Shared Linkages Among Perceived Causal Relationships in Cognitive Maps
1 Introduction
2 Data, Theoretical Aspects, and Methodology
2.1 Data Sample
2.2 Cognitive Maps
2.3 Inter-causal Relationships
2.4 Set-Theoretic Consistency and Coverage Measures
2.5 Hypothesis Models
2.6 Research Process
3 Results and Discussion
3.1 Inter-causal Relationships Under H1
3.2 Inter-causal Relationships Under H2
4 Conclusion and Future Directions
References
A New Framework for Multiple-Criteria Decision Making: The Baseline Approach
1 Introduction
2 Definition of a Standard MCDM Problem
3 Definition of the New MCDM Problem Addressed in This Paper—Decision Making Under the Existence of a Baseline
3.1 The Basic Difference Between the Standard Formulation of the MCDM Problem and the Newly Proposed Generalized Formulation
3.2 Assumptions and Notation for the Newly Proposed MCDM Problem Formulation
4 Decision-Making Styles Available with the New MCDM Problem Formulation
4.1 ``Appreciating Possessed'' Decision-Making Style
4.2 ``Craving Unavailable'' Decision-Making Style
4.3 A Combined Style—``Appreciating Possessed'' in the Baseline and ``Craving Unavailable'' in Alternatives
5 Behavioral Effects in the New MCDM Problem Formulation When the Combined Style of Criteria Weights Determination is Used
5.1 Cost of Switching Effect
5.2 (Un)Satisfaction Effect
5.3 Constant Switching Effect
5.4 New Feature Effect
6 Conclusions
References
Fuzzy Similarity and Entropy (FSAE) Feature Selection Revisited by Using Intra-class Entropy and a Normalized Scaling Factor
1 Introduction
2 Methods
2.1 Class-Wise Fuzzy Similarity and Entropy (C-FSAE) Feature Selection
2.2 Class-Wise Fuzzy Entropy and Similarity (C-FES) Feature Selection
3 Data
3.1 Artificial Example Data Sets
3.2 Real-World Data Sets
4 Training Procedure
5 Results
6 Conclusion
References
A Region-Based Approach for Missing Value Imputation of Cooling Technologies for the Global Thermal Power Plant Fleet Using a Decision Tree Classifier
1 Introduction
2 Data and Methodology
2.1 Power Plant Data
2.2 Decision Tree Classifier
2.3 Different Approaches to Account for the Geographical Location of Power Plants
2.4 Overcoming the Problem of Data Scarcity
2.5 Validation of Results
3 Results
3.1 Country Model Versus Minor Region Model
3.2 Performance of Hybrid Model
4 Conclusion
References
A Neural Network Based Multi-class Trading Strategy for the S&P 500 Index
1 Introduction
2 Data
2.1 Data Gathering and Feature Engineering
2.2 Missing Values
2.3 Target Variables
3 Feature Selection
3.1 Information Gain
3.2 Pearson Correlation
3.3 Random Forest Feature Importance
3.4 Feature Importance Results
4 Model Performance Evaluation
4.1 Artificial Neural Networks
4.2 Model Evaluation
4.3 Model Selection
5 Results
5.1 Trading Strategies and Threshold
5.2 Performance
6 Conclusion
Appendix
References
Predicting Short-Term Traffic Speed and Speed Drops in the Urban Area of a Medium-Sized European City—A Traffic Control and Decision Support Perspective
1 Introduction
2 A Brief Overview of Previous Studies
3 Data Set and Applied Methods
3.1 Speed Prediction in Helsinki
3.2 Models Used in the Analysis
4 Results for Traffic Speed Prediction and Their Discussion
4.1 ARMA Models for Traffic Speed Prediction
4.2 Linear Regression Models for Traffic Speed Prediction
4.3 K-Nearest Neighbors Method for the Prediction of Traffic Speed
4.4 XGBoost for the Prediction of Traffic Speed
4.5 Comparison of the Performance of ARMA, Linear Regression, KNN and XGBoost Models for Traffic Speed Prediction
5 Speed Drop Prediction
5.1 Speed Drop Prediction Capabilities by the Analysed Models
5.2 A Decision Tree Based Prediction of Traffic Jams
6 Conclusions
References
Hedging Effectiveness of Currency ETFs Against WTI Crude Oil Price Fluctuations
1 Introduction
2 Review of the Literature
3 Methodology
3.1 ARMA-GARCH Modelling
3.2 The Copulas
3.3 Hypotheses and Testing
4 Results
4.1 Descriptive Statistics
4.2 Margins Modeling with EGARCH
4.3 Copula Modeling Results
5 Conclusion
References

Studies in Fuzziness and Soft Computing

Pasi Luukka Jan Stoklasa   Editors

Intelligent Systems and Applications in Business and Finance

Studies in Fuzziness and Soft Computing Volume 415

Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Fuzziness and Soft Computing” contains publications on various topics in the area of soft computing, which include fuzzy sets, rough sets, neural networks, evolutionary computation, probabilistic and evidential reasoning, multi-valued logic, and related fields. The publications within “Studies in Fuzziness and Soft Computing” are primarily monographs and edited volumes. They cover significant recent developments in the field, both of a foundational and applicable character. An important feature of the series is its short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at https://link.springer.com/bookseries/2941

Pasi Luukka · Jan Stoklasa Editors

Intelligent Systems and Applications in Business and Finance

Editors
Pasi Luukka, Lappeenranta, Finland
Jan Stoklasa, Lappeenranta, Finland

ISSN 1434-9922  ISSN 1860-0808 (electronic)
Studies in Fuzziness and Soft Computing
ISBN 978-3-030-93698-3  ISBN 978-3-030-93699-0 (eBook)
https://doi.org/10.1007/978-3-030-93699-0

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This book presents various intelligent systems and applications in the field of business and finance. It summarizes the results of recent research in intelligent systems and presents novel methods and tools for analytics in business, industry and finance.

The first chapter concentrates on the issue of multi-expert evaluation under the presence of uncertainty. It aims at multiple-criteria evaluation problems where less tangible criteria or more subjective assessments can be expected and where multiple evaluators need to be used. It proposes an attitude-based multi-expert evaluation tool to be used in the management of the design process. Evaluation is a crucial step in any design, construction or engineering process, and it is particularly needed for its efficient management. The authors propose an adaptation of the interval-valued semantic differential method for the purpose of design evaluation under the presence of less tangible criteria (incl. emotions, attitudes, etc.) and define different types of consensus for multi-expert evaluation. As such, the chapter introduces the concepts of strong and weak consensus on a general or criterion-specific level and provides means for the assessment of consensus of uncertain multidimensional expert evaluations.

The second chapter introduces a methodology for the analysis of structures and concepts that can be represented by causal relationships in cognitive maps. It focuses on the analysis of the co-existence of specific perceived causal relationships and as such provides a fresh view on the information contained in cognitive maps using the set-theoretic concepts of consistency and coverage. The authors show how to extract and examine the inter-causal relationships from the cognitive maps, that is, the relationships of cause and effect in the co-existence of pairs of causal relationships in cognitive maps of a chosen group of decision-makers.
This allows for a more complex interpretation of the information contained in the structure of cognitive maps, compared to methods that deal with the causal relationships separately. The applicability of the framework is shown in the context of strategic decision-making.

The third chapter continues the investigation of decision-making problems. It goes back to the very definition of the multiple-criteria decision-making problem and provides a fresh perspective on its formulation. The "standard" MCDM problem treating all the alternatives in the same way is generalized by assigning a specific role to one specific alternative from the pool of alternatives. This so-called baseline can be treated differently from the other alternatives and can be understood as a reference solution for the problem. Based on this idea, the authors allow for different weight-formation mechanisms to be used for the baseline and for the other alternatives and discuss the reasoning behind such generalization. The proposed generalized formulation of the multiple-criteria decision-making problem and the allowed variability in the determination of the weights for the baseline and other alternatives result in more customizability in the modelling of real-life decision-making. It also allows for the modelling of several decision-making (behavioral) biases that have so far been treated as deviations from the normative predictions of the standard models. The chapter thus proposes a potentially unifying modelling framework for many standard decision-making problems and many so-called decision-making biases. The ability to model and predict a decision bias can result in the re-labeling of these biases as "standard outcomes of decision-making", as they no longer need to deviate from the prediction obtained by the model.

The first chapters provide new methods, ideas and models for multiple-criteria and multi-expert evaluation and decision-making under the presence of uncertainty, under complex relationships and under the presence of potential behavioral biases. These can find their applicability in many decision-making and expert evaluation problems and are definitely relevant in business and finance, where the human factor plays a key role. The remaining chapters adopt a more data-driven perspective and deal with the applications and development of machine learning methods in various contexts.
The fourth chapter proposes a new type of feature selection method that generalizes the fuzzy similarity and entropy-based filter for feature selection by taking into consideration intra-class entropy and similarity together with a normalized scaling factor. The authors also generalize the fuzzy entropy and similarity filter in a similar way. The methods obtained by the proposed generalization show competitive performance on several chosen datasets and provide an interesting addition to the feature selection toolbox of filter methods for supervised feature selection. The following chapters focus less on the design of new machine learning tools and rather deal with a specific problem in a specific context, providing machine learning-based solutions to these problems.

The fifth chapter introduces a hybrid decision tree-based classification model for the prediction of the cooling technology of individual power plants globally. Correct information on the current water demand of individual power plants is crucial for the planning of future energy systems, particularly with respect to the limited natural resources needed for these plants, such as water. The limited availability of data on the type of cooling technology for power plants in different regions thus poses a significant planning problem and may affect the sustainability of the proposed solutions and/or threaten whole ecosystems in the given area. The authors propose a hybrid decision tree-based classification model to predict the cooling technology, one that yields the highest accuracy compared to other existing approaches and offers a solution to the lack of data. The model is built as a general one that can be applied to different datasets and used in similar contexts. It shows the ability of machine learning approaches to contribute to sustainability-linked goals and the related long-term planning.

The sixth chapter applies neural networks in a multi-class trading setting, analyzing the viability and performance of different trading strategies applied to the S&P 500 index. The authors clearly show how a well-fitted machine learning method is able to deal with the COVID crash in 2020. The authors extensively study the effect of the transition from two- to three-class signals within trading strategies and show that the three-class approach outperforms the binary-class approaches on the chosen data.

The seventh chapter utilizes several machine learning methods to predict short-term traffic speed and speed drops in the urban area of a medium-sized European city, Helsinki, for the first time. In this chapter, a traffic control and decision support perspective is taken into account. The authors set the goal of analyzing simple time series prediction models and their performance in traffic speed and speed drop prediction. The results suggest that decision trees applied to the prediction of traffic jams (speed drops below a given threshold) might provide useful results. On the other hand, the ability of the analyzed models to predict sudden speed drops accurately is concluded to be questionable, as the size of the sudden drops is systematically underestimated.

The eighth and final chapter studies the problem of hedging WTI crude oil fluctuations using several different currency ETFs by applying intelligent time series modelling systems. Ten currency ETFs are analyzed and an ARMA-EGARCH model is utilized to obtain margins for static and time-varying copulas for the examination of static and time-varying dependence between the currency ETFs and WTI crude oil. Several ETFs were identified as capable of providing a safe-haven solution during extreme movements of oil prices. A positive correlation between the volatility of several ETFs and the volatility of the oil prices is also reported.

Lappeenranta, Finland

Pasi Luukka
Jan Stoklasa

Contents

Attitude-Based Multi-expert Evaluation of Design . . . . . . . . . . 1
Jana Stoklasová, Tomáš Talášek, and Jan Stoklasa

An Investigation of Hidden Shared Linkages Among Perceived Causal Relationships in Cognitive Maps . . . . . . . . . . 17
Mahinda Mailagaha Kumbure, Pasi Luukka, Anssi Tarkiainen, Jan Stoklasa, and Ari Jantunen

A New Framework for Multiple-Criteria Decision Making: The Baseline Approach . . . . . . . . . . 37
Jan Stoklasa and Mariia Kozlova

Fuzzy Similarity and Entropy (FSAE) Feature Selection Revisited by Using Intra-class Entropy and a Normalized Scaling Factor . . . . . . . . . . 61
Christoph Lohrmann and Pasi Luukka

A Region-Based Approach for Missing Value Imputation of Cooling Technologies for the Global Thermal Power Plant Fleet Using a Decision Tree Classifier . . . . . . . . . . 93
Alena Lohrmann, Christoph Lohrmann, and Pasi Luukka

A Neural Network Based Multi-class Trading Strategy for the S&P 500 Index . . . . . . . . . . 127
Leo Soukko, Christoph Lohrmann, and Pasi Luukka

Predicting Short-Term Traffic Speed and Speed Drops in the Urban Area of a Medium-Sized European City—A Traffic Control and Decision Support Perspective . . . . . . . . . . 163
Teemu Mankinen, Jan Stoklasa, and Pasi Luukka

Hedging Effectiveness of Currency ETFs Against WTI Crude Oil Price Fluctuations . . . . . . . . . . 189
Muhammad Naeem and Sheraz Ahmed

Attitude-Based Multi-expert Evaluation of Design Jana Stoklasová , Tomáš Talášek , and Jan Stoklasa

Abstract Evaluation is an important part of any design, construction or engineering process. The selection of the most appropriate form can prove crucial for the success of the product or solution that is being designed or developed. The selection of criteria, their measurability and the ability to estimate future values of some criteria can influence the result of the evaluation, as well as the definition of the overall goal and the selection of evaluators. Apart from measurable criteria, emotions and attitudes can also play an important role in the success of the final solution. This paper suggests an adaptation of the interval-valued semantic differential method proposed by Stoklasa et al. in 2019 for the purpose of design evaluation. We discuss the process of design evaluation using this tool and suggest its use in a multi-expert evaluation setting. We propose a definition of consensus in this context, discuss the issue of the ease of achieving consensus, and suggest an approach that can provide information on the level of agreement of the evaluators in terms of the evaluation expressed as an object in the semantic space. We also suggest a methodology for the selection of the best design based on the aggregated multi-expert evaluations. Keywords Design · Evaluation · TRIZ · Semantic differential · Interval-valued · Semantic design · Consensus

J. Stoklasová (B) · J. Stoklasa
School of Business and Management, LUT University, Yliopistonkatu 34, 53850 Lappeenranta, Finland
e-mail: [email protected]

J. Stoklasa
e-mail: [email protected]

T. Talášek
Department of Mathematics, Faculty of Education, Palacký University Olomouc, Žižkovo nám. 5, 771 40 Olomouc, Czech Republic
e-mail: [email protected]

J. Stoklasa
Faculty of Arts, Department of Economic and Managerial Studies, Palacký University Olomouc, Křížkovského 8, 771 47 Olomouc, Czech Republic

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
P. Luukka and J. Stoklasa (eds.), Intelligent Systems and Applications in Business and Finance, Studies in Fuzziness and Soft Computing 415, https://doi.org/10.1007/978-3-030-93699-0_1


1 Introduction

The evaluation of design is an integral part of the design process. It allows for the selection of appropriate solutions, the identification of desirable and undesirable characteristics, and also for the efficient management of the design process. The criteria to be used for the evaluation of design need to be selected in accordance with the desired functionality. Obviously, different perspectives require different criteria and also consider different criteria to be the most important. Kozlova et al. [15] discuss the economic aspects of the complex evaluation of design alternatives. Hsu et al. [10] provide evidence that, using the same criteria, the evaluations of a group of designers and a group of users of the product differ. But the user/designer perspective is not the only one to be considered. We can also differentiate, for example, among the economic, functionality and appearance perspectives; we can also find measurable (objective) criteria, for which the desired values (or ranges of values) can even be specified in the technical specification (TS), and less tangible criteria concerning emotions, attitudes, etc. The methods used for the evaluation of design include classic multiple-criteria evaluation methods such as AHP [1, 24], VIKOR [32, 36], TOPSIS [6] and many others and their combinations. Crisp methods (i.e. methods not dealing with uncertainty) were frequently applied. However, alternative approaches better suited for the representation of the uncertainty that is frequently present when design alternatives are evaluated by laymen (users) have been utilized as well. Fuzzy-set based methods [1, 18] were applied in the evaluation of design, but as Zhang et al. [35] point out, the requirement of correct specification of the membership functions of the fuzzy sets that are being used might be rather restrictive for inexperienced users of fuzzy sets.
Vague-set based methods [8] and methods based on the rough set theory framework [32] can also be found in the context of the evaluation of design alternatives in the scientific literature, the latter category bypassing the necessity of defining membership functions. All these approaches are usually based on the criteria included (or reflected) in the technical specification. These tangible aspects present in the TS can be considered to represent the "minimum requirements" on the solution, but they usually do not constitute a comprehensive set of evaluation criteria, since other (less tangible) aspects can also be important for the users of the solution [22]. Quantitative evaluation of the design based on TS criteria is therefore sometimes carried out separately from the qualitative assessment based on "softer" additional criteria (e.g. [22] provides an example from the machine tool design context, [35] presents a rough-set DEMATEL design alternative evaluation approach). The issue of design evaluation has thus also been approached in a way that takes into account perceptions, attitudes and emotions. Among other approaches reflecting the soft aspects in the design process, product semantics, focusing more on the information conveyed by the design [22], and the Kansei engineering approach [12, 23] have recently been adopted in the field of product design. More structured approaches, such as TRIZ, also seem to be attracting the attention of academics [5, 19]. Regardless of the underlying ideas and


mechanisms, all these approaches can benefit from a functional design alternative evaluation framework focusing not only on the TS aspects, but also on the "softer" aspects of the designed solution. The methods focusing on the less tangible "soft" aspects of design evaluation frequently use tools originating in psychology and related fields. These tools are suitable for obtaining soft evaluation data and are considered easy to use and interpret. One of the frequently used methods to obtain evaluations while reflecting the emotional or connotative aspects of the design as perceived by its users (or other evaluators) is the semantic differential—a method relying on simple bipolar scales to obtain the "measurement" or assessment of a given object by the respondent. In this paper we focus on the use of this method, or, to be precise, of its generalization by Stoklasa et al. [31], which allows for interval-type evaluations, in the design evaluation setting. Osgood et al. proposed the method of semantic differentiation in [26] to identify the connotative meaning or attitudes of individuals towards given objects, and it has been frequently used in research concerning attitudes in social psychology and related fields. Shortly after its introduction, it was adopted also in the fields of economics and marketing [21, 27], sociology [2] and public opinion measurement [4]. The use of (and research on) the semantic differential continues in psychology and the social sciences to this day, see e.g. [3, 13, 20, 28]. Recently, semantic differentiation has been applied even in the theory-building framework in psychology in the context of reward assessment [7]. Currently, the use of the semantic differential has also spread outside the social sciences and humanities into the fields of information systems [33] and machine design [22].
The semantic differential and the concept of distance in the semantic space have found their way even to the fuzzy modelling domain [25] as a basis for the definition of a formal language for fuzzy linguistic reasoning. The semantic differential is also no stranger to the product design and design evaluation field. Hsiao and Chen [9] propose a product design method implemented in the CAD environment, where continuous semantic-differential scales are utilized to assess the contribution of the identified design components to the impression of the final product (an office chair). You and Chen [34] present a three-dimensional model (affordance, perceptual information and symbol) in the interaction design context and discuss the connections between their suggested model and the semantic interpretation of products. Semantic differential scales are also applied in [11] in connection with Kansei adjectives to find out the links between consumers' Kansei needs and products in emotional design. Lin et al. [17] apply semantic differentiation along with other methods to study the semantic meaning of products. There seems to be an inclination towards the use of the semantic differential in the scientific literature concerning design and design evaluation. However, the benefits of the simplicity of the tool as suggested by Osgood et al. [26] seem to outweigh some of the well-known issues connected with the use of the standard semantic differential, such as the concept-scale interaction [33] and possible low item-relevance problems. This is even more surprising in connection with emotional design [11], Kansei engineering [11, 12, 23] and other semantic-based design approaches, since, as Zhang et al. [35] point out, there is usually a lack of precision and the assessments/evaluations


Fig. 1 A representation of the semantic space as defined by Osgood et al. [26]. The position of an object Obj is represented by its coordinates $(E_{Obj}, P_{Obj}, A_{Obj})$ in the three-dimensional Cartesian space defined by the evaluation factor (good-bad), the potency factor (strong-weak) and the activity factor (active-passive)

of decision makers are uncertain when design is to be evaluated, particularly in the early stages of design. Still, the fuzzy approach, which is well equipped for dealing with uncertainty, is discouraged by these authors [35] due to the problems with the definitions of the membership functions of the fuzzy sets used in the methods. Stoklasa et al. [31] recently introduced an interval-valued modification of the semantic differential that allows for the reflection of uncertainty in the semantic differential simply by including a second set of scales for the assessment of scale relevance for the given purpose, and suggested its possible use in the basic-emotion based semantic differential method [11, 30]. In this chapter we first recall the original method of semantic differentiation proposed by Osgood et al. [26] and the generalization of the semantic differential proposed by Stoklasa et al. [31]. In the following section we adopt the interval-valued semantic differential approach and discuss its use in the context of multi-expert evaluation of design. We suggest definitions of existing consensus and possible consensus in terms of interval-valued evaluations represented by "boxes of uncertainty" in the semantic space. We discuss the issue of multi-expert evaluation of design and the implications and possible gains of the use of the interval-valued semantic differential in design evaluation and in the design process using various design-support methods.

2 Semantic Differential

The semantic differential was introduced in [26] as a tool for the measurement of connotative meaning and quickly found its place in the social-psychological library of methods for the assessment of attitudes. It obtains input data through a simple


method of Likert-type questionnaire. Unlike in the Likert setting (where the degree of agreement/disagreement with a given statement is usually expressed, see e.g. [16, 29]), the endpoints of the scales are represented by bipolar-adjective pairs in semantic differentiation. The decision-maker then selects where on the continuum between the two adjectives of the given scale his/her assessment of the given object (e.g. a design alternative) lies. Both discrete and continuous scales can be used. Without any loss of generality, we will consider continuous scales such that their numerical values form a $[-r, r]$ interval symmetrical around 0, $r \in \mathbb{R}$. In the original paper by Osgood et al. [26], seven-point discrete numerical scales with values $\{1, 2, \ldots, 6, 7\}$ were used. The avoidance of the value 0 as a neutral element and of negative values of the scale is a result of a negative "connotation" of negative numerical values. Mathematically speaking, the scale can be converted into a discrete 7-point scale symmetrical around zero by a simple linear transformation resulting in $\{-3, -2, -1, 0, 1, 2, 3\}$. The method of semantic differentiation applies a specified number ($n$) of bipolar-adjective scales $s_1, \ldots, s_n$ for the assessment of the given object. Based on factor analysis, the most important factors can be identified and the factor loadings of the scales for these identified factors can be computed. Figure 1 summarizes the results obtained by Osgood et al. [26], i.e. the three identified significant factors: evaluation (E), potency (P) and activity (A). These factors account for most of the variability and define the three-dimensional Cartesian semantic space. For each of the factors we can calculate the respective coordinate of the object Obj in the given dimension (factor) using the factor loadings and the scale values provided by the respondent.
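The linear transformation mentioned above can be sketched in a few lines of Python. This is an illustrative snippet, not part of the chapter; the function name `to_symmetric` is ours:

```python
# Illustrative sketch (not from the chapter): mapping a rating from the
# nonnegative universe {1, ..., 2r+1} used by Osgood et al. to the
# zero-symmetric universe {-r, ..., r} via x -> x - (r + 1).

def to_symmetric(x: int, r: int = 3) -> int:
    """Map a rating from {1, ..., 2r+1} to {-r, ..., r}."""
    if not 1 <= x <= 2 * r + 1:
        raise ValueError("rating outside the scale universe")
    return x - (r + 1)

# The original 7-point scale {1, ..., 7} becomes {-3, ..., 3}:
print([to_symmetric(x) for x in range(1, 8)])  # -> [-3, -2, -1, 0, 1, 2, 3]
```

The same shift works for any odd number of scale points once r is set accordingly (e.g. r = 2 for a 5-point scale).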
The research in design evaluation using the crisp version of the semantic differential (as introduced in [26]) has confirmed the existence of three factors—e.g. Lin et al. [17] identified design creativity, practical/decorative tendency and color worth as the factors determining the cognition of product semantics, and You and Chen [34] work with symbol, affordance and perceptual information as three design dimensions. Hsu et al. [10] study the perception of design by designers and users of the design and conclude that evaluation, shape and activity factors are the most important ones for designers, while evaluation, time and shape/activity factors are considered the most relevant ones by the users of the design. For designers, the three factors account for 90.4% of the total variance, and for design-users the respective three factors account for 85% of the total variance. Both for designers and for the users of the design, the evaluation factor covers most of the total variance (over 55% in both groups). The use of three factors therefore seems to be justified in the context of design evaluation. Also note that the semantic space seems to carry the evaluation information stored in one of its dimensions (E) and provides two additional dimensions to be used to differentiate among design alternatives. For the sake of simplicity (but without any loss of generality) we will be using the factors E, P and A identified by Osgood et al. [26] in this paper. If the numerical values of the scales within $[-r, r]$ are considered, the coordinates of any object Obj in the semantic space can thus be computed using (1), where $x_{s_i} \in [-r, r] \subset \mathbb{R}$, $r > 0$, is the value of the scale $s_i$ specified by the respondent (in [26] discrete numerical scales were used, i.e. $x_{s_i} \in \{-r, -r+1, \ldots, -1, 0, 1, \ldots, r-1, r\}$, where $r \in \mathbb{N}$; most frequently 7-point scales were used, i.e. $r = 3$), $F^E_{s_i}$,


J. Stoklasová et al.

F^P_{s_i}, F^A_{s_i} ∈ [−1, 1] are the factor loadings of the scale s_i to factors E, P and A respectively, and S = {s_1, . . . , s_n} is the set of scales (bipolar adjectives) used to assess the position of Obj in the semantic space.

$$
(E_{Obj}, P_{Obj}, A_{Obj}) = \left( \frac{\sum_{s_i \in S} x_{s_i} F^{E}_{s_i}}{\sum_{s_i \in S} \left| F^{E}_{s_i} \right|},\; \frac{\sum_{s_i \in S} x_{s_i} F^{P}_{s_i}}{\sum_{s_i \in S} \left| F^{P}_{s_i} \right|},\; \frac{\sum_{s_i \in S} x_{s_i} F^{A}_{s_i}}{\sum_{s_i \in S} \left| F^{A}_{s_i} \right|} \right) \quad (1)
$$
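To make the computation in (1) concrete, each coordinate is a loading-weighted average of the scale values. The following sketch is illustrative only (the function name and all numeric values are assumptions, not taken from the text):

```python
def semantic_position(x, F_E, F_P, F_A):
    """Crisp coordinates (E, P, A) of an object per Eq. (1):
    each coordinate is sum(x_i * F_i) / sum(|F_i|) over all scales."""
    def coord(loadings):
        return sum(xi * fi for xi, fi in zip(x, loadings)) / sum(abs(fi) for fi in loadings)
    return coord(F_E), coord(F_P), coord(F_A)

# Hypothetical example: n = 3 scales on a [-3, 3] universe (r = 3)
x = [2.0, -1.0, 3.0]     # respondent's scale values x_si
F_E = [0.9, 0.8, 0.1]    # factor loadings to E
F_P = [0.2, -0.7, 0.3]   # factor loadings to P
F_A = [0.1, 0.2, 0.9]    # factor loadings to A
E, P, A = semantic_position(x, F_E, F_P, F_A)  # coordinates stay within [-3, 3]
```

Because the denominator normalizes by the summed absolute loadings, each coordinate is guaranteed to stay within the [−r, r] range of the underlying scales.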
If we use the nonnegative values of the scales only (i.e. if we apply the [1, 2r + 1] universe instead of [−r, r] for each scale), the coordinates of the object Obj have to be computed by (2):

$$
(E_{Obj}, P_{Obj}, A_{Obj}) = \left( \frac{\sum_{s_i \in S} \dot{x}^{E}_{s_i} \left| F^{E}_{s_i} \right|}{\sum_{s_i \in S} \left| F^{E}_{s_i} \right|},\; \frac{\sum_{s_i \in S} \dot{x}^{P}_{s_i} \left| F^{P}_{s_i} \right|}{\sum_{s_i \in S} \left| F^{P}_{s_i} \right|},\; \frac{\sum_{s_i \in S} \dot{x}^{A}_{s_i} \left| F^{A}_{s_i} \right|}{\sum_{s_i \in S} \left| F^{A}_{s_i} \right|} \right), \quad (2)
$$

where ẋ^f_{s_i} = x_{s_i} if F^f_{s_i} ≥ 0 and ẋ^f_{s_i} = 2r − x_{s_i} + 2 otherwise, for all f ∈ {E, P, A}, with x_{s_i} ∈ [1, 2r + 1] now denoting the value of the scale s_i on the nonnegative universe. The transformation from the [1, 2r + 1] universe to the [−r, r] universe is straightforward. We will therefore be using (without any loss of generality) the scales with values symmetrically distributed around 0 in this paper, which also allows for a lighter notation in the computational formulas; the notation used in (1) will therefore be adopted further in the text. Once the positions of two different objects in the semantic space are determined, the distance of these two objects can be computed. Since the objects are represented as points in the semantic space, the standard Euclidean distance can be applied to determine their distance in this space. Let us consider two objects o_1 and o_2 and their representations in the semantic space o_1 = (E_{o_1}, P_{o_1}, A_{o_1}) and o_2 = (E_{o_2}, P_{o_2}, A_{o_2}). Their normalized semantic distance d(o_1, o_2) can be computed using (3):

$$
d(o_1, o_2) = \frac{\sqrt{(E_{o_1} - E_{o_2})^2 + (P_{o_1} - P_{o_2})^2 + (A_{o_1} - A_{o_2})^2}}{2r\sqrt{3}}. \quad (3)
$$
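Equation (3) can be sketched in code as follows. Since the coordinates lie in [−r, r], the divisor 2r√3 equals the largest attainable Euclidean distance, so the result always falls in [0, 1]. All names and values below are illustrative:

```python
import math

def semantic_distance(o1, o2, r):
    """Normalized Euclidean distance of two points in the (E, P, A)
    semantic space, Eq. (3); the result lies in [0, 1]."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(o1, o2))) / (2 * r * math.sqrt(3))

# Hypothetical example on a [-3, 3] universe (r = 3):
# opposite corners of the semantic space are at the maximal distance 1
d_max = semantic_distance((-3, -3, -3), (3, 3, 3), r=3)   # -> 1.0
d_same = semantic_distance((1, 0, -2), (1, 0, -2), r=3)   # -> 0.0
```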

The process of semantic differentiation and of finding the distance of representations of different objects (e.g. design alternatives) in the semantic space is easy to understand and use. There are, however, several problems stemming from the crisp representation of the object in the semantic space and from the fact that the appropriateness of the bipolar-adjective pairs might be perceived differently by different respondents (evaluators) for the purpose of design evaluation. The incorrect understanding (interpretation) of the endpoints of the scales can also constitute a possible problem in semantic differentiation and its use in the evaluation of design alternatives—e.g. Huang et al. [11] identify problems with the correct understanding of Kansei words, such as “meaningful” and “meaningless”, and so do Kobayashi and Kinumura [14]; Hsu et al. [10] also deal with evaluators not being clear about the

Attitude-Based Multi-expert Evaluation of Design


meaning of the image-words they are using in their study. A real-number-valued evaluation x_{s_i} on a semantic differential scale s_i, the meaning of which is not well understood or which is not considered to be entirely appropriate for the given purpose by the given evaluator, therefore seems to be lacking an important piece of information. Such an evaluation is represented as a precise number, even though there was uncertainty involved in the process of its determination. Particularly given the evidence of different understandings of the linguistic expressions used to anchor the endpoints of the scales in design evaluation [35], it seems only reasonable to use a modified version of the semantic differential. Stoklasa et al. [31] provide such a tool, which remains simple enough to be applied for design evaluation and still allows for the reflection of uncertainty in the process. This method will be briefly summarized in the next section.

3 Interval-Valued Semantic Differential

In this section, we will briefly summarize the interval-valued modification of the semantic differential proposed in [31]. We will assume three dimensions of the semantic space (and denote them E, P and A as in the original semantic differential), as confirmed in the research listed above; the generalization to more dimensions is straightforward, if needed. With three factors defining the dimensionality of the semantic space, we can, moreover, obtain a graphical representation of the evaluated design alternatives as up-to-three-dimensional objects in the semantic space. These representations will help us define the idea of consensus in design-alternative evaluations. The interval-valued semantic differential proposed in [31] allows for the reflection of the perceived relevance of the bipolar-adjective pairs for the given purpose (which is the evaluation of design alternatives in a given context in this paper). The input procedure in the standard semantic differentiation consists of providing the evaluator with a list of bipolar-adjective pairs and asking him/her to mark the position of the evaluated concept within the universe defined by each bipolar-adjective pair (numerically, a [−r, r] interval can be considered to represent the universes). In general, we can consider n bipolar-adjective scales s_i, i = 1, . . . , n, being used in the semantic differential; the scales need to be chosen in such a way that all the identified dimensions are saturated sufficiently (i.e. for no identified dimension of the semantic space should the factor loadings of all the scales be zero). The interval-valued modification of the method also requires information on the perceived relevance of each scale for the purpose of the evaluation of the given design alternative.
The relevance of each bipolar-adjective pair si for the evaluation of the given design alternative (in the given context) is assessed on another scale r si numerically represented by a [0, 1] interval, where 0 means no relevance at all and 1 means complete relevance, resulting in the perceived relevance of the i-th scale xr si ∈ [0, 1]. These values are subsequently linearly transformed into values ysi on a [0, 2r ] universe, ysi = 2r · xr si .


Fig. 2 Example of an input as provided by the evaluator through the interval-valued semantic differential method proposed in [31] (top) and the notation used in the computations (bottom). The evaluation of a generic design alternative da is considered

An example of this input procedure and the conversion of the obtained values into the values of the variables used further in the computations is presented in Fig. 2. The magnitude of the irrelevance of the scale (expressed as 1 − x_{rs_i}), as perceived by the evaluator, is understood as uncertainty of the evaluation x_{s_i} ∈ [−r, r] provided as a response for the scale s_i. It is therefore transformed into an uncertainty region [x^L_{s_i}, x^R_{s_i}] ⊆ [−r, r] around x_{s_i}. The uncertainty region is defined as symmetrical w.r.t. x_{s_i} as long as [x^L_{s_i}, x^R_{s_i}] ⊆ [−r, r]. If such a symmetrically defined uncertainty region would not fit within the [−r, r] interval, it is shifted so that it lies entirely within the required evaluation interval [−r, r]. In other words, [31] defines these uncertainty regions by (4). Obviously, if a scale s_i is perceived to be 100% relevant, then [x^L_{s_i}, x^R_{s_i}] = [x_{s_i}, x_{s_i}], i.e. there is no uncertainty involved and the interval-valued procedure reduces to the original procedure suggested by Osgood et al. [26], as far as this scale is concerned.

$$
[x^{L}_{s_i}, x^{R}_{s_i}] =
\begin{cases}
\left[x_{s_i} - \frac{w_{s_i}}{2},\; x_{s_i} + \frac{w_{s_i}}{2}\right] & \text{for } \left(x_{s_i} - \frac{w_{s_i}}{2}\right) \geq -r \text{ and } \left(x_{s_i} + \frac{w_{s_i}}{2}\right) \leq r,\\[2pt]
[-r,\; -r + w_{s_i}] & \text{for } \left(x_{s_i} - \frac{w_{s_i}}{2}\right) < -r,\\[2pt]
[r - w_{s_i},\; r] & \text{for } \left(x_{s_i} + \frac{w_{s_i}}{2}\right) > r
\end{cases}
\quad (4)
$$

Note that w_{s_i} = |[x^L_{s_i}, x^R_{s_i}]| = 2r − y_{s_i} and as such the length of the interval of uncertainty w_{s_i} represents the “amount” of also possible alternative values of x_{s_i}. Also note

that irr_i = 1 − x_{rs_i} = w_{s_i}/(2r) ∈ [0, 1], i = 1, . . . , n, is a measure of the irrelevance of scale s_i for the evaluation of the given design alternative as perceived by the given evaluator. The irrelevance is understood as a source of uncertainty of the value x_{s_i} provided by the evaluator. The idea here is that if the scale is perceived as (partially) irrelevant, then the value x_{s_i} might be misspecified with respect to the value that would be appropriate under full relevance of the scale. The irrelevance is therefore transformed into a “region of also possible values of x_{s_i}” around the x_{s_i} value. As such, the higher the perceived irrelevance of the scale, the wider the interval of also possible values becomes. A completely irrelevant scale (x_{rs_i} = 0 = y_{s_i}) results in the whole range of the scale [−r, r] being considered “also possible values”. The reasoning behind this is that if there is no connection (even an indirect one) that the evaluator can see between the bipolar-adjective scale and the object that is being evaluated, then the evaluation expressed on this scale is arbitrary and should not be considered precise at all. Any other value of the scale might be as good an answer as the one actually expressed under these circumstances. Obviously, the perceived irrelevance of the scale is not the only possible source of uncertainty of the evaluations. It is one that can often come up under the original requirement by Osgood et al. [26] for the scale not to have a denotative (descriptive) meaning for the object that is being assessed. Since the bipolar-adjective scales are required to have a mere metaphorical link to the object that is being assessed, it is reasonable to expect that their irrelevance can become an issue for some evaluators that would prevent them from providing precise and reliable evaluations.
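The construction in (4) can be sketched as a small helper that turns a scale value and its perceived relevance into the interval of also possible values (the function name and example values are illustrative assumptions):

```python
def uncertainty_region(x, relevance, r):
    """Interval [x_L, x_R] of 'also possible values' per Eq. (4).
    relevance = x_rs in [0, 1]; the interval width is w = 2r * (1 - relevance).
    The interval is symmetric around x unless it would leave [-r, r],
    in which case it is shifted to fit inside the evaluation universe."""
    w = 2 * r * (1 - relevance)
    lo, hi = x - w / 2, x + w / 2
    if lo < -r:
        return (-r, -r + w)
    if hi > r:
        return (r - w, r)
    return (lo, hi)

# Hypothetical examples on a [-3, 3] universe (r = 3)
uncertainty_region(0.0, 1.0, 3)    # fully relevant scale -> (0.0, 0.0)
uncertainty_region(2.5, 0.8, 3)    # w = 1.2, shifted to fit -> (1.8, 3)
uncertainty_region(0.0, 0.0, 3)    # irrelevant scale -> whole range (-3.0, 3.0)
```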
Other sources of uncertainty of the evaluations might stem, for example, from the lower familiarity of the evaluator with the object, from misunderstandings of the endpoints of the scales (their meaning), from the fact that the evaluator is not fully confident that the expressed value is correct, from fatigue, and from many other sources. In practical applications the scale used to introduce uncertainty into the model (or, to be more precise, to reflect the uncertainty that is already present) can be customized and defined in such a way that it reflects the most relevant source(s) of uncertainty. The conversion of the values of the scale into the “intervals of also possible values” does not need to be linear either [31]. We will, however, keep the “relevance” interpretation and the linear transformation of irrelevance into uncertainty for the purpose of this chapter. Let us now consider the factor loadings F^E_{s_i}, F^P_{s_i}, F^A_{s_i} ∈ [−1, 1] of the scale s_i to the factors E, P and A to be known. Let us also consider the interval-valued evaluations [x^L_{s_i}, x^R_{s_i}] w.r.t. each of the bipolar-adjective pairs to be available for all i = 1, . . . , n, as well as the crisp values x_{s_i}. The position of the object (i.e. design alternative) in the semantic space can now be computed, as suggested in [31], using (5), where [x^L_{s_i}, x^R_{s_i}] · F^f_{s_i} = [F^f_{s_i} · x^L_{s_i}, F^f_{s_i} · x^R_{s_i}] if F^f_{s_i} ≥ 0 and [x^L_{s_i}, x^R_{s_i}] · F^f_{s_i} = [F^f_{s_i} · x^R_{s_i}, F^f_{s_i} · x^L_{s_i}] if F^f_{s_i} < 0, for all f ∈ {E, P, A} and for all s_i ∈ S.


Fig. 3 Different dimensions of uncertainty reflected in the interval-valued semantic differential. The assessment of Obj1 is uncertain in a single dimension (A, left sub-figure), for Obj2 some scales saturating the A and E dimensions were considered partially inappropriate (hence a 2-dimensional area of uncertainty around the crisp representation of Obj2 , middle sub-figure) and scales saturating all three dimensions/factors are considered partially inappropriate for Obj3 (resulting in a 3-dimensional box of uncertainty around the crisp representation of Obj3 in the right sub-figure)

$$
(E^{int}_{Obj}, P^{int}_{Obj}, A^{int}_{Obj}) = \left( \frac{\sum_{s_i \in S} [x^{L}_{s_i}, x^{R}_{s_i}] \cdot F^{E}_{s_i}}{\sum_{s_i \in S} \left| F^{E}_{s_i} \right|},\; \frac{\sum_{s_i \in S} [x^{L}_{s_i}, x^{R}_{s_i}] \cdot F^{P}_{s_i}}{\sum_{s_i \in S} \left| F^{P}_{s_i} \right|},\; \frac{\sum_{s_i \in S} [x^{L}_{s_i}, x^{R}_{s_i}] \cdot F^{A}_{s_i}}{\sum_{s_i \in S} \left| F^{A}_{s_i} \right|} \right) \quad (5)
$$

Here the use of the [−r, r] universes simplifies the notation significantly. If the universe [1, 2r + 1] is considered to avoid negative and zero values, i.e. if the interval evaluations [x'^L_{s_i}, x'^R_{s_i}] ⊆ [1, 2r + 1] are considered, then (5) transforms into (6), where [ẋ^L_{s_i}, ẋ^R_{s_i}]^f = [F^f_{s_i} · x'^L_{s_i}, F^f_{s_i} · x'^R_{s_i}] if F^f_{s_i} ≥ 0 and [ẋ^L_{s_i}, ẋ^R_{s_i}]^f = [|F^f_{s_i}| · (2r − x'^R_{s_i} + 2), |F^f_{s_i}| · (2r − x'^L_{s_i} + 2)] if F^f_{s_i} < 0, for all f ∈ {E, P, A} and for all s_i ∈ S.

$$
(E^{int}_{Obj}, P^{int}_{Obj}, A^{int}_{Obj}) = \left( \frac{\sum_{s_i \in S} [\dot{x}^{L}_{s_i}, \dot{x}^{R}_{s_i}]^{E}}{\sum_{s_i \in S} \left| F^{E}_{s_i} \right|},\; \frac{\sum_{s_i \in S} [\dot{x}^{L}_{s_i}, \dot{x}^{R}_{s_i}]^{P}}{\sum_{s_i \in S} \left| F^{P}_{s_i} \right|},\; \frac{\sum_{s_i \in S} [\dot{x}^{L}_{s_i}, \dot{x}^{R}_{s_i}]^{A}}{\sum_{s_i \in S} \left| F^{A}_{s_i} \right|} \right) \quad (6)
$$

If all the scales are considered completely relevant, then (1) and (5) provide the same representation of the evaluated design alternative (object): a single point in the semantic space (and so do (2) and (6) for [1, 2r + 1] scales). The interval-valued version of the semantic differential can thus be considered a generalization of the original semantic differential method. Let us now consider only the case of [−r, r] underlying evaluation scales. Each evaluated object can be represented by its crisp (single-point) representation computed using (1) as C_Obj = (E_Obj, P_Obj, A_Obj). If some of the scales (bipolar-adjective pairs) are considered less than completely relevant by the evaluator for the purpose of evaluation of the given object Obj, then E^int_Obj, P^int_Obj or A^int_Obj are intervals with nonzero length. In this case, an area of uncertainty defined as D_Obj = E^int_Obj × P^int_Obj × A^int_Obj can be constructed around C_Obj (see Fig. 3 for examples in a three-dimensional semantic space). Any evaluated design alternative can thus be represented by C_Obj and a “box of uncertainty” D_Obj surrounding this point. This uncertainty-box represents positions in the semantic space that can also be occupied by Obj given the lower perceived relevance of some of the scales used in the input phase.
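The interval arithmetic in (5) can be sketched per dimension; the uncertainty box D_Obj is then the Cartesian product of the three resulting intervals. A minimal illustration (all names and numbers are assumptions):

```python
def interval_coordinate(intervals, loadings):
    """One interval coordinate of Eq. (5): the sum of [x_L, x_R] * F over
    the sum of |F|. Multiplying an interval by a negative loading
    swaps its endpoints, as defined in the text."""
    lo_sum = hi_sum = 0.0
    for (lo, hi), f in zip(intervals, loadings):
        if f >= 0:
            lo_sum += f * lo
            hi_sum += f * hi
        else:
            lo_sum += f * hi
            hi_sum += f * lo
    norm = sum(abs(f) for f in loadings)
    return (lo_sum / norm, hi_sum / norm)

# Hypothetical example: two scales on a [-3, 3] universe
intervals = [(1.0, 2.0), (-1.5, 0.5)]   # uncertainty regions [x_L, x_R]
F_E = [0.8, -0.6]                       # factor loadings to E
E_int = interval_coordinate(intervals, F_E)  # -> approximately (0.357, 1.786)
```

Running the same function with the loadings to P and A yields the other two interval coordinates, whose Cartesian product gives the box D_Obj.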

4 Evaluation Using the Interval-Valued Semantic Differential, Consensus of Evaluations

Once the representations of the design alternative da by an evaluator e_k are obtained as (C^{e_k}_{da}, D^{e_k}_{da}), where C^{e_k}_{da} = (E^{e_k}_{da}, P^{e_k}_{da}, A^{e_k}_{da}) and D^{e_k}_{da} = E^{e_k,int}_{da} × P^{e_k,int}_{da} × A^{e_k,int}_{da}, for each of the q evaluators (k = 1, . . . , q), we are left with the issue of the interpretation of (C^{e_k}_{da}, D^{e_k}_{da}) in terms of evaluation. The evaluation of the design alternative da can be assessed solely based on the coordinate of da in the E dimension (if an evaluation dimension E is confirmed to be present by the factor analysis used to define the factor loadings of the scales), e.g. using E_{da} = Σ_{k=1}^{q} E^{e_k}_{da}/q and E^{int}_{da} = Σ_{k=1}^{q} E^{e_k,int}_{da}/q and assessing the distance of E_{da} and E^{int}_{da} from the “best evaluation value” of the [−r, r] universe. Different weights of the evaluators can also be considered in the process. It might also be reasonable to consider the full information obtained through semantic differentiation, i.e. to reflect the remaining dimensions of the semantic space and the respective coordinates of da within the semantic space. If a desired position (crisp or uncertain) of da can be specified (in terms of an ideal or most-desired position in the semantic space), then the distances of (C^{e_k}_{da}, D^{e_k}_{da}) from this ideal evaluation can be assessed and e.g. an average or maximum/minimum distance from this ideal be used for the evaluation purposes. Stoklasa et al. suggest several distance measures in [31] that can be used for this purpose, e.g. the Euclidean distance of the crisp representations of da and the ideal, and a difference in the length of the body-diagonals of D^{e_k}_{da} and the representation of the ideal. Note that the ideal might frequently be specified as a point in the semantic space, i.e. without any uncertainty.
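The equal-weight aggregation of the evaluators' E coordinates mentioned above can be sketched as follows (names and values are illustrative; the interval mean averages the endpoints):

```python
def mean_E(E_values):
    """Group evaluation E_da: mean of the evaluators' crisp E coordinates."""
    return sum(E_values) / len(E_values)

def mean_E_interval(E_intervals):
    """Group interval E_da^int: endpoint-wise mean of the evaluators' E intervals."""
    q = len(E_intervals)
    return (sum(lo for lo, _ in E_intervals) / q,
            sum(hi for _, hi in E_intervals) / q)

# Hypothetical example with q = 3 evaluators on a [-3, 3] universe
E_crisp = [1.0, 2.0, 0.0]
E_int = [(0.5, 1.5), (1.0, 3.0), (-0.5, 0.5)]
mean_E(E_crisp)          # -> 1.0
mean_E_interval(E_int)   # -> approximately (0.333, 1.667)
```

Evaluator weights would replace the equal division by q with a weighted sum, as noted in the text.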
Another important piece of information might be carried by the existence or nonexistence of consensus (or agreement) of the evaluations provided by the q evaluators. The interval-valued semantic differential provides means for the definition of several forms of consensus:

• strong general consensus—can be specified as D = (D^{e_1}_{da} ∩ · · · ∩ D^{e_q}_{da}) ≠ ∅ and C^{e_k}_{da} ∈ D for all k = 1, . . . , q. This situation is illustrated in Fig. 4 and suggests that all the evaluators can find an agreement on the evaluation represented by D, and moreover all their crisp evaluations lie in this common evaluation area D. In this case D can be considered to represent the overall evaluation of da.


Fig. 4 An example of a case of strong general consensus. Two evaluations of a design alternative da are considered: (C^{e_i}_{da}, D^{e_i}_{da}) provided by the evaluator e_i and depicted in green, and (C^{e_j}_{da}, D^{e_j}_{da}) provided by the evaluator e_j and depicted in red. Both C^{e_i}_{da}, C^{e_j}_{da} ∈ D, where D = (D^{e_i}_{da} ∩ D^{e_j}_{da}) ≠ ∅. Projections of the three-dimensional representation of the situation into the projection planes E−P, E−A and P−A are provided in the right section of the figure

• weak general consensus—can be specified as D = (D^{e_1}_{da} ∩ · · · ∩ D^{e_q}_{da}) ≠ ∅, but C^{e_k}_{da} ∉ D for some k = 1, . . . , q. This situation is illustrated in Fig. 5. In this case there is an already existing common ground represented by D, but some of the crisp evaluations do not lie within this area. In this case, either D can be considered to represent the overall evaluation, or a minimum expansion of D containing all (or most of the important) crisp evaluations can be found.
• strong E-consensus—can be specified as E' = (E^{e_1,int}_{da} ∩ · · · ∩ E^{e_q,int}_{da}) ≠ ∅ and E^{e_k}_{da} ∈ E' for all k = 1, . . . , q. In this case (illustrated in Fig. 6) the evaluators can find common ground in the evaluation dimension, but can possibly disagree in the other dimensions. Again, E' can be used to represent the overall evaluation (which disregards all the dimensions other than E).
• weak E-consensus—can be specified as E' = (E^{e_1,int}_{da} ∩ · · · ∩ E^{e_q,int}_{da}) ≠ ∅, but E^{e_k}_{da} ∉ E' for some k = 1, . . . , q. This situation can be considered to be the minimum requirement for the existence of consensus in terms of evaluation. Analogously to the weak general consensus case, either E' can be considered to represent the overall evaluation, or a minimum expansion of E' required to reach strong E-consensus, given by [min_k(min(E^{e_k,int}_{da})), max_k(max(E^{e_k,int}_{da}))] and containing all (or most of the important) crisp evaluations, can be found.
• currently non-existing general consensus—this situation is characterized by D = (D^{e_1}_{da} ∩ · · · ∩ D^{e_q}_{da}) = ∅ and it is illustrated in Fig. 6. In this case there is no con-


Fig. 5 An example of a case of weak general consensus. Two evaluations of a design alternative da are considered: (C^{e_i}_{da}, D^{e_i}_{da}) provided by the evaluator e_i and depicted in green, and (C^{e_j}_{da}, D^{e_j}_{da}) provided by the evaluator e_j and depicted in red. D = (D^{e_i}_{da} ∩ D^{e_j}_{da}) ≠ ∅. Although C^{e_i}_{da} ∈ D, we have C^{e_j}_{da} ∉ D, hence it is not a case of strong general consensus. Projections of the three-dimensional representation of the situation into the projection planes E−P, E−A and P−A are provided in the right section of the figure

sensus in terms of all the dimensions of the semantic space. Still, a strong or weak E-consensus might exist.
• currently non-existing E-consensus or E-dissensus—this situation is characterized by E' = (E^{e_1,int}_{da} ∩ · · · ∩ E^{e_q,int}_{da}) = ∅. In this case there is no common ground in terms of evaluations and changes of the individual evaluations are needed to reach at least a weak form of consensus.

Note that depending on the situation, a different strength and form of consensus might be required. In any case, the introduction of uncertainty in the evaluation process allows for the assessment of the achievability of consensus (since the uncertainty areas represent “also possible coordinates of the design alternative in the semantic space”). Obviously, the investigation of subgroups of the evaluators and the identification of several strong-general-consensus groups might identify a partitioning of the target population. This way a need for a specific targeting of e.g. a marketing campaign, or for the investigation of the needs of a specific subgroup of the population to reach general acceptance, might become apparent. The interval-valued semantic differential thus provides a valuable tool for the evaluation of design alternatives and possibly also for the management of the creative process.
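The consensus forms defined above reduce to interval intersections, which makes them easy to check programmatically. A sketch (illustrative names; the E dimension is assumed to be stored first; only the general and E forms are distinguished):

```python
def box_intersection(boxes):
    """Intersect axis-aligned boxes: per dimension, take the max of the lower
    bounds and the min of the upper bounds. Returns None if empty."""
    out = []
    for dim in zip(*boxes):
        lo = max(iv[0] for iv in dim)
        hi = min(iv[1] for iv in dim)
        if lo > hi:
            return None
        out.append((lo, hi))
    return out

def consensus_type(crisp_points, boxes):
    """Classify the group evaluation per the forms above.
    crisp_points: one (E, P, A) tuple per evaluator.
    boxes: one list of three (lo, hi) intervals per evaluator, E first."""
    D = box_intersection(boxes)
    if D is not None:
        inside = all(all(lo <= c <= hi for c, (lo, hi) in zip(p, D))
                     for p in crisp_points)
        return "strong general" if inside else "weak general"
    # no general consensus: fall back to the E dimension only
    E_lo = max(b[0][0] for b in boxes)
    E_hi = min(b[0][1] for b in boxes)
    if E_lo > E_hi:
        return "E-dissensus"
    inside_E = all(E_lo <= p[0] <= E_hi for p in crisp_points)
    return "strong E" if inside_E else "weak E"

# Hypothetical example with q = 2 evaluators: overlapping boxes,
# crisp points inside the intersection -> strong general consensus
consensus_type([(1.5, 1.5, 1.5), (1.2, 1.8, 1.1)],
               [[(0, 2), (0, 2), (0, 2)], [(1, 3), (1, 3), (1, 3)]])
```

The "weak" variants would additionally report the minimum expansion of D or E' needed to contain all crisp points, as described in the text.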


Fig. 6 An example of a case of strong E-consensus. Two evaluations of a design alternative da are considered: (C^{e_i}_{da}, D^{e_i}_{da}) provided by the evaluator e_i and depicted in green, and (C^{e_j}_{da}, D^{e_j}_{da}) provided by the evaluator e_j and depicted in red. D = (D^{e_i}_{da} ∩ D^{e_j}_{da}) = ∅, hence neither strong nor weak general consensus is reached. At least we have E' = (E^{e_i,int}_{da} ∩ E^{e_j,int}_{da}) ≠ ∅ and E^{e_i}_{da}, E^{e_j}_{da} ∈ E' (see the top and middle subplots in the right column of the figure, where the closeness of the coordinates of C^{e_i}_{da} and C^{e_j}_{da} in the E dimension is apparent). Projections of the three-dimensional representation of the situation into the projection planes E−P, E−A and P−A are provided in the right section of the figure

5 Conclusion

The interval-valued semantic differential proposed in [31] is suggested here for the purpose of design-alternative evaluation. We have discussed the possible benefits of using the interval-valued semantic differential information-input procedure and the calculation of the position of the design alternative in the semantic space for the evaluation of design alternatives. We have introduced definitions of several forms of consensus in the evaluation of design alternatives and discussed the added value of this approach in design-alternative evaluation. The method is identified as a possibly valuable addition to the repertoire of currently used design-evaluation methods, and the use of its outputs in the management of the creative process is suggested.

Acknowledgements This research was supported by the LUT research platform AMBI (Analytics-based management for business and manufacturing industry) and partially also by the Specific University Research Grant IGA_FF_2021_001, as provided by the Ministry of Education, Youth and Sports of the Czech Republic in the year 2021.


References
1. Aya, Z.: A fuzzy AHP-based simulation approach to concept evaluation in a NPD environment. IIE Trans. 37(9), 827–842 (2005). https://doi.org/10.1080/07408170590969852
2. Back, K.W., Bunker, S., Dunnagan, C.B.: Barriers to communication and measurement of semantic space. Sociometry 35(3), 347–356 (1972). https://doi.org/10.2307/2786499
3. Beckmeyer, J.J., Ganong, L.H., Coleman, M., Stafford Markham, M.: Experiences with coparenting scale: a semantic differential measure of postdivorce coparenting satisfaction. J. Fam. Issues 38(10), 1471–1490 (2017). https://doi.org/10.1177/0192513X16634764
4. Carter, R.F., Ruggels, W.L., Chaffee, S.H.: The semantic differential in opinion measurement. Publ. Opin. Quart. 32(4), 666–674 (1968)
5. Chechurin, L., Borgianni, Y.: Understanding TRIZ through the review of top cited publications. Comput. Ind. 82(1), 119–134 (2016). https://doi.org/10.1016/j.compind.2016.06.002
6. Chen, M., Lin, M., Wang, C., Chang, C.A.: Using HCA and TOPSIS approaches in personal digital assistant menu-icon interface design. Int. J. Ind. Ergon. 39(5), 689–702 (2009). https://doi.org/10.1016/j.ergon.2009.01.010
7. Fennell, J.G., Baddeley, R.J.: Reward is assessed in three dimensions that correspond to the semantic differential. PLoS ONE 8(2), 1–15 (2013). https://doi.org/10.1371/journal.pone.0055588
8. Geng, X., Chu, X., Zhang, Z.: A new integrated design concept evaluation approach based on vague sets. Expert Syst. Appl. 37(9), 6629–6638 (2010)
9. Hsiao, S.W., Chen, C.H.: A semantic and shape grammar based approach for product design. Des. Stud. 18(3), 275–296 (1997). https://doi.org/10.1016/S0142-694X(97)00037-9
10. Hsu, S.H., Chuang, M.C., Chang, C.C.: A semantic differential study of designers’ and users’ product form perception. Int. J. Ind. Ergon. 25(4), 375–391 (2000). https://doi.org/10.1016/S0169-8141(99)00026-8
11.
Huang, Y., Chen, C.H., Khoo, L.P.: Products classification in emotional design using a basic-emotion based semantic differential method. Int. J. Ind. Ergon. 42(6), 569–580 (2012). https://doi.org/10.1016/j.ergon.2012.09.002
12. Jindo, T., Hirasago, K., Nagamachi, M.: Ergonomics development of a design support system for office chairs using 3-D graphics. Int. J. Ind. Ergon. 15, 49–62 (1995)
13. Kervyn, N., Fiske, S.T., Yzerbyt, V.Y.: Integrating the stereotype content model (warmth and competence) and the Osgood semantic differential (evaluation, potency, and activity). Eur. J. Soc. Psychol. 43(7), 673–681 (2013). https://doi.org/10.1002/ejsp.1978
14. Kobayashi, M., Kinumura, T.: A method of gathering, selecting and hierarchizing kansei words for a hierarchized kansei model. Comput.-Aided Des. Appl. 14(4), 464–471 (2017). https://doi.org/10.1080/16864360.2016.1257188
15. Kozlova, M., Chechurin, L., Efimov-Soini, N.: Levelized function cost: economic consideration for design concept evaluation. In: Chechurin, L., Collan, M. (eds.) Advances in Systematic Creativity: Creating and Managing Innovations, Palgrave Macmillan, pp. 267–297. https://doi.org/10.1007/978-3-319-78075-7_16
16. Likert, R.: A technique for the measurement of attitudes. Arch. Psychol. 22(140), 5–55 (1932)
17. Lin, R., Lin, C.Y., Wong, J.: An application of multidimensional scaling in product semantics. Int. J. Ind. Ergon. 18(2–3), 193–204 (1996). https://doi.org/10.1016/0169-8141(95)00083-6
18. Lo, C., Wang, P., Chao, K.: A fuzzy group-preferences analysis method for new-product development. Expert Syst. Appl. 31(4), 826–834 (2006). https://doi.org/10.1016/j.eswa.2006.01.005
19. Mansoor, M., Mariun, N., AbdulWahab, N.I.: Innovating problem solving for sustainable green roofs: potential usage of TRIZ—theory of inventive problem solving. Ecol. Eng. 99, 209–221 (2017). https://doi.org/10.1016/j.ecoleng.2016.11.036
20.
Marinelli, N., Fabbrizzi, S., Alampi Sottini, V., Sacchelli, S., Bernetti, I., Menghini, S.: Generation Y, wine and alcohol. A semantic differential approach to consumption analysis in Tuscany. Appetite 75, 117–127 (2014). https://doi.org/10.1016/j.appet.2013.12.013


21. Mindak, W.A.: Fitting the semantic differential to the marketing problem. J. Mark. 25(4), 28–33 (1961)
22. Mondragón, S., Company, P., Vergara, M.: Semantic differential applied to the evaluation of machine tool design. Int. J. Ind. Ergon. 35(11), 1021–1029 (2005). https://doi.org/10.1016/j.ergon.2005.05.001
23. Nagamachi, M.: Kansei engineering: a new consumer-oriented technology for product development. Int. J. Ind. Ergon. 15, 3–11 (1995)
24. Ng, C.Y., Chuah, K.B.: Evaluation of design alternatives’ environmental performance using AHP and ER approaches. IEEE Syst. J. 8(4), 1182–1189 (2014)
25. Niskanen, V.A.: Metric truth as a basis for fuzzy linguistic reasoning. Fuzzy Sets Syst. 57(1), 1–25 (1993). https://doi.org/10.1016/0165-0114(93)90117-Z
26. Osgood, C.E., Suci, G.J., Tannenbaum, P.H.: The Measurement of Meaning. University of Illinois Press, Chicago (1957)
27. Ross, I.: Self-concept and brand preference. J. Bus. 44(1), 38–50 (1971)
28. Stoklasa, J., Talášek, T., Stoklasová, J.: Semantic differential and linguistic approximation—identification of a possible common ground for research in social sciences. In: Proceedings of the International Scientific Conference Knowledge for Market Use 2016, Societas Scientiarum Olomucensis II, Olomouc, pp. 495–501 (2016)
29. Stoklasa, J., Talášek, T., Kubátová, J., Seitlová, K.: Likert scales in group multiple-criteria evaluation. J. Mult.-Valued Logic Soft Comput. 29(5), 425–440 (2017)
30. Stoklasa, J., Talášek, T., Stoklasová, J.: Reflecting emotional aspects and uncertainty in multi-expert evaluation: one step closer to a soft design-alternative evaluation methodology. In: Chechurin, L., Collan, M. (eds.) Advances in Systematic Creativity: Creating and Managing Innovations, Palgrave Macmillan, pp. 299–322 (2019a). https://doi.org/10.1007/978-3-319-78075-7
31.
Stoklasa, J., Talášek, T., Stoklasová, J.: Semantic differential for the twenty-first century: scale relevance and uncertainty entering the semantic space. Qual. Quant. 53, 435–448 (2019b). https://doi.org/10.1007/s11135-018-0762-1
32. Tiwari, V., Jain, P.K., Tandon, P.: Product design concept evaluation using rough sets and VIKOR method. Adv. Eng. Inform. 30(1), 16–25 (2016). https://doi.org/10.1016/j.aei.2015.11.005
33. Verhagen, T., van den Hooff, B., Meents, S.: Toward a better use of the semantic differential in IS research: an integrative framework of suggested action research. J. Assoc. Inf. Syst. 16(2), 108–143 (2015)
34. You, H.C., Chen, K.: Applications of affordance and semantics in product design. Des. Stud. 28(1), 23–38 (2007). https://doi.org/10.1016/j.destud.2006.07.002
35. Zhang, Z.J., Gong, L., Jin, Y., Xie, J., Hao, J.: A quantitative approach to design alternative evaluation based on data-driven performance prediction. Adv. Eng. Inform. 32(1), 52–65 (2017). https://doi.org/10.1016/j.aei.2016.12.009
36. Zhu, G.N., Hu, J., Qi, J., Gu, C.C., Peng, Y.H.: An integrated AHP and VIKOR for design concept evaluation based on rough number. Adv. Eng. Inform. 29(3), 408–418 (2015). https://doi.org/10.1016/j.aei.2015.01.010

An Investigation of Hidden Shared Linkages Among Perceived Causal Relationships in Cognitive Maps Mahinda Mailagaha Kumbure , Pasi Luukka , Anssi Tarkiainen , Jan Stoklasa , and Ari Jantunen

Abstract This study investigates cause-and-effect relationships in cognitive maps and the coexistence of pairs of such relationships in the cognitive maps of a chosen group of decision-makers. We call the existence of a pair of causal relationships shared by the group of decision-makers in their cognitive maps an inter-causal relationship. We investigate the coexistence of the chosen pairs of causal relationships in the maps in terms of one of the causal relationships being a necessary and/or sufficient condition for the existence of the other, using the tools of fuzzy-set qualitative comparative analysis. We develop and propose a framework to extract and examine the inter-causal relationships from the cognitive maps. The proposed method is based on set-theoretic consistency and coverage measures. We used empirical data (71 cognitive maps), collected through a cognitive mapping approach performed by individuals in management teams within a strategic decision-making simulation process, to test the proposed approach. Empirical results show that our method can identify inter-causal relationships and provide analytical results for a more complex interpretation of the information arising from the structure of cognitive maps.

M. Mailagaha Kumbure (B) · P. Luukka · A. Tarkiainen · J. Stoklasa · A. Jantunen School of Business and Management, LUT University, Yliopistonkatu 34, 53850 Lappeenranta, Finland e-mail: [email protected] P. Luukka e-mail: [email protected] A. Tarkiainen e-mail: [email protected] J. Stoklasa e-mail: [email protected] A. Jantunen e-mail: [email protected] J. Stoklasa Department of Economic and Managerial Studies, Faculty of Arts, Palacký University Olomouc, Kˇrížkovského 8, 771 47 Olomouc, Czech Republic © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 P. Luukka and J. Stoklasa (eds.), Intelligent Systems and Applications in Business and Finance, Studies in Fuzziness and Soft Computing 415, https://doi.org/10.1007/978-3-030-93699-0_2


Keywords Cognitive map · Consistency · Coverage · Decision-making · Inter-causal relationship

1 Introduction

There is no doubt that individual perceptions play a vital role in creating, evaluating, and choosing decision options in the decision-making process [22]. It is common knowledge that individual perceptions are primarily derived from personal experience and knowledge. The perceptions might also be influenced by other factors, such as beliefs, interests, and expectations [22]. This means that the rationales behind problems or situations are perceived in various ways, and they might differ from one person to another. To understand this phenomenon and make efficient analyses, there has been a growing interest in using cognitive mapping as a participatory method [10]. A cognitive map, as a cause-effect network of qualitative aspects [29], helps individuals represent their thinking about a problem or situation. Cognitive maps have been of continuing interest in many applications in social science, including strategic management and business research [3, 4, 12], engineering and technology [16], industrial and manufacturing applications [5, 14, 20, 27], medicine [8, 18, 19, 30], politics [1], and environmental research [17, 28].

Shared cognitive maps, or shared linkages therein, refer to a shared mental representation of a problem or situation with the concepts of a team, built across gathering, sharing, and jointly analyzing and integrating [3]. The word “shared” might have two different meanings: either dividing things up or having things in common [15]. In strategic management studies, both meanings of the word “shared” are suitable in terms of cognitive representations, because on some occasions responsibilities and expertise are divided between team members, and at other times they undertake the tasks in common.
That is important to mention here because this study uses data from cognitive maps that were created by individuals in strategic management teams, but looks for structures therein that are shared by the whole team or that can be considered to constitute the shared understanding (cognition) of the system/concept represented by the causal maps. We are particularly interested in the ability to infer the existence of one causal relationship in the map from the existence of another one.

The relationship between two events/cases (elements in the cognitive map) such that one causes the other is defined as causality. In cognitive science, causality is a critical aspect that plays a vital role in decision-making and often supports choosing the course of action that seems best for achieving the expected outcomes [6]. In cognitive maps, causal phenomena are revealed as cause-and-effect relationships between concepts, and according to those relationships, the topology and workflow of the effects are designed [2]. Cognitive maps are representations of individuals' perceived causal structures (i.e., networks of causal relationships) in a given context. The networked nature of perceived causalities implies that specific cause-and-effect relationships are interrelated (e.g., they either have a causal connection between them, or the existence of one might imply the existence of the other in the causal map and the cognitive


structure it represents), even though these linkages are not directly marked into the cognitive map. We define the relationships of coexistence of two causal relationships in the cognitive maps of the members of a chosen group as inter-causal relationships. More specifically, if we assume four concepts A, B, C, and D in a cognitive map, then the inter-causal relationship between a causal relationship A → B and a causal relationship C → D is defined as the presence of A → B implying the presence of C → D in the shared cognitive structure. Examining the cognitive maps implies assessing the causal relationships to extract valuable information for future operations in a particular area.

In this study, we attempt to investigate the hidden shared relationship of coexistence between two causal relationships in a group cognitive structure (i.e., across the cognitive maps of the members of the analysed group): the inter-causal relationships. To the best of our knowledge, there is no previous research that explores such a linkage between two perceived causal relationships; this makes our study novel in this regard. In the practical example of the use of our method, we limit ourselves to the role of individual perceived causalities presented in the cognitive maps regarding a strategic decision-making process and their inter-causal relationships. To investigate this, we develop and present a methodology based on set-theoretic consistency and coverage measures. We used empirical data collected from a cognitive mapping approach performed by the individuals in management teams within a strategic decision-making process simulation. The simulation was run as a part of a graduate course in business at LUT University, Lappeenranta, Finland.

The rest of this chapter is organized as follows: Section 2 presents a description of the data used, the key theoretical aspects applied, and the methodology. Section 3 presents and discusses the observed results. Section 4 summarizes the main findings and presents concluding remarks.

2 Data, Theoretical Aspects, and Methodology

2.1 Data Sample

To investigate the relation between perceived causal relationships, we analyze a data sample of cognitive maps collected from an eight-week business simulation task in a controlled setting that was performed with graduate students of a business-oriented program. During the simulation task, the students were guided to understand and interpret the operations of international trading strategies in global business in a dynamic and competitive environment. This simulation resulted in a collection of cognitive maps shaped by the individuals. The data sample contains 71 individual-level cognitive maps (originally belonging to 16 management teams in the simulation, but the grouping is not relevant for the purposes of this study) created based on the 40 strategic-level constructs presented in Table 6 in the Appendix. From this list, each individual selected the 12 constructs seen as the most relevant from his/her


knowledge and unique views on the situation to create his/her cognitive map. Each cognitive map also included the total cumulative shareholder returns (TCSR), as the causation of TCSR was the main point of investigation in the course/simulation.

To carry out the necessary analysis and calculations with the cognitive maps, all individual cognitive maps were converted into association matrices. The 40 strategic-level constructs plus the TCSR defined each dimension of the matrix (i.e., a 41 × 41 association matrix was used to represent an individual map). Each cell value of the matrix represents the strength of the causal relationship between two elements in the cognitive map; these strengths were chosen from the set {−3, −2, −1, 1, 2, 3}. This allowed all the cognitive maps to be represented by a 41 × 41 association matrix; the rows/columns that corresponded to strategic concepts not used in the cognitive map of the given individual consisted entirely of zero values. It is also noteworthy that we adopted these strength values (i.e., {−3, −2, −1, 1, 2, 3}) for the cognitive mapping experiment according to the implications presented by [13]. As they reported, strengths from −3 to −1 indicate negative causal relationships, and strengths from +1 to +3 indicate positive causal relationships. For example, a participant can hold a strong negative belief (with a strength of −3) or a strong positive belief (with a strength of +3) about a particular causal link according to his/her opinion.
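As a minimal illustration of this representation (our own sketch, not the authors' code; the edge-list input format and function name are assumptions), the following builds a 41 × 41 association matrix from a map's list of weighted causal links:

```python
import numpy as np

# 40 strategic constructs plus the TCSR node (index 41), as described above.
N_CONCEPTS = 41
VALID_STRENGTHS = {-3, -2, -1, 1, 2, 3}

def association_matrix(edges, n=N_CONCEPTS):
    """Build an n x n association matrix from a cognitive map's edge list.

    `edges` maps (cause, effect) pairs of 1-based concept indices to a
    strength in {-3, -2, -1, 1, 2, 3}; concepts the individual did not
    use simply keep all-zero rows and columns.
    """
    m = np.zeros((n, n), dtype=int)
    for (i, j), strength in edges.items():
        if strength not in VALID_STRENGTHS:
            raise ValueError(f"invalid strength {strength} for edge {(i, j)}")
        m[i - 1, j - 1] = strength
    return m

# Hypothetical map fragment: demand (2) boosts sales (19),
# dividends (16) boost TCSR (41), competition (1) dampens demand (2).
example = association_matrix({(2, 19): 3, (16, 41): 3, (1, 2): -2})
```

Stacking such matrices for all 71 individuals then gives the data structure the subsequent analysis operates on.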

2.2 Cognitive Maps

A cognitive map, originally conceptualized by [29], is a graphical structure that illustrates the knowledge and beliefs underlying human learning and behavior [9]. A cognitive map is produced around a specific problem of interest by an individual or a group familiar with the relevant field. Accordingly, participants can organize, visualize, and share their experiences, perceptions, and interpretations [28]. A cognitive map consists of nodes representing the variables and a set of directional edges representing the causal relationships among the variables [23]. The edges are also associated with numerical values (weights) representing the strength of the causal relationships.

In our data sample, the cognitive maps were created by individuals considering the impacts of 12 specific strategic issues (chosen by each individual from a pre-defined list of 40 strategic issues) on each other and on the TCSR during the business simulation task. Therefore, the nodes in the cognitive maps represent those 12 strategic issues plus the TCSR. An edge (linkage) with an arrowhead between two nodes represents the direction of a causal effect, and its weight represents the strength of the causal relation. Figure 1 displays an example of a cognitive map containing positive and negative causal relationships between the elements with associated strengths.

Fig. 1 A cognitive map example in the data collected through the strategic business simulation task [diagram: nodes such as Competition in the market, Demand, Dividends, Number of shares outstanding, Sales, Product market decisions, Market selection decision, Promotion, Long-term profitability, Mission & vision, Brand/company image, Product selling prices, and Total cumulative shareholder return, connected by directed edges with strengths between −3 and 3]

2.3 Inter-causal Relationships

Inter-causality is a relationship between one causation and another such that the existence of the former in a cognitive map implies the existence of the latter. This type of relationship is not directly visible in individual causal maps, but knowledge of the existence of such relationships can provide valuable insights into the shared cognitive structure of the group under analysis. We should also point out that inter-causal relationships are not direct ones; that is, they might not be part of a “causality chain” within the cognitive maps. They represent a coexistence type of relationship between causal relationships in the cognitive structures of individuals, and as such the pairs of causal relationships can have different strengths and can even be independent of each other (causality-wise) in the cognitive maps.

The essential feature we investigate in an inter-causal relationship is the existence of both causal relationships in the cognitive structures of individuals represented by cognitive maps. It is also possible to examine whether the existence of one relationship seems to be a necessary or sufficient condition for the existence of the other across the cognitive maps (or the individual cognitive structures they represent) within the analyzed group. This study attempts to provide a method to detect and interpret such inter-causal relationships in the cognitive maps.
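To make the coexistence notion concrete, a small sketch (hypothetical helper and toy data, not from the chapter) derives, for each individual, a crisp membership value indicating whether a given causal relation exists in that individual's association matrix:

```python
import numpy as np

def presence(maps, cause, effect):
    """Return a membership vector over individuals: 1.0 where the causal
    relation cause -> effect (1-based indices) has non-zero strength."""
    return np.array([1.0 if m[cause - 1, effect - 1] != 0 else 0.0 for m in maps])

# Toy group of three 41 x 41 association matrices (hypothetical data).
maps = [np.zeros((41, 41), dtype=int) for _ in range(3)]
maps[0][1, 40] = 2   # individual 1 holds 2 -> 41 ...
maps[0][0, 40] = 3   # ... and 1 -> 41
maps[1][1, 40] = 1   # individual 2 holds only 2 -> 41

p = presence(maps, 2, 41)   # membership in "map contains 2 -> 41"
q = presence(maps, 1, 41)   # membership in "map contains 1 -> 41"
both = np.minimum(p, q)     # coexistence of both relations per individual
```

Feeding such vectors into the consistency and coverage measures introduced in the next subsection then quantifies how far the presence of one relation implies the presence of the other across the group.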


2.4 Set-Theoretic Consistency and Coverage Measures

This study mainly focuses on the consistency and coverage measures in fuzzy set-theoretic qualitative comparative analysis (fsQCA) that were originally defined by Ragin in [21]. fsQCA is a powerful approach that applies a holistic perspective to identify similarities and differences across cases. fsQCA attempts to interpret causal (cause-effect) relationships between a predictor and an outcome by identifying which conditions are sufficient or necessary to produce the outcome [24]. Consistency and coverage are two important notions in fsQCA.

Let us assume we have a set X of observations available and we are interested in two features of these observations: feature P and feature Q. We assume that the feature P can be represented by the subset P of observations that have this feature, P ∼ P ⊆ X, and similarly Q ∼ Q ⊆ X. Alternatively, we can also allow the observations to have the features only partially, in which case P and Q would be fuzzy subsets of X, in other words P ∈ F(X) and Q ∈ F(X), where for any x ∈ X we denote the membership degree of x to P as P(x) ∈ [0, 1] and its membership degree to Q as Q(x) ∈ [0, 1].

We are now interested in knowing whether the feature P can be considered a necessary or sufficient condition for the feature Q also being present in the given observation. To investigate this, we can focus on the relationship P ⇒ Q and examine its correspondence with the available data. In other words, if we are interested in whether P is a sufficient condition for Q, we need to focus on the P ⊆ Q relationship, and if P being a necessary condition for Q is of interest, we need to focus on Q ⊆ P. Set-theoretic consistency refers to the proportion of cases of P coinciding with Q among all cases of P in the data. In other words, it provides a measure of the empirical evidence supporting the investigated claim (for example, P ⊆ Q).
If the consistency value is low for a causal relation, then the empirical evidence does not support the existence of the given causal configuration. This means that the existence of P is not sufficient for the outcome Q to be present. Coverage, in turn, refers to the proportion of the cases of the outcome Q that are associated with P, considering all cases of Q in the data [12, 23]. Coverage often works against consistency, which means that high coverage may come with low consistency and vice versa [11]. To calculate the consistency and coverage measures, we used the standard formulas presented in [26]:

Consistency(P ⇒ Q) = Card(P ∩ Q) / Card(P) = ( Σ_{i=1}^{n} min(P(x_i), Q(x_i)) ) / ( Σ_{i=1}^{n} P(x_i) )   (1)

Coverage(P ⇒ Q) = Card(P ∩ Q) / Card(Q) = ( Σ_{i=1}^{n} min(P(x_i), Q(x_i)) ) / ( Σ_{i=1}^{n} Q(x_i) )   (2)

where P and Q are two fuzzy sets on X. Here we assume that Card(P) = Σ_{x∈X} P(x) ≠ 0 and Card(Q) = Σ_{x∈X} Q(x) ≠ 0 with respect to the relation P ⇒ Q. When P ∩ Q = P, then Consistency(P ⇒ Q) = 1 (perfect consistency),


and this implies that there is no evidence contradicting the given relationship in the data;¹ we can then also conclude that P is a sufficient condition for Q. Similarly, Coverage(P ⇒ Q) = 1 implies that P ∩ Q = Q, and thus we can conclude that P is a necessary condition for Q. If there are other “causes” of Q, then the coverage score will be less than 1. A relation with a consistency of 1 and a coverage of 1 would be an ideal case, indicating that P is the only cause of Q and that there are no counterexamples in the data [25].

In general, we prefer a good balance between consistency and coverage for a particular situation, so that the outcome is compelling both theoretically and empirically. If a relation has very high consistency but low coverage, it does not describe many cases at all, and the relationship might be too weak. In contrast, if a case has very high coverage with low consistency, that also indicates a weak relationship, because there is insufficient evidence for it in the data.
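Equations (1) and (2) translate directly into code; the sketch below (function names are ours, and the membership values are illustrative) computes both measures for membership vectors over the observations:

```python
def consistency(p, q):
    """Consistency(P => Q) = sum_i min(P(x_i), Q(x_i)) / sum_i P(x_i), Eq. (1).

    `p` and `q` are sequences of membership degrees in [0, 1];
    Card(P) must be non-zero, as assumed in the text.
    """
    overlap = sum(min(pi, qi) for pi, qi in zip(p, q))
    return overlap / sum(p)

def coverage(p, q):
    """Coverage(P => Q) = sum_i min(P(x_i), Q(x_i)) / sum_i Q(x_i), Eq. (2)."""
    overlap = sum(min(pi, qi) for pi, qi in zip(p, q))
    return overlap / sum(q)

# Five observations with fuzzy membership in P and Q (illustrative values).
P = [1.0, 0.8, 0.6, 0.0, 0.2]
Q = [1.0, 1.0, 0.9, 0.5, 0.0]
c = consistency(P, Q)   # overlap 2.4 over Card(P) = 2.6, about 0.923
v = coverage(P, Q)      # overlap 2.4 over Card(Q) = 3.4, about 0.706
```

When P is fully contained in Q (every P(x_i) ≤ Q(x_i)), the overlap equals Card(P) and the consistency is exactly 1, matching the perfect-consistency case described above.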

2.5 Hypothesis Models

The primary goal of this study, as previously noted, is to identify inter-causal relationships. Accordingly, we developed two hypotheses regarding the shape of the relationship from one causation to another and tested them with the empirical data. The other two possible hypotheses (positive implies negative, and negative implies positive) are not investigated, to keep the presentation of the results simple and the length of the chapter reasonable. As the aim of this chapter is to introduce the necessary methodology and show an example of its performance, the focus on the following two hypotheses is sufficient:

• Hypothesis 1 (H1): If Ci → Cj is positive, then Cp → Cq is positive
• Hypothesis 2 (H2): If Ci → Cj is negative, then Cp → Cq is negative

where Ci, Cj, Cp, and Cq indicate four different strategic variables (elements), and Ci → Cj and Cp → Cq indicate two different causal relations (i ≠ p and j ≠ q) in a map. Based on our data sample, the effect from one variable to another might be positive or negative. Nonexistent effects (i.e., effects with a strength of zero) are not considered in the subsequent analysis; nevertheless, the proposed methodology can also process the absence of a causal relationship as a part of the investigated inter-causal relationships. Therefore, we consider positive and negative weights on the causal relationships to define the hypotheses. Accordingly, H1 indicates that the existence of a positive causal relation implies the existence of another positive one, and H2 indicates that the existence of a negative causal relationship implies that the other one exists too and is negative. These hypotheses allow us to identify positive and negative inter-causal relationships.
¹ If P and Q are fuzzy sets on X, then P ∩ Q is a fuzzy set on X as well, and its membership function is defined, for the purpose of our calculations, using the min t-norm; that is, for any x ∈ X we have (P ∩ Q)(x) = min{P(x), Q(x)}.


Fig. 2 An example of the frequency of each strength value for a selected causal relation [bar chart: frequencies 2, 1, 2, 23, 8, 19, and 16 for the strength values −3, −2, −1, 0, 1, 2, and 3, respectively]

2.6 Research Process

We started the analysis by collecting the frequency of each causal relationship for each strength value, going through all association matrices of the cognitive maps. A strength (weight) value for a particular causal relationship can vary from −3 to 3, and it is possible that across the individual cognitive maps the same causal relationship appears several or many times with different strengths. For example, consider the frequency vector (2, 1, 2, 23, 8, 19, 16) for a causal relationship, corresponding to the strength vector (−3, −2, −1, 0, 1, 2, 3). This indicates that 2 individuals weighted the causal strength by −3, 1 individual by −2, and so on for the considered causal relation during the simulation process. This example is graphically presented in Fig. 2. In this way, we can also identify how many times a given causal relationship is positive, negative, or considered nonexistent (strength value of 0).

Once the frequencies of the different strengths of each investigated causal relationship were collected and visualized using histograms, it was easy to filter out those causal relationships that are not considered to exist by any of the individuals (the strength frequency vector (0, 0, 0, 71, 0, 0, 0) in our case with 71 decision-makers). For those causal relationships that were assigned a non-zero strength at least once, we can proceed to define the membership functions representing the “positive” or “negative” causal effects, to be able to investigate Hypotheses 1 and 2. There are different types of membership functions that characterize different types of fuzzy sets; among them, the trapezoidal membership function is commonly used in current applications. The trapezoidal membership function μ(x; α, β, γ, δ), formed by four input parameters α, β, γ, and δ such that α ≤ β ≤ γ ≤ δ, is defined as follows:

μ(x; α, β, γ, δ) =
    0                  if x ≤ α or x ≥ δ
    (x − α)/(β − α)    if α ≤ x ≤ β
    1                  if β ≤ x ≤ γ
    (δ − x)/(δ − γ)    if γ ≤ x ≤ δ
                                           (3)

In our study, the criteria used to form the trapezoidal fuzzy sets were based on the frequencies of the strengths of each causal relation and the linguistic labels positive and negative. It is worth mentioning here that we used these linguistic labels to capture reasonable characteristics of the causalities in the cognitive maps used. To compute the fuzzy numbers with the trapezoidal membership function, we set their significant values using expert knowledge: (0, 1, 3, 3) for positive and (−3, −3, −1, 0) for negative (i.e., μ_positive ∼ μ(x; 0, 1, 3, 3) and μ_negative ∼ μ(x; −3, −3, −1, 0)). Once the membership values for all observations were obtained, consistency and coverage values were computed using formulas (1) and (2).

Next, we evaluated the validity of the hypotheses based on the consistency and coverage values obtained. We prioritized consistency first during the evaluation and then used the coverage scores to gain additional support for each hypothesis. Concerning consistency, a high value makes the investigated claim stronger. A specific value for the required consistency can also be defined depending on the cases being evaluated. In this study, we considered the evidence in favor of the hypotheses to be sufficient and reasonable if the corresponding rules had a consistency value ranging from acceptable through high to excellent. Accordingly, we classified the inter-causal relationships into the excellent, high, and acceptable ranges if their consistency value was in the [0.9, 1], [0.75, 0.9), and [0.6, 0.75) intervals, respectively, and we considered that low consistencies do not support the validity of the cases. In contrast to the consistency thresholds, we assumed that an acceptable coverage lies between 0.25 and 0.65 (0.25 ≤ coverage ≤ 0.65) and explains the existence of a particular inter-causal relationship. In this way, we evaluated the validity of each hypothesis; the next section presents and discusses the results of the analysis.
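The two fuzzy sets above can be sketched as follows (a minimal illustration with our own function names; the parameter quadruples are those quoted in the text, and the degenerate sides α = β and γ = δ are treated as vertical edges of the plateau):

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function of Eq. (3), with degenerate sides
    (a == b or c == d) treated as vertical edges of the plateau."""
    if b <= x <= c:           # plateau, including degenerate endpoints
        return 1.0
    if x <= a or x >= d:      # outside the support
        return 0.0
    if x < b:                 # rising edge
        return (x - a) / (b - a)
    return (d - x) / (d - c)  # falling edge

def mu_positive(x):
    return trapezoid(x, 0, 1, 3, 3)     # mu_positive ~ mu(x; 0, 1, 3, 3)

def mu_negative(x):
    return trapezoid(x, -3, -3, -1, 0)  # mu_negative ~ mu(x; -3, -3, -1, 0)

# Strength +2 is fully positive, +0.5 partially positive, -2 fully negative.
degrees = [mu_positive(2), mu_positive(0.5), mu_negative(-2)]
```

Applying these functions to the strength a given individual assigned to a causal relation yields the membership degrees that enter formulas (1) and (2).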

3 Results and Discussion

This section analyzes and discusses the results obtained from the proposed framework applied to identifying essential inter-causal relationships in the shared cognitive maps. We based our analysis on the two hypotheses; accordingly, we examined 3932 relationships between perceived causal relations under H1 and 176 under H2, extracted from the cognitive maps in the data. We then calculated consistency and coverage values for each of those relationships to determine whether it is a significant relationship or not (i.e., to validate each of the hypotheses). In terms of consistency, three different boundaries (for acceptable, high, and excellent) were initially set for categorizing the significance of the inter-causal relationships under each hypothesis. In this sense, we discuss all cases over four different intervals


Table 1 Inter-causal relationships (%) detected within different ranges of consistency and coverage scores under H1

Consistency   | Coverage > 0.65 (%) | Coverage [0.25, 0.65] (%) | Coverage [0, 0.25) (%)
≥ 0.9         | 0.0                 | 0.20                      | 0.79
[0.75, 0.9)   | 0.0                 | 0.64                      | 1.55
[0.6, 0.75)   | 0.1                 | 1.07                      | 2.14
[0, 0.6)      | 5.14                | 37.03                     | 51.35

of consistency, [0, 0.6), [0.6, 0.75), [0.75, 0.9), and [0.9, 1], which reflect weak, acceptable, high, and excellent levels of evidence, respectively. In addition, we consider three levels of coverage scores, [0, 0.25), [0.25, 0.65], and (0.65, 1], to reflect the relevance of the coverage together with the consistency.

In the analysis, we first examined the inter-causal relationships under H1; an overview of the results is presented in Table 1. The table summarizes the relative number of inter-causal relationships recognized within the different ranges of consistency and coverage, displayed as percentages of the total number of cases investigated. From Table 1, it is apparent that only a small share of the inter-causal relationships (2.01% = 0.2% + 0.64% + 1.07% + 0.1%, i.e., about 79 cases) is significant (with consistency ≥ 0.6 and coverage ≥ 0.25) under H1. In particular, 0.2% (∼8) of the inter-causal relationships were found with excellent consistencies (consistency value 0.9 or higher) and reasonable coverages (coverage ∈ [0.25, 0.65]). Also, 0.64% (∼25) of the cases appear in the consistency range [0.75, 0.9) and 1.07% (∼42) in the range [0.6, 0.75), holding sufficient evidence under H1. There is no case with a coverage of 0.65 or more when consistency ≥ 0.75. In summary, most of the cases (93.52% = 5.14% + 37.03% + 51.35%, i.e., about 3677 cases) do not have enough support in the data based on the consistency values (consistency < 0.6). Even though some cases (4.48%, ∼176) have sufficient evidence in their favor (consistency ≥ 0.6), they seem too weak in terms of the coverage scores (coverage < 0.25). Table 2 presents the overview of the results

Table 2 Inter-causal relationships (%) detected within different ranges of consistency and coverage scores under H2

Consistency   | Coverage > 0.65 (%) | Coverage [0.25, 0.65] (%) | Coverage [0, 0.25) (%)
≥ 0.9         | 5.11                | 3.98                      | 8.52
[0.75, 0.9)   | 0.0                 | 0.57                      | 0.0
[0.6, 0.75)   | 0.57                | 1.7                       | 0.0
[0, 0.6)      | 13.64               | 30.11                     | 36.36


of the inter-causal relationships found within the different intervals of the set-theoretic scores under H2. According to the table, there is a considerable number of cases (11.93% = 5.11% + 3.98% + 0.57% + 1.7% + 0.57%, i.e., about 26 cases), compared to all cases investigated, that have sufficient support in the data (consistency ≥ 0.6) and reasonable coverages (coverage ≥ 0.25). It is interesting to see that 5.11% (∼9) of the cases obtained excellent consistencies and high coverages under H2; in fact, all of these inter-causal relationships except one hold a consistency of 1 and a coverage of 1. These results are discussed further below; in the following subsections, we thoroughly discuss the most important results summarized in the above tables under H1 and H2 separately.
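The cell percentages in Tables 1 and 2 can be reproduced from a list of (consistency, coverage) pairs with a simple binning sketch (hypothetical helper names; the thresholds are those defined above, with coverage 0.65 itself falling in the middle interval):

```python
from collections import Counter

def consistency_label(c):
    """Consistency ranges used in Tables 1 and 2."""
    if c >= 0.9:
        return ">=0.9"
    if c >= 0.75:
        return "[0.75,0.9)"
    if c >= 0.6:
        return "[0.6,0.75)"
    return "[0,0.6)"

def coverage_label(v):
    """Coverage ranges used in Tables 1 and 2."""
    if v > 0.65:
        return ">0.65"
    if v >= 0.25:
        return "[0.25,0.65]"
    return "[0,0.25)"

def range_table(scores):
    """Percentage of (consistency, coverage) pairs falling in each cell."""
    cells = Counter((consistency_label(c), coverage_label(v)) for c, v in scores)
    return {cell: 100.0 * n / len(scores) for cell, n in cells.items()}

# Four illustrative score pairs, e.g. from evaluating H1 candidates.
table = range_table([(1.0, 0.4), (0.92, 0.4), (0.5, 0.1), (0.33, 1.0)])
```

Running this over all 3932 pairs under H1 (or 176 under H2) would yield the table cells reported above.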

3.1 Inter-causal Relationships Under H1

We scrutinized numerous inter-causal relationships using the set-theoretic scores under hypothesis H1 and found sufficient and reasonable evidence for 255 cases. Interpreting and exhibiting all such cases is therefore a real challenge, so we summarize and discuss the essential cases that need to be reported. Under H1, we investigate possible linkages between positive causal relations in the cognitive maps.

Figure 3 displays the consistency and coverage values obtained under H1 for 40 selected inter-causal relationships within all ranges of the set-theoretic scores. In the figure, the boundaries of the consistency and coverage ranges are presented by horizontal dashed lines in different colors. From this figure, one can clearly see how the set-theoretic scores are distributed over each inter-causal relationship and which of them have sufficient and reasonable evidence in favor of their existence. Let us consider some examples in each range of the consistency and coverage scores. Table 6 in the Appendix provides the practical meaning of the integer labels of the strategic issues, for easier presentation of the investigated inter-causal relationships.

It is apparent that some cases have excellent support for their existence, with very high consistency and reasonable coverage scores. For example, the relation (2 → 41) ⇒ (1 → 41) holds a consistency of 0.92 and a coverage of 0.4. Also, the cases (20 → 37) ⇒ (15 → 37), (23 → 16) ⇒ (16 → 41), and (8 → 41) ⇒ (21 → 41) have fully consistent support (consistency = 1) associated with appropriate coverages (0.4, 0.42, and 0.5, respectively). Moreover, we can observe that some relationships, for example (12 → 41) ⇒ (10 → 41) and (2 → 1) ⇒ (16 → 41), have considerable support from the evidence, obtaining acceptable consistencies (0.62 and 0.65) and reasonable coverages (0.26 and 0.56). Besides, there are full coverage scores (coverage = 1) for the relations (12 → 2) ⇒ (10 → 2) and (20 → 23) ⇒ (10 → 2), but the corresponding consistencies (0.33 and 0.25) are not at an acceptable level. This indicates that there is not sufficient evidence in favor of these relationships. Also, for the remaining cases with consistency values below 0.6, we do not find enough evidence in favor of the corresponding relationships.

Fig. 3 Consistency and coverage values on the selected 40 inter-causal relationships under H1 [two-panel chart: the upper panel shows the consistency values and the lower panel the coverage values for each of the 40 selected inter-causal relationships, with the boundaries of the consistency and coverage ranges drawn as horizontal dashed lines]


Table 3 All the cases with excellent consistency values under H1

(Ci → Cj) ⇒ (Cp → Cq)      Consistency   Coverage
(1 → 24) ⇒ (16 → 41)       1             0.08
(3 → 12) ⇒ (16 → 41)       1             0.09
(8 → 41) ⇒ (1 → 41)        1             0.10
(8 → 41) ⇒ (2 → 41)        1             0.23
(8 → 41) ⇒ (21 → 41)       1             0.50
(11 → 2) ⇒ (2 → 19)        1             0.16
(12 → 1) ⇒ (16 → 41)       1             0.14
(15 → 16) ⇒ (16 → 41)      1             0.02
(15 → 16) ⇒ (19 → 1)       1             0.06
(15 → 16) ⇒ (19 → 24)      1             0.05
(15 → 16) ⇒ (24 → 41)      1             0.03
(15 → 16) ⇒ (30 → 19)      1             0.07
(15 → 16) ⇒ (33 → 1)       1             0.10
(15 → 16) ⇒ (38 → 41)      1             0.33
(15 → 41) ⇒ (19 → 41)      1             0.19
(15 → 41) ⇒ (24 → 41)      1             0.14
(19 → 16) ⇒ (16 → 41)      1             0.07
(19 → 22) ⇒ (23 → 41)      1             0.14
(20 → 37) ⇒ (2 → 1)        1             0.12
(20 → 37) ⇒ (15 → 37)      1             0.40
(20 → 37) ⇒ (16 → 41)      1             0.05
(20 → 37) ⇒ (23 → 41)      1             0.04
(21 → 2) ⇒ (2 → 19)        1             0.12
(21 → 2) ⇒ (23 → 41)       1             0.07
(21 → 23) ⇒ (23 → 41)      1             0.09
(22 → 16) ⇒ (16 → 41)      1             0.09
(23 → 16) ⇒ (16 → 41)      1             0.42
(23 → 24) ⇒ (24 → 41)      1             0.28
(23 → 37) ⇒ (16 → 41)      1             0.09
(29 → 10) ⇒ (23 → 41)      1             0.18
(30 → 41) ⇒ (1 → 41)       1             0.10
(33 → 12) ⇒ (23 → 41)      1             0.11
(33 → 41) ⇒ (1 → 41)       1             0.13
(2 → 41) ⇒ (1 → 41)        0.92          0.40

To get more insight into the results under H1, we present all cases with excellent consistencies (i.e., consistency ≥ 0.9) in Table 3. We have already discussed the cases that hold perfect consistencies and reasonable coverage scores (see the bold case in the table). Focusing on the other information in the table, however, it is apparent that most cases have a consistency of 1 but shallow coverage (less than 0.25), for example, (21 → 2) ⇒ (23 → 41). This means that even though the relationship has a consistency of 1, the particular condition is empirically trivial (irrelevant). Table 4 illustrates the interpretation of the selected inter-causal relationships along with their respective consistencies and coverages under H1. The identified inter-causal relationships that are supported by excellent consistency and appropriate coverage appear logical. Such inter-causal relationships reveal deeper structural meanings in the causal maps. The direct causal effect of demand on shareholder return (2 → 41), accompanied by the causal effect of market share on shareholder return (1 → 41), reflects the understanding that shareholder return depends on the size of the market and the firm’s share of it. For the inter-causal relationship between the direct effect of the corporate tax rate on the equity ratio (20 → 37) and the effect of debt on the equity ratio (15 → 37), the underlying logic might be knowledge about the issues that influence firms’ solvency. The inter-causal relationship between the effect of profitability on dividends (23 → 16) and the effect of dividends on shareholder return (16 → 41) is very clear. When the effect of in-house R&D on shareholder return (8 → 41) is accompanied by the effect of intense competition on shareholder return (21 → 41), these causalities together can be understood to signal the perception of innovation as a competitive strategy.

Table 4 Interpretation of chosen inter-causal relationships and their set-theoretic scores under H1; P-M decisions stands for product-market decisions, LTP for long-term profitability, TCSR for total cumulative shareholder returns. All causal relationships have positive strength

| Antecedent causal relationship | Consequent causal relationship | Consistency | Coverage |
|---|---|---|---|
| 2. Demand → 41. TCSR | 1. Market share → 41. TCSR | 0.92 | 0.40 |
| 20. Corporate tax rate → 37. Equity ratio | 15. Long-term debt → 37. Equity ratio | 1.00 | 0.40 |
| 23. LTP → 16. Dividends | 16. Dividends → 41. TCSR | 1.00 | 0.42 |
| 8. In-house R&D → 41. TCSR | 21. Market competition → 41. TCSR | 1.00 | 0.50 |
| 12. Product selling prices → 41. TCSR | 10. P-M decisions → 41. TCSR | 0.62 | 0.26 |
| 2. Demand → 1. Market share | 16. Dividends → 41. TCSR | 0.65 | 0.56 |
| 12. Product selling prices → 2. Demand | 10. P-M decisions → 2. Demand | 0.33 | 1.00 |
| 20. Corporate tax rate → 23. LTP | 10. P-M decisions → 2. Demand | 0.25 | 1.00 |
| 21. Market competition → 2. Demand | 23. LTP → 41. TCSR | 1.00 | 0.07 |

30 M. Mailagaha Kumbure et al.

An Investigation of Hidden Shared Linkages Among Perceived …

31

We can clearly see that the identified inter-causal relationships with high consistency discussed above can be interpreted in the framework of the actual simulation task and can be considered to “make sense” in the simulated reality. We should, however, remark that the proposed methodology for the identification of inter-causal relationships can also find highly consistent inter-causal relationships that might not be easily interpretable in the context of the given system/simulation/economic theory. Even this, however, does not mean that such relationships are coincidental or incorrectly identified. We discuss this issue further in the following subsection. The fact that inter-causal relationships are relationships of coexistence rather than “causality chains” needs to be taken into account in their interpretation. Obviously, inter-causal relationships with low consistency and insufficient coverage should not be considered to appear in the shared cognitive structure of the group.
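The set-theoretic measures used throughout this section can be sketched in a few lines of code. The sketch below is illustrative and assumes the crisp case: each causal relationship is represented by the set of cognitive maps in which it appears, consistency is the share of maps containing the antecedent relationship that also contain the consequent one, and coverage is the analogous share computed with respect to the consequent. The toy sets mirror the scores of the case (8 → 41) ⇒ (21 → 41) from Table 3; the map identifiers themselves are made up.

```python
# Illustrative crisp-set sketch of the consistency and coverage measures.
# X: set of cognitive maps containing the antecedent causal relationship.
# Y: set of cognitive maps containing the consequent causal relationship.

def consistency(X, Y):
    """Share of maps holding the antecedent that also hold the consequent."""
    return len(X & Y) / len(X) if X else 0.0

def coverage(X, Y):
    """Share of maps holding the consequent that also hold the antecedent."""
    return len(X & Y) / len(Y) if Y else 0.0

# Hypothetical data: the antecedent appears in 2 maps, the consequent in 4,
# and both antecedent maps are among them.
X = {1, 2}
Y = {1, 2, 5, 6}
print(consistency(X, Y))  # 1.0
print(coverage(X, Y))     # 0.5
```

A consistency of 1 combined with a very low coverage (say, 0.05) corresponds to the “empirically trivial” cases discussed above: the implication always holds, but it accounts for only a small share of the occurrences of the consequent relationship.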

3.2 Inter-causal Relationships Under H2

Turning now to the empirical evidence on the evaluation of H2, we have found sufficient and reasonable evidence for 36 cases. Figure 4 illustrates the consistency and coverage scores of the selected 40 inter-causal relationships. In this figure, we specifically focused on eight ideal inter-causal relationships (consistency = 1 and coverage = 1) under H2. This is an interesting result, and we discuss it in detail later. Apart from this, the cases (20 → 37) ⇒ (17 → 16) and (8 → 41) ⇒ (1 → 41) have strong support, with a consistency of 1 and considerably high coverages of 0.67 and 0.5, respectively. We can also see that there is sufficient evidence in favor of some cases, for example, (12 → 2) ⇒ (20 → 23), which obtains a consistency of 0.67 and a reasonable coverage of 0.5. In contrast, there is weak evidence from the consistency in some cases (e.g., (15 → 41) ⇒ (15 → 16)) even though they have reasonable coverages. Also, relations such as (10 → 19) ⇒ (21 → 1) and (2 → 12) ⇒ (21 → 12) hold perfect consistencies with strong coverage values, revealing that the existing evidence does not support the particular relations. In this way, we identified all significant inter-causal relationships under H2. Table 5 illustrates the interpretation of the selected inter-causal relationships found under H2 and their respective consistencies and coverages. We need to keep in mind that the inter-causal relationships analysed here do not suggest the existence of a “causality chain” from one causal relationship to the other. The existence of a consistent inter-causal relationship with high coverage simply means that the existence of the first (antecedent) causal relationship in a causal map implies that the second (consequent) causal relationship can be found there too. This explains why, even though the inter-causal relationships in Table 5 have excellent consistencies and coverages, they may appear irrelevant and meaningless. The face validity check of an inter-causal relationship needs to be performed in a different way than one would perform a validity check of causal relationships. It might not make sense to assess the co-existence of a pair of causal relationships in the shared cognitive structure against the theoretical assumptions. The inter-causal relationship does not suggest a causal

Fig. 4 Consistency and coverage values on the selected 40 inter-causal relationships under H2

Table 5 Interpretation of chosen inter-causal relationships and their set-theoretic scores under H2; P-M decisions stands for product-market decisions, LTP for long-term profitability. All causal relationships have negative strength

| Antecedent causal relationship | Consequent causal relationship | Consistency | Coverage |
|---|---|---|---|
| 10. P-M decisions → 2. Demand | 19. Sales → 23. LTP | 1.00 | 1.00 |
| 10. P-M decisions → 19. Sales | 19. Sales → 16. Dividends | 1.00 | 1.00 |
| 10. P-M decisions → 19. Sales | 30. Promotion → 19. Sales | 1.00 | 1.00 |
| 19. Sales → 16. Dividends | 10. P-M decisions → 19. Sales | 1.00 | 1.00 |
| 19. Sales → 16. Dividends | 30. Promotion → 19. Sales | 1.00 | 1.00 |
| 19. Sales → 23. LTP | 10. P-M decisions → 2. Demand | 1.00 | 1.00 |
| 30. Promotion → 19. Sales | 10. P-M decisions → 19. Sales | 1.00 | 1.00 |
| 30. Promotion → 19. Sales | 19. Sales → 16. Dividends | 1.00 | 1.00 |

link between the causal relationships. Their existence (with high consistency and coverage) should instead be interpreted in the cognitive context as representing the (mis)conceptions of the decision-makers, possibly shared in the group. Looking at the mere comparison of the number of possible inter-causal relationships under H1, which focuses on positive causal relationships, and H2, which focuses on negative ones, it is striking how much the number of causal relationships differs between H1 and H2. It appears that the majority of the respondents find it easier to focus on positive causal relationships, or they only consider the strength of a relationship, but not its sign. Apparently, the respondents find it difficult to formulate negative causal relationships in their cognitive maps, or may even misinterpret the actual meaning of a negative strength of a causal relationship. For example, there is one inter-causal relationship where the negative impact of sales on dividends (19 → 16) is associated with the negative impact of promotion on sales (30 → 19). Still, this provides us with valuable insights into the shared cognitive structure of the group and the (in)consistency thereof with our expectations and with the relevant economic theories.

4 Conclusion and Future Directions

The purpose of this study was to investigate the relationships between causal relationships (i.e., inter-causal relationships) in individual cognitive maps and their existence within the shared cognitive structure of a given group of decision-makers. To accomplish this, we developed a methodology using set-theoretic consistency and coverage measures. The developed method was applied to empirical data of cognitive maps collected through a strategic decision-making process. The analysis was then carried out by establishing two hypotheses based on the positive and negative characteristics of each relationship. From the empirical evidence obtained,


Table 6 Pool of strategic issues on sustainable return to shareholders

| ID | Strategic issue | ID | Strategic issue |
|---|---|---|---|
| 1 | Market share | 22 | Short-term profitability |
| 2 | Demand | 23 | Long-term profitability |
| 3 | Own manufacturing | 24 | Growth of the company |
| 4 | Contract manufacturing | 25 | Employee training and education |
| 5 | Inventory management | 26 | Consumer price elasticity |
| 6 | Investment in production and plants | 27 | R & D employee turnover |
| 7 | Number of R & D personnel | 28 | Wages of R & D employees |
| 8 | In-house R & D | 29 | Mission and vision |
| 9 | Buying technology and design licenses | 30 | Promotion |
| 10 | Product-market decisions (technology) | 31 | Transportation cost |
| 11 | Feature offered | 32 | Interest rates |
| 12 | Product selling prices | 33 | Market selection decisions |
| 13 | Logistics priorities | 34 | Brand, company image |
| 14 | Transfer prices | 35 | Capacity allocation |
| 15 | Long-term debt | 36 | Network coverage |
| 16 | Dividends | 37 | Equity ratio |
| 17 | Number of shares outstanding | 38 | Environmental sustainability |
| 18 | Internet loans | 39 | Supplier selection |
| 19 | Sales | 40 | Supply chain ethics |
| 20 | Corporate tax rate | 41 | Total cumulative shareholder returns |
| 21 | Competition in the market | | |

the findings based on the set-theoretic consistency and coverage scores suggest that there are some inter-causal relationships in the cognitive maps with strong support from the data. In particular, we found sufficient and reasonable evidence for 255 inter-causal relationships under H1 and 36 inter-causal relationships under H2. We are aware that our methodology may have some limitations. For example, the information represented in cognitive maps largely depends on the participants’ knowledge, experience, and beliefs. Therefore, we cannot always guarantee the accuracy of the data in all maps concerning the particular situation. In addition, the results of the proposed approach would become increasingly complex and challenging to interpret as the number of concepts (i.e., causal conditions) increases. Despite this, the proposed method for the identification of inter-causal relationships is helpful in revealing the networked structure of the perceived causalities. A deeper understanding of this inter-relatedness of the perceived causalities (operationalized

An Investigation of Hidden Shared Linkages Among Perceived …

35

with inter-causal relationships as the coexistence of causal relationships within a cognitive structure) gives a better description of the respondents’ overall logic and of the shared cognitive structure in the group than the assessment of separate causal relations. This is a significant methodological contribution to the study of cognitive maps. As far as we know, no other study has attempted to investigate inter-causal relationships in cognitive maps. We believe that the approach presented in this study could be useful in cognitive-mapping-related studies for scrutinizing and interpreting inter-causal relationships in a meaningful way. For example, a cognitive mapping technique can be used to study sustainable tourism policies [7]. Because of the complexity of the policies underpinning such cognitive maps, we are confident that our method has the potential to scrutinize policy issues and their relations so that policymakers can more easily reach their goals. Furthermore, we recommend that further research be undertaken in large manufacturing systems where a chemical process is conceptualized in fuzzy cognitive maps (see, for example, [27]). Here, the most influential chemical elements and their inter-causal relations could be examined using the proposed method, which would make the manufacturing process more effective and successful by controlling the settings of the crucial factors and actions.

References

1. Avdeeva, Z.K., Kovriga, S.V.: On governance decision support in the area of political stability using cognitive maps. IFAC-PapersOnLine 51(30), 498–503 (2018). https://doi.org/10.1016/j.ifacol.2018.11.277
2. Ayala, A.P., Azuela, J.H.S.A., Gutiérrez, A.: Cognitive maps: an overview and their application for student modeling. Computación y Sistemas 10 (2007)
3. Bergman, J.P., Knutas, A., Luukka, P., Jantunen, A., Tarkiainen, A., Karlik, A., Platonov, V.: Strategic interpretation on sustainability issues-eliciting cognitive maps of boards of directors. Corp. Govern. 16(1), 162–186 (2016)
4. Bergman, J.P., Luukka, P., Jantunen, A., Tarkiainen, A.: Cognitive diversity, managerial characteristics and performance development among the cleantech. Int. J. Knowl.-Based Org. 10(1) (2020). https://doi.org/10.4018/IJKBO.2020010101
5. Bevilacqua, M., Ciarapica, F.E., Mazzuto, G.: Analysis of injury events with fuzzy cognitive maps. J. Loss Prev. Process Ind. 25, 677–685 (2012)
6. Carvalho, J.P., Tome, J.A.B.: Fuzzy Mechanisms for Qualitative Causal Relations. Springer, Berlin, Heidelberg (2000)
7. Farsari, I., Butler, R.W., Szivas, E.: The use of cognitive mapping in analysing sustainable tourism policy: methodological implications. Tour. Recreat. Res. 145–160 (2010)
8. Froelich, W., Papageorgiou, E.I., Samarinas, M., Skriapas, K.: Application of evolutionary fuzzy cognitive maps to the long-term prediction of prostate cancer. Appl. Soft Comput. 12, 3810–3817 (2012)
9. Gray, S.A., Zanre, E., Gray, S.R.J.: Fuzzy cognitive maps as representation of mental models and group beliefs. Fuzzy Cogn. Maps Appl. Sci. Eng. 29–48 (2014)
10. Gray, S.A., Gray, S., De Kok, J.L., Helfgott, A.E.R., O’Dwyer, B., Jordan, R., Nyaki, A.: Using fuzzy cognitive mapping as a participatory approach to analyze change, preferred states, and perceived resilience of social-ecological systems. Ecol. Soc. 20(2) (2015). https://doi.org/10.5751/ES-07396-200211


11. Kent, R.: Using fsQCA: A Brief Guide and Workshop for Fuzzy-Set Qualitative Comparative Analysis (Teaching Notes). University of Manchester (2008)
12. Kumbure, M.M., Tarkiainen, A., Luukka, P., Stoklasa, J., Jantunen, A.: Relation between managerial cognition and industrial performance: an assessment with strategic cognitive maps using fuzzy-set qualitative comparative analysis. J. Bus. Res. 114, 160–172 (2020)
13. Langfield-Smith, K., Wirth, A.: Measuring differences between cognitive maps. J. Oper. Res. Soc. 43, 1135–1150 (1992)
14. Mendonca, M., Angelico, B., Arruda, L.V.R., Neves, F.J.: A dynamic fuzzy cognitive map applied to chemical process supervision. Eng. Appl. Artif. Intell. 26, 1199–1210 (2013)
15. Mohammed, S., Ferzandi, L., Hamilton, K.: Metaphor no more: a 15-year review of the team mental model construct. J. Manag. 36, 876–910 (2010)
16. Motlagh, O., Tang, S.H., Ismail, N., Ramil, A.R.: An expert fuzzy cognitive map for reactive navigation of mobile robots. Fuzzy Sets Syst. 201, 105–121 (2012)
17. Mourhir, A., Rachidi, T., Papageorgiou, E.I., Karim, M., Alaoui, F.S.: A cognitive map framework to support integrated environmental assessment. Environ. Modell. Softw. 77, 81–94 (2016). https://doi.org/10.1016/j.envsoft.2015.11.018
18. Nápoles, G., Grau, I., Bello, R., Grau, R.: Two-steps learning of Fuzzy Cognitive Maps for prediction and knowledge discovery on the HIV-1 drug resistance. Exp. Syst. Appl. 41(3), 821–830 (2014). https://doi.org/10.1016/j.eswa.2013.08.012
19. Papageorgiou, E.I., Froelich, W.: Multi-step prediction for pulmonary infection with the use of evolutionary fuzzy cognitive maps. Neurocomputing 92, 28–35 (2012)
20. Papageorgiou, E.I., Markinos, A.T., Gemtos, T.A.: Fuzzy cognitive map based approach for predicting yield in cotton crop production as a basis for decision support system in precision agriculture application. Appl. Soft Comput. 11, 3643–3657 (2011)
21. Ragin, C.C.: Redesigning Social Inquiry: Fuzzy Sets and Beyond (2008)
22. Robbins, S.P.: Organizational Behavior, 11th edn. Pearson Education, Upper Saddle River, NJ (2005)
23. Schneider, C., Wagemann, C.: Set-Theoretic Methods for the Social Sciences: A Guide to Qualitative Comparative Analysis. Cambridge University Press, Cambridge, UK (2012)
24. Skaaning, S.E.: Assessing the robustness of crisp-set and fuzzy-set QCA results. Sociol. Methods Res. 40, 391–408 (2011)
25. Stoklasa, J., Luukka, P., Talášek, T.: Set-theoretic methodology using fuzzy sets in rule extraction and validation—consistency and coverage revisited. Inf. Sci. 412–413, 154–173 (2017)
26. Stoklasa, J., Talášek, T., Luukka, P.: On consistency and coverage measures in the fuzzified set-theoretic approach for social sciences: dealing with ambivalent evidence in the data. In: Proceedings of the 36th International Conference on Mathematical Methods in Economics, pp. 521–526 (2018)
27. Stylios, C.D., Groumpos, P.P.: Application of fuzzy cognitive maps in large manufacturing systems. IFAC Proc. Vol. 31(20), 521–526 (1998). https://doi.org/10.1016/S1474-6670(17)41848-9
28. Tepes, A., Neumann, M.B.: Multiple perspectives of resilience: a holistic approach to resilience assessment using cognitive maps in particular engagement. Water Res. 178 (2020)
29. Tolman, E.C.: Cognitive maps in rats and men. Psychol. Rev. 55(4), 189–208 (1948)
30. Zhu, Q., Wang, R., Wang, Z.: A cognitive map model based on spatial and goal-oriented mental exploration in rodents. Behav. Brain Res. 256, 128–139 (2013)

A New Framework for Multiple-Criteria Decision Making: The Baseline Approach

Jan Stoklasa and Mariia Kozlova

Abstract This chapter proposes a new multi-criteria decision-making (MCDM) problem formulation. We generalize the standard MCDM problem by assigning a specific role to one alternative from the pool of alternatives—to the one the decision maker is currently in possession of. We call this current solution a baseline. We propose that the baseline can be treated differently from the other alternatives and in fact can represent the reference solution for the whole decision-making problem. It can even serve as a basis for the determination or modification of criteria weights. The introduction of the baseline in MCDM problems allows for more customizability in the modeling of real-life decision-making. Two rules for criteria weight formation based on the baseline are introduced and discussed: Appreciating Possessed—weights of criteria are proportional to the satisfaction of the criteria by the baseline; Craving Unavailable—higher weights are assigned to those criteria that are less satisfied by the baseline while lower weights are assigned to criteria that the baseline satisfies better. Combining these two rules gives birth to several common behavioral effects, such as reluctance to switch to an alternative identical with the baseline; unwillingness to switch to a better alternative if satisfied with the baseline; endless switching between two alternatives in case of poor satisfaction; and seemingly exaggerated impact of a new feature introduction. The overall satisfaction with the baseline is quantified, however the specific role of the baseline alternative gives the name to the proposed approach. Overall, we provide a formal model for many real-life decision-making problems that are difficult to model under the standard MCDM problem formulation.

J. Stoklasa (B) · M. Kozlova School of Business and Management, LUT University, Yliopistonkatu 34, 53850 Lappeenranta, Finland e-mail: [email protected] M. Kozlova e-mail: [email protected] J. Stoklasa Faculty of Arts, Department of Economic and Managerial Studies, Palacký University Olomouc, Křížkovského 8, 771 47 Olomouc, Czech Republic © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 P. Luukka and J. Stoklasa (eds.), Intelligent Systems and Applications in Business and Finance, Studies in Fuzziness and Soft Computing 415, https://doi.org/10.1007/978-3-030-93699-0_3


Keywords Decision support · Multiple-criteria decision-making · Problem formulation · Weights determination · Baseline

1 Introduction

Behavioral issues in decision-making have been recognized and extensively studied in the domains of economics [17, 28], finance [27, 32], and operations research [9, 10]. Recent developments concern the very basics of the latter, the multiple-choice problem. Kogut [21] shows that a decision maker’s preferences and choice consistency depend on the strategy of reducing the set of alternatives. Trueblood and Pettibone, in a series of experiments [29], elicit biases such as, for example, a preference for a non-dominating alternative. Asadabadi believes that a decision maker may be willing to reconsider the weights of criteria after the decision is made and offers a methodology to treat this possible weight volatility in the decision-making problem [1]. This paper continues this inquiry by proposing a potential underlying mechanism of behaviorism in multi-criteria decision-making (MCDM). We argue that the standard MCDM problem formulation might be too much of a simplification of real-life problems: first of all, it treats all the alternatives homogeneously; it does not ask where the experience (expertise) for the determination of criteria weights stems from; and it also assumes (implicitly) that there is no current solution to the problem (i.e., a currently chosen alternative). We show that many more modeling possibilities can be achieved by generalizing the standard MCDM problem formulation in a simple way—by the introduction of the concept of a baseline. The contribution of this paper is twofold: (1) it suggests a novel formulation of the decision-making problem that generalizes the standard formulation, and (2) it proposes a decision-making model that aims to explain some frequently encountered decision biases. By decision biases, we understand deviations from the suggestions of standard normative MCDM models. The main idea this paper revolves around is a different approach to criteria weight determination.
We postulate that in many cases the weights of criteria are defined neither independently of the alternatives nor by considering all of the alternatives [23, 26, 35], but instead using one particular alternative. This particular alternative, denoted the baseline alternative (or simply the baseline, or B), is the alternative that currently occupies the “solution slot”. We show that the determination of criteria weights based on the current (possibly non-optimal) alternative can explain some specific decision-making patterns (constant switching within a constant set of alternatives, reluctance to accept a slightly better alternative, etc.) that can be encountered in practice. The new MCDM formulation is able to explain some real-life decision-making behavior and can therefore serve as a modeling tool in behavioral and social-science research as well as in marketing and industrial applications. The remainder of the chapter is structured as follows. The next section contains a reminder of the standard MCDM problem and the section afterwards presents the new formulation of the MCDM problem. In Sect. 4 we present in detail the


underlying decision-making (and criteria weight determination) styles available in the new MCDM problem formulation, and in Sect. 5 we derive and illustrate several behavioral effects that arise from the new MCDM formulation. Finally, we summarize the findings and future research directions in the Conclusions.

2 Definition of a Standard MCDM Problem

In the standard MCDM approach, we consider n alternatives A1, …, An which are to be evaluated based on m criteria C1, …, Cm. The types of criteria (benefit, cost, target value) are known and the underlying scales of the criteria (i.e., the universes for criteria values, or in other words for the evaluations of alternatives with respect to the criteria) are also given. The usual goals of decision-making are: (a) selection of the best alternative out of the set of alternatives; (b) ordering of the alternatives; (c) decisions on the (un)acceptability of the alternatives; other goals of decision-making, such as the classification of the alternatives, can also be considered. In order to reach one or more of these goals, the overall evaluations of the alternatives are usually computed by aggregating the evaluations of the alternatives with respect to the criteria. For this purpose, the evaluations are frequently normalized to unify the evaluation scale for all the criteria and aggregated using the weights of criteria specified by the decision maker. Some approaches, such as those based on rule bases and fuzzy rule bases, do not even require the normalization of evaluations and can aggregate values of criteria on different scales (see e.g. some of our papers with rule-base aggregation), or even do without the knowledge of the weights of the criteria. For simplicity, we will focus in this chapter on the choice of the best alternative and assume that all the evaluations of the alternatives with respect to all the criteria are already available as values from a given continuous benefit-type scale. The scale [0, 10] is chosen for all numerical examples in this paper, where 10 represents the best imaginable evaluation, and 0 represents the worst imaginable evaluation for each criterion. We will also assume that the goal of the decision-maker is the selection of the best alternative.
Note that this does not narrow the scope of the paper, since overall evaluations that can be used for the ranking of the alternatives (and also for the decisions on the (un)acceptability of the alternatives under some conditions) will be computed anyway. The normalized weights of criteria, w1, …, wm, such that 0 ≤ wj ≤ 1 for all j = 1, …, m and w1 + · · · + wm = 1, are usually specified by the decision maker. There are many methods for weights determination, ranging from the direct assignment of weights through points, percentages or graphical scales (see e.g. [4]), through the subsequent assignment of weights as in the Metfessel allocation method [22], to indirect methods such as pairwise-comparison based methods [5, 26]. And new methods for the determination of criteria weights are still being proposed (see e.g. [6, 12, 16, 20]). The weights are either assigned to the criteria irrespective of the alternatives (not reflecting the evaluations of alternatives with respect to the criteria), or taking into account, for example, the ranges of the evaluations of all the alternatives with respect


Table 1 MCDM problem under the standard formulation; three alternatives and two criteria with equal normalized weights are considered. Weighted average is used for the aggregation

| | Criterion 1 | Criterion 2 | Overall evaluation | Choice |
|---|---|---|---|---|
| Alternative 1 | 7 | 4 | 5.5 | × |
| Alternative 2 | 2 | 8 | 5.0 | |
| Alternative 3 | 5 | 5 | 5.0 | |
| Weights of criteria | 0.5 | 0.5 | | |

to each criterion, or reflecting the evaluations of the alternatives in another way, such as in compensation analysis, in the multi-criteria utility function framework, etc. [8, 14]. Normalization of the weights can be performed in other ways as well (e.g., so that the highest weight is set to 1). Rule-based approaches to MCDM and multiple-criteria evaluation may not even require the knowledge of the weights or their normalization, since the importance of the criteria can already be embedded in the rules. An example of a simple MCDM problem under the standard formulation with the goal of selecting the best alternative is summarized in Table 1. The alternative with the highest overall evaluation is selected as the solution to the MCDM problem with three alternatives and two criteria; the weighted average is used to compute the overall evaluations. Note that in the example in Table 1 all the alternatives are treated identically: the same set of criteria weights is used to calculate the overall evaluation of each of the alternatives. Once alternative 1 is selected, it remains the best alternative as long as the weights of the criteria, the set of alternatives, and the evaluations of the alternatives with respect to the criteria do not change. Also note that if more alternatives share the same highest evaluation, there is no normative way of selecting one of them as a solution—this can constitute a complication in the decision-making. The usual, although not frequently stated, assumption is that no alternative initially assumes a specific role—all of them are simply considered to be elements of the set of possible alternatives A = {A1, …, An} (see [35]). The set A can contain either a finite or an infinite number of alternatives. Either way, this setting implies that all the potential solutions (alternatives in A) are treated the same, i.e., no alternative is currently chosen.
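The aggregation summarized in Table 1 can be reproduced in a few lines of code; the sketch below simply applies the weighted average to the example data.

```python
# Weighted-average aggregation for the example in Table 1: three
# alternatives, two benefit-type criteria on the [0, 10] scale, and
# equal normalized weights.

evaluations = {
    "Alternative 1": [7, 4],
    "Alternative 2": [2, 8],
    "Alternative 3": [5, 5],
}
weights = [0.5, 0.5]  # each weight in [0, 1], weights summing to 1

overall = {
    name: sum(w * e for w, e in zip(weights, evals))
    for name, evals in evaluations.items()
}
best = max(overall, key=overall.get)

print(overall)  # {'Alternative 1': 5.5, 'Alternative 2': 5.0, 'Alternative 3': 5.0}
print(best)     # Alternative 1
```

Note that if several alternatives shared the highest overall evaluation, `max` would silently return the first one encountered; this mirrors the complication mentioned above, namely that the standard formulation offers no normative way to choose among equally evaluated alternatives.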
The set of available alternatives A is therefore searched for the most suitable alternative to “fill the void”. There are, however, many cases in which a current (“initial”) solution is available. In this paper we argue that this “currently chosen alternative” does play a specific role in the decision-making problem, because it represents an important benchmark. It constitutes a natural reference alternative with which all the others are being compared. This new formulation of the decision-making situation has the following implications: 1. It is a generalization of the standard formulation of the decision-making situation. As long as the currently chosen alternative (the benchmark) is treated in the same way as all the alternatives in A, we have the standard decision-making setting. If

A New Framework for Multiple-Criteria Decision …

41

there is currently no benchmark alternative, we can consider the “absence of an alternative” to be the benchmark. 2. It allows for distinguishing between the elements of A and the benchmark alternative and thus for the reflection of specific behavioral (decision) patterns that cannot be easily modelled in the standard framework. 3. The new formulation represents a wider family of decision-making situations than the standard one; that is, it allows for the reflection of the fact that some solution (regardless of how good or bad it is) may currently be in place. For the purposes of this paper, we consider the selection of the best (most suitable) alternative to be the goal of the decision-making. We therefore do not directly address problems that, for instance, aim at deciding on the (in)acceptability of the alternatives, their ordering, and so forth.

3 Definition of the New MCDM Problem Addressed in This Paper—Decision Making Under the Existence of a Baseline

As we have already pointed out, in many real-life situations the assumption of the nonexistence of a current solution is incorrect. Consider e.g. the selection of a new mobile phone, decisions on buying a new car, decisions on which flat/house to move to from the current accommodation, searching for a new job, changing a bank, an insurance company, a food store, or any other service, relationship issues and the search for an “ideal partner” etc. Considering the usual understanding of rationality in decision-making, i.e. the conscious search for the alternative/solution that maximizes utility (or optimizes the value of any other overall criterion/value relevant for the given decision-maker), the decision-making problem does not end with the selection of one of the alternatives. In essence, the satisfaction of the decision-maker with the new solution should be validated and the environment should be continuously (or periodically) monitored for new alternatives that would provide better utility/satisfaction (or whatever the relevant overall criterion/goal is). That is, if we want to remain rational as decision-makers. This is where the approach proposed in this paper might come in handy. Not necessarily as a normative tool, but as a descriptive framework capable of shedding light on some of the phenomena encountered in real-life decision-making and not covered by standard decision-making models. We therefore assume that either the first alternative was chosen based on some rational assessment of the available alternatives, or that simply the absence of a solution can already be evaluated as a baseline. In any case we assume that the currently-in-place solution (or the absence thereof) can be evaluated with respect to all the chosen criteria.

42

J. Stoklasa and M. Kozlova

3.1 The Basic Difference Between the Standard Formulation of the MCDM Problem and the Newly Proposed Generalized Formulation

Situations where a solution is currently in place have one feature in common. Their set of alternatives is {B, A_1, …, A_n}, i.e. there is a baseline B representing the currently-in-place solution and a set of n alternative solutions A = {A_1, …, A_n} that are being considered. The nature of the decision-making problem formulated this way is now significantly different from the standard decision-making problem described in the previous section. The existence of a standing solution B introduces one more step into the decision-making problem. This step is the decision whether to exchange the current alternative B for any of the A_1, …, A_n. The subsequent selection of the best alternative to replace B is performed as a second step, as long as B is to be replaced. It also gives a reason to question the standard equal treatment of all the alternatives, because the baseline B clearly has a specific function in the decision-making situation, providing a benchmark and also a set of initial values of all the criteria that are currently available and that will be maintained as long as the current alternative B is not exchanged for another one. Note that the existence of a baseline solution (or considering the absence of a solution, e.g. a partner, as an alternative) opens one more possibility for solving the decision-making problem. This possibility is maintaining the status quo. Inaction, i.e. the absence of the selection of a new alternative, is not considered an option in the standard decision-making framework, particularly as long as relative-type evaluation models are used. Also, because B and the alternatives from the set A are being clearly distinguished, B can be treated differently than A_1, …, A_n—i.e.
the evaluation of B and even the weights assigned to the criteria for the assessment of B can differ from the weights of criteria used to assess A_1, …, A_n. As B is considered to be the current solution, it (i.e. the values of the criteria corresponding to B) can be used as a benchmark to define the weights of criteria or to modify the weights derived by some standard method in accordance with the baseline B. Given the fact that the weights of criteria might be significantly affected by the decision-maker’s experience, the current solution B might be playing a crucial role in the assessment of the importance (weights) of criteria. B can be considered to be a cognitive anchor or the most recent experience, which gives a basis for the anchoring effect (see e.g. [15, 34]), recency bias [13] or other distortions of the decision-making [19, 30]. Under the assumption that the baseline might influence the weights of criteria, new patterns such as “constant switching” between several alternatives might appear in our models. The new formulation of the MCDM problem proposed in this paper therefore offers a tool for the representation of decision-making problems that: 1. allows for inaction, i.e. for the maintaining of the status quo; 2. assumes that the selection of a new solution/alternative might not necessarily be the end of the decision-making situation; 3. allows for “second thoughts”, by allowing the initial baseline to become one of the considered alternatives in the next steps of the decision-making situation (you


can go back to not having a partner, you can get back with your previous partner, you can decide to be unemployed again). The assumption that a baseline exists at the beginning of the decision-making process is what allows us to make these generalizations. It is exactly the role of the baseline that we are going to explore in this paper, particularly its possible influence on the weights of criteria. Throughout the paper, we will consider selecting the best alternative from the set {B, A_1, …, A_n} to be the goal, under the assumption that B occupies the currently-in-place solution slot. We will also investigate the stability of the obtained solution—we will discuss under what circumstances constant switching of alternatives might occur and when it is possible to expect reluctance to switch to a better alternative.

3.2 Assumptions and Notation for the Newly Proposed MCDM Problem Formulation

We consider n alternatives A_1, …, A_n plus a baseline B, which are to be evaluated based on m criteria C_1, …, C_m. The evaluation of the alternative A_i with respect to the criterion C_j is provided as the value e_{A_i}^{C_j} ∈ [e_min, e_max] for all i = 1, …, n and j = 1, …, m, and the evaluation of the baseline B with respect to the criterion C_j is provided as e_B^{C_j} ∈ [e_min, e_max] for all j = 1, …, m. Without any loss of generality we will assume that the evaluation universe is [e_min, e_max] = [0, 10] for all the criteria and that all criteria are benefit-type. We will also assume a commonly used method for the aggregation of the evaluations of an alternative with respect to all the criteria—the weighted sum. Other aggregation methods can obviously be considered as well; we will, however, stick with the weighted sum in this paper for its relative simplicity and for the ease of explanation of the effects that manifest themselves. Another assumption we make is that once the baseline is replaced by one of the alternatives, say A_x, then A_x becomes the new baseline B and the old baseline is incorporated into the set A of alternative solutions. This way, the decision-maker can come back to the previous solution if he/she is not satisfied with the current baseline. If the initial baseline is included back in the set of alternatives A, then, after the recalculation of the weights of criteria, it may become the most desirable once again. This way, loops may appear in the decision-making situation, in which the decision-maker constantly switches between two (or more) alternatives. The acceptance of a new solution (i.e. the acceptance of a new baseline) can thus increase the attractiveness of other alternatives.
Note that the formulation of the decision-making problem introduced in this paper does not rule out the possibility of having weights of criteria (fixed for the decision-making problem) defined by the decision-maker. If we assumed that these m fixed weights of criteria u_1, …, u_m were defined by the decision-maker, then the weights defined further by (2), (5), (7) and (8) could play the role of modification coefficients of these fixed decision-maker-expressed weights of criteria. For the transparency of


explanation in this paper, we will assume that the decision-maker treats each criterion as equally important or, in other words, does not specify any fixed weights.
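Under the stated assumptions ([0, 10] benefit-type universe, weighted-sum aggregation), the notation of this subsection can be mirrored in a short sketch (our helper names, not the paper's):

```python
# Sketch of the evaluation setting of Sect. 3.2: criteria evaluations live on
# the benefit-type universe [e_min, e_max] = [0, 10]; an alternative is a list
# of its evaluations e_X^{C_j}; aggregation is a weighted sum.

E_MIN, E_MAX = 0.0, 10.0

def check_universe(evals):
    # every evaluation must fall inside [e_min, e_max]
    if not all(E_MIN <= e <= E_MAX for e in evals):
        raise ValueError("evaluation outside the universe [0, 10]")
    return evals

def weighted_sum(evals, weights):
    # e_X = sum_j w_j * e_X^{C_j}; the weights need not sum to 1
    return sum(w * e for w, e in zip(weights, check_universe(evals)))

baseline = [7, 4]  # e_B^{C_1} = 7, e_B^{C_2} = 4
print(weighted_sum(baseline, [0.5, 0.5]))  # 5.5 with equal weights
```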

4 Decision-Making Styles Available with the New MCDM Problem Formulation

In this paper we take the perspective of the decision-maker with a baseline as the departure point. In other words, we consider the baseline to be the reference point based on which the weights of criteria can be determined. This way, the following methods fall into the family of methods that determine the weights of criteria based on the evaluations of alternatives with respect to the criteria. We, however, do not use all the alternatives, but simply the baseline as a benchmark. Because it embodies the current (i.e. previously accepted or at least currently tolerated) solution, it might provide a reasonable starting point for the determination of the weights of criteria. The new MCDM problem formulation proposed in the previous section enables the introduction of several decision-making styles. Two of these styles, in essence, represent extreme styles of decision-making: the “appreciating the possessed” style and the “craving the unavailable” style. While decision-making results of pure underlying styles appear to be somewhat rational, combining these two styles allows a number of behavioral effects to emerge, such as undervaluation of clearly better alternatives when the satisfaction with the baseline is high or overvaluation of clearly worse alternatives when the satisfaction with the baseline is low; or the relative change in evaluation of the alternatives when a new baseline is selected, which can result in a continuous need for switching the baseline (up to the point when two alternatives alternate as a baseline indefinitely). We argue here that in many realistic decision problems, no matter how many criteria are chosen and how precisely they are estimated, there is some perceived uncertainty for the decision-maker with respect to the alternatives.
This uncertainty might be explained by a perceived possibility that there is another criterion (or several criteria), not detected before, that might become significant or even decisive with a new alternative. Such an uncertainty prevents saturated decision-makers from switching to slightly better alternatives for fear of something bad they have not accounted for and, vice versa, motivates unsaturated decision-makers to accept slightly worse alternatives in the hope that there is something good they have not accounted for. This phenomenon is illustrated and discussed in more detail later in this paper in Sect. 5.2. For the simplicity of explanation, we assume linear relationships between the evaluations of the baseline with respect to the criteria and the criteria weights—both for the baseline and for all the alternatives in the set A. Let us now consider the two decision-making styles mentioned above and the respective criteria-weights determinations, each stemming from a different idea and applicable to a different type of decision-maker. First we will introduce and analyze each style separately


and finally we will do the same for their combination. For simplicity we will assume that no fixed weights of criteria are specified, i.e. all the criteria are considered equally important (before the effect of the baseline is taken into account). It is also possible to reflect fixed nonuniform weights of criteria. However, because we aim at explaining the possible effect of the baseline, we do not use fixed criteria weights (independent of the alternatives) further in the paper.

4.1 “Appreciating Possessed” Decision-Making Style

First, let us consider a “playing it safe” approach to decision-making. This careful approach can be applied by decision-makers who value what they have and do not want to compromise currently saturated criteria. They therefore assign weights of criteria proportional to the current satisfaction of the criteria by the baseline. In other words, criteria with high satisfaction under B (high values of the evaluation on the assumed benefit-type scale) are assigned high weights and criteria with low satisfaction are assigned lower weights. This ensures that a possible new alternative replacing B as the new baseline saturates the currently highly evaluated criteria sufficiently. This approach helps to avoid large drops in the satisfaction of any criterion, while it neglects the criteria poorly saturated by the baseline to some extent. Using the language of standard decision-making theory, this approach might be close to risk aversion. It could be explained by habituation—the decision-makers know the current baseline, they have some experience with it, they may even have some emotional attachments. They may also be used to the possibly low level of satisfaction of some of the criteria. In other words, they have learned to live with the current alternative and are willing to exchange it only for an alternative that does not make the currently saturated criteria significantly less saturated (as the decision-maker is used to the current level of satisfaction). Under these circumstances the decision-maker might not much appreciate an increase of satisfaction in some of the currently low-saturated criteria, as he/she has grown used to the low level of satisfaction. The weights of the criteria under this decision-making style, v_j^{AP}, are determined in the following way, for all j = 1, …, m:

v_j^{AP} = e_B^{C_j} − e_min = e_B^{C_j},    (1)

which after normalization gives us:

w_j^{AP} = v_j^{AP} / (e_max − e_min).    (2)

In this case the weights of criteria are positively proportional to the evaluations of the baseline with respect to the given criterion. This effectively introduces a penalty for the loss in the values of highly saturated criteria (under the baseline). We do not


Table 2 MCDM problem under the formulation newly proposed in this paper. The initial baseline is Alternative 1; three other alternatives and two criteria are considered. The weights of criteria w_j^{AP} are calculated based on the evaluation of the baseline. The “appreciating possessed” decision-making style is used. The weighted sum (3) is used for the calculation of the final evaluations

                      Baseline   Criterion 1   Criterion 2   Overall evaluation   Choice
Alternative 1         ×          7             4             6.5                  ×
Alternative 2                    4             7             5.6
Alternative 3                    6             5             6.2
Alternative 4                    7             4             6.5
Weights w_j^{AP}                 0.7           0.4

propose to normalize the weights in the usual way (i.e. so that the sum of the weights is 1) in order to preserve the information on the actual (un)satisfaction level of each criterion under the baseline. This way the weights can be interpreted as criterion fulfilment rates by the baseline B. The evaluation of each alternative X ∈ {B, A_1, …, A_n} is then computed using the weighted sum:

e_X = Σ_{j=1}^{m} ( w_j^{AP} · e_X^{C_j} ).    (3)

The following example (Table 2) illustrates the weight formation under the “appreciating possessed” decision-making style and the respective choice of the best alternative. In Table 2, Alternative 1 is assumed to be the baseline, with evaluations of 7 and 4 with respect to the first and second criteria respectively on the [0, 10] benefit-type scale. The corresponding perceived weights of criteria w_j^{AP}, as computed by (2), are therefore 0.7 and 0.4 for C_1 and C_2 respectively. Alternative 2, which represents an opposite solution with a higher evaluation for the second criterion and a lower one for the first, yields a lower overall evaluation than the baseline. Alternative 2 is evaluated lower than Alternative 1 because the increase in the evaluation w.r.t. the originally low-saturated criterion C_2 cannot offset the drop in the originally highly saturated criterion C_1. Alternative 4 is evaluated the same as Alternative 1, which is no surprise given their identical evaluations with respect to all the criteria. Using this decision-making style, the weights are determined by the satisfaction of the criteria by the baseline. Hence, quite frequently the best choice might be to keep the baseline (mainly if the satisfaction with the baseline is already high); in other words, the decision-maker prefers to maintain the status quo unless an alternative with a higher evaluation for the criteria highly saturated by the current baseline is presented.
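A small sketch of the “appreciating possessed” weights (1)-(2) and the weighted sum (3), applied to the data of Table 2 (illustrative code, not the authors'; the function names are ours):

```python
# "Appreciating possessed": weights are proportional to the baseline's
# satisfaction of each criterion, Eqs. (1)-(2); evaluation is the weighted
# sum of Eq. (3). Evaluation universe: [0, 10].

E_MIN, E_MAX = 0.0, 10.0

def ap_weights(baseline):
    # w_j^AP = (e_B^{C_j} - e_min) / (e_max - e_min)
    return [(e - E_MIN) / (E_MAX - E_MIN) for e in baseline]

def evaluate(evals, weights):
    return sum(w * e for w, e in zip(weights, evals))

baseline = [7, 4]            # Alternative 1 of Table 2 plays the baseline
w_ap = ap_weights(baseline)  # [0.7, 0.4]
for name, evals in [("A1 (B)", [7, 4]), ("A2", [4, 7]),
                    ("A3", [6, 5]), ("A4", [7, 4])]:
    print(name, round(evaluate(evals, w_ap), 2))
```

This reproduces the overall evaluations 6.5, 5.6, 6.2 and 6.5 of Table 2; the baseline keeps the top position (tied with the identical Alternative 4).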


Table 3 MCDM problem under the formulation newly proposed in this paper. The initial baseline is Alternative 1; three other alternatives and two criteria are considered. The weights of criteria w_j^{CU} are calculated based on the evaluation of the baseline. The “craving unavailable” decision-making style is used. The weighted sum (6) is used for the calculation of the final evaluations

                      Baseline   Criterion 1   Criterion 2   Overall evaluation   Choice
Alternative 1         ×          7             4             4.5
Alternative 2                    4             7             5.4                  ×
Alternative 3                    6             5             4.8
Alternative 4                    7             4             4.5
Weights w_j^{CU}                 0.3           0.6

4.2 “Craving Unavailable” Decision-Making Style

It is also possible to imagine a different approach to the evaluation of the importance of criteria with respect to the given baseline—an approach driven more by deprivation. In cases where some criteria are currently not satisfied to a desired level by the baseline, the craving for the satisfaction of these criteria can result in the following weighting, for all j = 1, …, m:

v_j^{CU} = e_max − e_B^{C_j} = 10 − e_B^{C_j},    (4)

which after normalization transforms into:

w_j^{CU} = v_j^{CU} / (e_max − e_min).    (5)

In this case the weights are high for those criteria that are not well saturated by the baseline and low for highly saturated criteria. If the baseline is evaluated e_max under some criterion, the weight of this criterion becomes zero for the decision-making task. This approach embodies (partial) “blindness” to the needs that are well satisfied, and it emphasizes “craving” for satisfaction in those criteria that are currently not well saturated. The evaluation of each alternative X ∈ {B, A_1, …, A_n} is then computed using the weighted sum:

e_X = Σ_{j=1}^{m} ( w_j^{CU} · e_X^{C_j} ).    (6)

The example in Table 3 illustrates the weight formation under the “craving unavailable” style and its effect on the result of the decision-making process. In contrast to the “appreciating possessed” style, the “craving unavailable” style makes the alternatives from A the more attractive the more they saturate the criteria that are not sufficiently saturated by the baseline.
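The “craving unavailable” weights (4)-(5) and the aggregation (6) can be sketched analogously (our illustration, using the data of Table 3):

```python
# "Craving unavailable": weights are proportional to the baseline's
# UNsatisfaction of each criterion, Eqs. (4)-(5); aggregation is Eq. (6).

E_MIN, E_MAX = 0.0, 10.0

def cu_weights(baseline):
    # w_j^CU = (e_max - e_B^{C_j}) / (e_max - e_min)
    return [(E_MAX - e) / (E_MAX - E_MIN) for e in baseline]

def evaluate(evals, weights):
    return sum(w * e for w, e in zip(weights, evals))

baseline = [7, 4]            # Alternative 1 of Table 3 is the baseline
w_cu = cu_weights(baseline)  # [0.3, 0.6]
for name, evals in [("A1 (B)", [7, 4]), ("A2", [4, 7]),
                    ("A3", [6, 5]), ("A4", [7, 4])]:
    print(name, round(evaluate(evals, w_cu), 2))
```

This reproduces the overall evaluations 4.5, 5.4, 4.8 and 4.5 of Table 3; Alternative 2 now wins because it saturates the criterion the baseline leaves unsatisfied.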


At this point we would like to stress that the decision-making styles we are discussing in this section of the paper are by no means normative—at least not in a way that would suggest the decisions reached based on them are rational or correct. We aim to show that the new formulation of the decision-making problem proposed here allows for the explanation of some decision-making patterns that could not be easily reflected using the standard approaches. The idea of the determination of the weights of criteria based on the baseline (even though it is discussed here in its “primitive” linear form) might not be that far-fetched. The baseline is the first benchmark that comes to mind; the longer we keep it, the better we know it and the more it possibly influences the perceived importance of the criteria for the decision-maker.

4.3 A Combined Style—“Appreciating Possessed” in the Baseline and “Craving Unavailable” in Alternatives

It is also possible to think of a combination of the two previously mentioned decision-making styles under the existence of a baseline. It is easy to imagine that the baseline is evaluated in a different way (e.g. using a different set of weights of criteria or even different criteria) than the other alternatives. The reason for this special treatment might lie in the decision-makers’ experience with the baseline. The baseline is potentially known better than the alternatives: its strengths are well known, its weaknesses are accepted. Because it is a standing baseline, we can even assume that some of its negative aspects (represented by low evaluations in some of the criteria) might not even be noticed by the decision-maker any more. We can attribute this to such phenomena as selective evaluation [24], existing rationalizations of the weaker aspects of the baseline [3], as well as to habituation (getting used to the baseline; see e.g. [25]). In essence, we might be more forgiving of the “negative aspects” of the current baseline and more demanding of the alternatives. We could formulate the motto of this approach as “If I am to change my current solution, let it be worth the change!”. This is exactly what we obtain if we combine the previously introduced decision-making styles in one decision-making problem. A change of the baseline for another alternative is connected with some perceived level of uncertainty—we might not have enough experience with the new solution, we might find some “skeletons in the closet” etc. Hence appreciating what is currently available (in terms of high satisfaction of criteria) seems to be a viable strategy for the assessment of the baseline—the uncertainty of the transition to a “new” alternative frequently introduces some cost of switching, representing the reluctance to risk the loss of high satisfaction in the criteria. In other words, sometimes even a slightly better alternative might not be chosen over the current one, for instance because of the lack of experience with this new one. The possibility of uncovering some hidden flaws and the need for getting used to a new baseline could thus outweigh the slight increase in the evaluation of the new alternative. This would be the case of baselines which are “satisfactory”, i.e. which saturate all (or most of) the criteria


well. If the baseline is already evaluated low in most or all of the criteria, however, the chance to switch could even be considered beneficial—the opportunity to get something different can even offset a slight loss in some of the criteria with the newly chosen alternative, introducing a benefit-of-switching effect into the decision-making situation. Even a slightly inferior alternative can thus be accepted simply for the sake of “getting rid” of the current baseline which is not saturating the criteria sufficiently. Because the experience of the decision-maker with the alternatives A_1, …, A_n might be much lower than with the baseline, the knowledge of the actual parameters of the alternatives might be limited. It is possible that the positive aspects of the alternatives are stressed and advertised much more than the negative ones. In product and consumer-item replacement decisions, this can often be due to advertising or to the fact that manufacturers simply do not wish to disclose all the slightly negative aspects of their product. Now considering that, for example, advertising is often focused particularly on showing that some currently not well saturated need can be satisfied better, or that in relationships people usually really try to be attractive and to provide what the other person seems to need or long for, it seems reasonable to expect that the satisfaction of currently not well saturated needs plays a significant role in the choice of a new alternative. We thus assume here that the weights for the alternatives A_1, …, A_n are driven by the craving for the currently unavailable. Again, we are not claiming that this model is normative in terms of the correctness of the solution (whatever that might be); we are simply offering means to model situations which exist around us and can be well documented from our experience.
This can all be reflected by w_j^{AP} being used for the assessment of the baseline B and w_j^{CU} being used for the assessment of A_1, …, A_n. Thus, for the combined effect we define, for all j = 1, …, m:

w_j^{Comb} = v_j^{AP} / (e_max − e_min)   for the baseline B,
w_j^{Comb} = v_j^{CU} / (e_max − e_min)   for the other alternatives A_1, …, A_n.    (7)

The weights w_j^{Comb} represent the level of satisfaction of the criterion C_j under the baseline if used for the evaluation of the baseline, and the level of unsatisfaction of the criterion C_j under the baseline if used for the evaluation of the other alternatives. In our particular case with the normalized evaluation scale [0, 10], we therefore have:

w_j^{Comb} = v_j^{AP} / 10   for the baseline B,
w_j^{Comb} = v_j^{CU} / 10   for the other alternatives A_1, …, A_n.    (8)

The evaluation of each alternative X ∈ {B, A_1, …, A_n} is then computed using the weighted sum:

e_X = Σ_{j=1}^{m} ( w_j^{Comb} · e_X^{C_j} ).    (9)


Let us now take a closer look at the effects we are able to model and explain using the above-mentioned combined framework.
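A sketch of the combined style of Eqs. (7)-(9) on the [0, 10] scale (illustrative code; the function names are ours, and the data are those of the running example with baseline (7, 4)):

```python
# Combined style, Eq. (8): "appreciating possessed" weights for the baseline,
# "craving unavailable" weights for the alternatives; aggregation is Eq. (9).

def combined_evaluations(baseline, alternatives):
    w_ap = [e / 10 for e in baseline]          # weights used for B only
    w_cu = [(10 - e) / 10 for e in baseline]   # weights used for A_1..A_n
    e_b = sum(w * e for w, e in zip(w_ap, baseline))
    e_alts = [sum(w * e for w, e in zip(w_cu, a)) for a in alternatives]
    return e_b, e_alts

e_b, e_alts = combined_evaluations([7, 4], [[4, 7], [6, 5], [7, 4]])
print(round(e_b, 2), [round(e, 2) for e in e_alts])
```

With the baseline (7, 4) this gives 6.5 for the baseline and 5.4, 4.8, 4.5 for the alternatives; note that the alternative identical to the baseline now scores only 4.5.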

5 Behavioral Effects in the New MCDM Problem Formulation When the Combined Style of Criteria Weights Determination is Used

Let us now have a closer look at the behavioral implications of the combined style of weights formation in terms of specific effects that can be modelled using this combined style. The following sections will each consider a single behavioral pattern/effect and discuss how it can be reflected using the combined style of weights formation represented by (8). Even though the formulas presented above assume the [0, 10] benefit-type point scale and also linearity in the computation of the weights, the discussed effects can already be approximated by the suggested (simple) formulas. Obviously, in behavioral research, the formulas for weights formation can be tuned and non-linearity can be introduced into them as needed to fit the given purpose better. Here, for the simplicity of explanation, we have decided to keep the linear formulation.

5.1 Cost of Switching Effect

In both pure styles, “appreciating possessed” and “craving unavailable”, an alternative with the same evaluations with respect to the criteria as the baseline would have the same overall evaluation as the baseline. In the combined version, a behavioral effect of cost of switching appears (see Table 4). Indeed, why would anyone consider switching to a seemingly identical alternative, especially taking into account the possibility that something the decision-maker is not aware of could be discovered after the adoption of the new baseline? This logic explains why, for example, nobody having a car (and being reasonably satisfied with it) would exchange it for exactly the same car with the same mileage. Let us remark here that if the initial baseline is not saturating the criteria sufficiently, the cost of switching to an alternative with identical values of the criteria might be negative—turning the cost of switching into a benefit of switching (see e.g. sub-tables C and D in Table 5). The cost of switching creates a strong attachment to the baseline and a reluctance to switch, as will be shown later, even to slightly better alternatives. Such phenomena as lifelong adherence to one particular brand in marketing or a lifelong partner in relationships can be illustrations of this effect. It is not the best that makes us settle down, but the good enough (as long as it is a standing solution).
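The cost (or benefit) of switching to an identical alternative under the combined style can be computed directly; a hedged sketch (our names; the data match the examples discussed in this section):

```python
# Cost of switching under the combined style (8): the difference between the
# baseline's evaluation (AP weights) and the evaluation of an alternative with
# identical criteria values (CU weights). Positive -> cost, negative -> benefit.

def switching_cost(baseline):
    w_ap = [e / 10 for e in baseline]
    w_cu = [(10 - e) / 10 for e in baseline]
    e_b = sum(w * e for w, e in zip(w_ap, baseline))
    e_clone = sum(w * e for w, e in zip(w_cu, baseline))
    return e_b - e_clone

print(round(switching_cost([7, 4]), 2))  # saturating baseline: positive cost
print(round(switching_cost([2, 3]), 2))  # unsaturating baseline: negative cost
```

For the saturating baseline (7, 4) the cost is 2.0; for the unsaturating baseline (2, 3) it is −2.4, i.e. a benefit of switching.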


Table 4 MCDM problem under the formulation newly proposed in this paper. The initial baseline is Alternative 1; three other alternatives and two criteria are considered. The weights of criteria w_j^{AP} and w_j^{CU} are calculated based on the evaluation of the baseline. The combined decision-making style is used. The weighted sum (9) is used for the calculation of the final evaluations

                                   Baseline   Criterion 1   Criterion 2   Overall evaluation   Choice
Alternative 1                      ×          7             4             6.5                  ×
Alternative 2                                 4             7             5.4
Alternative 3                                 6             5             4.8
Alternative 4                                 7             4             4.5
Weights w_j^{AP} for B                        0.7           0.4
Weights w_j^{CU} for A_1, …, A_n              0.3           0.6

Table 5 The (un)satisfaction effect on examples of decision-making problems with a baseline and two alternatives under two criteria. High satisfaction (A) results in unwillingness to switch to a better alternative; low satisfaction (B) results in the preference of a slightly better alternative over the baseline. In both satisfaction cases (S > 0, A&B), the alternative with values of criteria identical to the baseline is not preferred over the baseline. High unsatisfaction (C) results in the willingness to accept an alternative that is slightly worse than the baseline; low unsatisfaction (D) results in not accepting a worse alternative over the baseline. In both unsatisfaction cases (S ≤ 0, C&D), the alternative with values of criteria identical with the baseline is preferred over the baseline

A. Satisfaction by baseline: 50%
                                   Baseline   Criterion 1   Criterion 2   Overall evaluation   Choice
Alternative 1                      ×          8             7             11.3                 ×
Alternative 2                                 10            8             4.4
Alternative 3                                 8             7             3.7
Weights w_j^{AP} for B                        0.8           0.7
Weights w_j^{CU} for A_1, …, A_n              0.2           0.3

B. Satisfaction by baseline: 10%
                                   Baseline   Criterion 1   Criterion 2   Overall evaluation   Choice
Alternative 1                      ×          5             6             6.1
Alternative 2                                 7             8             6.7                  ×
Alternative 3                                 5             6             4.9
Weights w_j^{AP} for B                        0.5           0.6
Weights w_j^{CU} for A_1, …, A_n              0.5           0.4

C. Satisfaction by baseline: −50%
                                   Baseline   Criterion 1   Criterion 2   Overall evaluation   Choice
Alternative 1                      ×          2             3             1.3
Alternative 2                                 1             2             2.2
Alternative 3                                 2             3             3.7                  ×
Weights w_j^{AP} for B                        0.2           0.3
Weights w_j^{CU} for A_1, …, A_n              0.8           0.7

D. Satisfaction by baseline: −10%
                                   Baseline   Criterion 1   Criterion 2   Overall evaluation   Choice
Alternative 1                      ×          4             5             4.1
Alternative 2                                 3             4             3.8
Alternative 3                                 4             5             4.9                  ×
Weights w_j^{AP} for B                        0.4           0.5
Weights w_j^{CU} for A_1, …, A_n              0.6           0.5

5.2 (Un)Satisfaction Effect

The two styles combined also produce an (un)satisfaction effect. The evaluation of alternatives and the propensity to switch from the baseline to another alternative are affected by the overall satisfaction of the criteria by the baseline. This overall satisfaction, denoted S, can be defined using (10). The satisfaction level S lies in the [−1, 1] interval, where −1 represents complete unsatisfaction (i.e. all the criteria have e_min values for the baseline), 1 represents complete satisfaction (i.e. all the criteria have e_max values for the baseline), and values close to 0 represent a mixed case where either all the evaluations of the baseline are close to the middle of the evaluation scale, or some are high while some are low.

52

J. Stoklasa and M. Kozlova



The normalization in (10) by m(e_max − e_min)/2 ensures that the resulting satisfaction level S lies within the [−1, 1] interval.

S = [Σ_{j=1}^{m} (e_B^{C_j} − (e_max − e_min)/2)] / [m(e_max − e_min)/2]    (10)
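Equation (10) can be sketched in a few lines (our own illustration); the scale [e_min, e_max] = [0, 10] and the middle-point expression (e_max − e_min)/2 follow the chapter's notation:

```python
def satisfaction(baseline_evals, e_min=0.0, e_max=10.0):
    """Satisfaction level S of (10): signed deviations of the baseline's
    evaluations from the scale's middle point, normalized into [-1, 1].
    The middle point (e_max - e_min)/2 follows the chapter's notation."""
    m = len(baseline_evals)
    mid = (e_max - e_min) / 2
    return sum(e - mid for e in baseline_evals) / (m * mid)

print(satisfaction([8, 7]))   # 0.5  (case A of Table 5)
print(satisfaction([2, 3]))   # -0.5 (case C of Table 5)
```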

If the baseline is evaluated better than the middle point of the evaluation scale, (e_max − e_min)/2, with respect to all the criteria, we can consider it to be saturating (S > 0). As such, the current solution represented by the baseline might be framed as a "good standing solution" (an analogy to the gain-framing in prospect theory; see e.g. [17, 18]). Analogously, an unsaturating baseline (S ≤ 0) might be framed as a "bad standing solution" (an analogy to the loss-framing in prospect theory). We are not claiming that there is a direct link between the combined style (incl. satisfaction) and prospect theory. Given the findings of Kahneman and Tversky, however, it seems plausible to expect that decision-making might differ under a seemingly good and a seemingly bad baseline. If a seemingly good baseline is considered, then exchanging a "sufficiently good baseline" (i.e. a baseline which is evaluated well with respect to all the criteria) for a comparable or even a slightly better alternative might not be "worth the trouble" for the decision-maker. This would be analogous to risk aversion in the domain of gains. On the other hand, in the case of a "bad baseline" (i.e. a baseline for which all criteria are evaluated low), the decision-maker might be willing to accept a similar, or even a slightly worse, alternative instead of the baseline just for the sake of getting rid of the bad baseline (a distant analogy to the risk-seeking behavior in the domain of losses). Even though this might be a substantial simplification, it does not go against intuition. The effect of the level of (un)satisfaction on the willingness to switch to a better or worse alternative is summarized in Table 5. High levels of satisfaction do not motivate switching to slightly better alternatives; alternatives identical with the baseline (in terms of evaluations with respect to all the criteria) are not preferred to the baseline.
The reluctance to switch the baseline (8, 7) for Alternative 2 with the evaluations (10, 8), i.e. for a better alternative (Table 5 A), can be explained by the just noticeable difference concept [7]. The Weber-Fechner law of response to stimuli implies that the magnitude of difference noticeable by a subject depends on the original level of the stimulus: the higher the level of the stimulus, the larger the noticeable difference has to be. Even though the original law was derived in the context of psychophysics (i.e. it considered physical stimuli), it might be applicable also in real-life evaluation. This could mean that if the satisfaction level of the baseline is too high, an increase in the evaluations of the criteria might not be high enough to be noticeable by the decision-maker. Hence even a better alternative (i.e. one dominating the baseline in the standard understanding of dominance) might not be preferred to the baseline. Low levels of satisfaction (Table 5 B) motivate switching to an alternative that is better than the baseline with respect to all the criteria; alternatives identical with the baseline are still not preferred to the baseline. Low levels of unsatisfaction (Table 5 D) do not allow switching for an alternative that is slightly worse than the

A New Framework for Multiple-Criteria Decision …

53

Table 6 Analogies between the (un)satisfaction effect and the prospect theory [31] in the context of perceived uncertainty stemming from exchanging the baseline for another alternative.

Satisfaction (S ≥ 0):
A. High satisfaction (|S| ≥ 50%): Risk aversion. The baseline is a gain; uncertainty in adopting an alternative is perceived as a possible loss. The 'appreciating the possessed' pattern prevails; even a slightly better alternative is not preferred to the baseline.
B. Low satisfaction (|S| < 50%): Risk aversion to risk neutrality. The baseline is a gain; uncertainty in adopting an alternative is perceived as a possible loss. A similar alternative is not particularly attractive (positive cost of switching); however, a slightly better alternative becomes attractive.

Unsatisfaction (S < 0):
C. High unsatisfaction (|S| ≥ 50%): Risk seeking. The baseline is a loss; uncertainty in adopting an alternative is perceived as a possible gain. With high unsatisfaction even a worse alternative becomes favorable due to the hope that there is something better in the unknown (not yet adopted) alternative.
D. Low unsatisfaction (|S| < 50%): Risk seeking to risk neutrality. The baseline is a loss; uncertainty in adopting an alternative is perceived as a possible gain. A worse alternative is no longer attractive; however, a similar alternative becomes attractive (negative switching cost) due to the possibly good unknown.

baseline, yet alternatives identical with the baseline are preferred to the baseline. High levels of unsatisfaction (Table 5 C) allow for switching the baseline even for a slightly worse alternative (with respect to all the criteria); alternatives identical with the baseline are preferred to the baseline. Even though Table 5 only considers cases of satisfaction where the baseline is no worse than average in every criterion (e_B^{C_j} ≥ (e_max − e_min)/2 for j = 1, 2) and unsatisfaction where the baseline is no better than average in every criterion (e_B^{C_j} ≤ (e_max − e_min)/2 for j = 1, 2), the above-mentioned decision patterns hold also in general, based solely on the value of satisfaction (10). Table 6 presents the summary of the (un)satisfaction effect under the new MCDM problem formulation proposed in this paper; it unveils a fourfold pattern of risk attitudes, suggesting possible analogies with the findings of the prospect theory.
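The four cases of Table 5 can be replayed with a short sketch. The linear weight derivation w_j^AP = e_j/10 and w_j^CU = 1 − e_j/10 on a 0–10 scale is our reading of the examples, not a formula stated in this section:

```python
E_MAX = 10.0

def choose(baseline, alternatives):
    """Index of the winner under the combined style (0 = keep the baseline).
    Assumes the linear weight derivation w_AP = e/10, w_CU = 1 - e/10."""
    w_ap = [e / E_MAX for e in baseline]
    w_cu = [1.0 - w for w in w_ap]
    scores = [sum(w * e for w, e in zip(w_ap, baseline))]
    scores += [sum(w * e for w, e in zip(w_cu, a)) for a in alternatives]
    return scores.index(max(scores))

cases = {  # baseline, then Alternative 2 and Alternative 3 (identical copy)
    "A (S = +50%)": ((8, 7), [(10, 8), (8, 7)]),
    "B (S = +10%)": ((5, 6), [(7, 8), (5, 6)]),
    "C (S = -50%)": ((2, 3), [(1, 2), (2, 3)]),
    "D (S = -10%)": ((4, 5), [(3, 4), (4, 5)]),
}
for name, (b, alts) in cases.items():
    w = choose(b, alts)
    print(name, "baseline kept" if w == 0 else f"Alternative {w + 1} chosen")
```

Running this reproduces the choices marked in Table 5: the baseline survives only in case A, a better alternative wins in case B, and the identical alternative wins in cases C and D.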

5.3 Constant Switching Effect

When satisfaction with the baseline is negative, i.e. when unsatisfaction occurs, and the available alternatives do not dominate the baseline, a constant switching effect may occur between alternatives with opposite qualities. In the example summarized in Table 7, we consider the first three steps of such a decision-making process. The decision-making evolves in a simple manner: once a solution is selected (i.e. once the best alternative is chosen), it replaces the current baseline and becomes the new baseline, and another best solution is sought. Note that when a new baseline is established, the weights of criteria need to be recalculated in accordance with (7). This makes the decision-making in these circumstances possibly dynamic. As long as the set {B, A_1, …, A_n} of alternatives available for the decision-making does not change, we obtain:

• either a "stable solution", in which case one of the alternatives is considered to be the best one. This happens if the baseline is evaluated as the best alternative in any of the steps. This can occur e.g. when the alternatives are evaluated considerably


Table 7 The constant switching effect. Three subsequent steps of the decision-making problem are considered; the winner in each step constitutes the new baseline in the following step. The constant switching between Alternatives 1 and 3 is apparent.

Step 1 (Satisfaction by baseline: −30%)
                                Baseline  Criterion 1  Criterion 2  Overall evaluation  Choice
Alternative 1                   ×         4            3            2.5
Alternative 2                             3            3            3.9
Alternative 3                             3            4            4.6                 ×
Weights w_j^AP for B                      0.4          0.3
Weights w_j^CU for A_1, …, A_n            0.6          0.7

Step 2 (Satisfaction by baseline: −30%)
                                Baseline  Criterion 1  Criterion 2  Overall evaluation  Choice
Alternative 1                             4            3            4.6                 ×
Alternative 2                             3            3            3.9
Alternative 3                   ×         3            4            2.5
Weights w_j^AP for B                      0.3          0.4
Weights w_j^CU for A_1, …, A_n            0.7          0.6

Step 3 (Satisfaction by baseline: −30%)
                                Baseline  Criterion 1  Criterion 2  Overall evaluation  Choice
Alternative 1                   ×         4            3            2.5
Alternative 2                             3            3            3.9
Alternative 3                             3            4            4.6                 ×
Weights w_j^AP for B                      0.4          0.3
Weights w_j^CU for A_1, …, A_n            0.6          0.7

lower on all the criteria, or when the satisfaction is positive, which creates the cost of switching effect;
• or a "constant switching pattern", in which case the baseline is considered inferior to some other alternative in all of the steps, and hence the baseline changes with each step. Such a case is illustrated in Table 7.

The existence of constant switching, or "second thoughts" after a decision has been made, is likely traceable in the experience of the reader. Constant switching of interests, views, opinions, and desires is a known phenomenon. In educational science and psychology, people with multiple or constantly switching interests have recently been termed 'multipotentialites' [2, 33]. Other, older terms include 'polymath', 'generalist', and 'multipod'. The phenomenon is often attributed to the boredom arising when pursuing the same topic for a while. The switching phenomenon has been known for more than two millennia, dating back to Xenophon's Hellenica, where Theramenes


was called 'buskin' to point out his back-and-forth switching between democratic and oligarchic principles. Closer to home, at the authors' home university there are colleagues who have switched their careers back and forth between academia and industry several times. One of them told us that in academia he enjoys the freedom of theory building but misses the feeling of fulfilment that comes from applying knowledge to real-world problems. Changing careers turns the situation into its opposite, and after several years he decided to switch again. In other contexts, switching behavior has given birth to such terms as 'ambivalent', 'vacillator', and 'flip-flopper'. This illustrates that the changing of opinions, and even of the perceived (relative) importance of criteria, is a natural part of decision-making. Here we offer a model capable of representing such behavior; it might be very simple, but at the same time it offers a plausible explanation of the origin of "constant switching".
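The constant-switching dynamics of Table 7 can be simulated in a few lines (our own sketch, assuming the linear weight derivation w_j^AP = e_j/10 and w_j^CU = 1 − e_j/10 on a 0–10 scale, which matches the table's numbers):

```python
E_MAX = 10.0

def step(baseline_idx, alternatives):
    """One decision step: weights are recomputed from the current baseline
    (assumed linear styles w_AP = e/10, w_CU = 1 - e/10), and the winner
    becomes the baseline of the next step."""
    base = alternatives[baseline_idx]
    w_ap = [e / E_MAX for e in base]
    w_cu = [1.0 - w for w in w_ap]
    scores = [sum(wa * e for wa, e in zip(w_ap, a)) if i == baseline_idx
              else sum(wc * e for wc, e in zip(w_cu, a))
              for i, a in enumerate(alternatives)]
    return scores.index(max(scores))

alts = [(4, 3), (3, 3), (3, 4)]   # Alternatives 1-3 of Table 7
baseline = 0                      # start with Alternative 1 as the baseline
for s in range(1, 4):
    baseline = step(baseline, alts)
    print(f"Step {s}: Alternative {baseline + 1} wins and becomes the baseline")
# The winner oscillates between Alternative 3 and Alternative 1, as in Table 7.
```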

5.4 New Feature Effect

The "craving unavailable" logic of weights determination for the alternatives can have one more interesting consequence. The introduction of a new feature which the baseline does not possess can make an otherwise identical alternative (in terms of the evaluations of the criteria) very attractive, thus offsetting the cost of switching. The two examples in Table 8, A and B, illustrate that the introduction of a new criterion,

Table 8 Two examples of the effect of the introduction of a new feature (represented by Criterion 3) in the decision-making task. The baseline is assumed not to possess the new feature. Case A represents the introduction of the new feature under a baseline with low satisfaction; case B under a baseline with high satisfaction. Higher values of the newly introduced criterion are required for an alternative evaluated identically with the baseline with respect to the original two criteria to become preferred to the baseline. In the low-satisfaction case, even a low evaluation with respect to the newly introduced criterion results in the preference of Alternative 3 over the baseline.

A-1. Criterion 3 not considered; satisfaction by baseline: 10%
                                Criterion 1  Criterion 2  Criterion 3     Overall evaluation  Choice
Baseline                        5            6            not considered  6.1                 ×
Alternative 2                   6            7            not considered  5.8
Alternative 3                   5            6            not considered  4.9
Weights w_j^AP for B            0.5          0.6          –
Weights w_j^CU for A_1, …, A_n  0.5          0.4          –

A-2. Criterion 3 newly introduced; satisfaction by baseline: −27%
                                Criterion 1  Criterion 2  Criterion 3  Overall evaluation  Choice
Baseline                        5            6            0            6.1
Alternative 2                   6            7            0            5.8
Alternative 3                   5            6            2            6.9                 ×
Weights w_j^AP for B            0.5          0.6          0.0
Weights w_j^CU for A_1, …, A_n  0.5          0.4          1.0

B-1. Criterion 3 not considered; satisfaction by baseline: 50%
                                Criterion 1  Criterion 2  Criterion 3     Overall evaluation  Choice
Baseline                        8            7            not considered  11.3                ×
Alternative 2                   10           8            not considered  4.4
Alternative 3                   8            7            not considered  3.7
Weights w_j^AP for B            0.8          0.7          –
Weights w_j^CU for A_1, …, A_n  0.2          0.3          –

B-2. Criterion 3 newly introduced; satisfaction by baseline: 0%
                                Criterion 1  Criterion 2  Criterion 3  Overall evaluation  Choice
Baseline                        8            7            0            11.3
Alternative 2                   10           8            0            4.4
Alternative 3                   8            7            8            11.7                ×
Weights w_j^AP for B            0.8          0.7          0.0
Weights w_j^CU for A_1, …, A_n  0.2          0.3          1.0
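The flip of preference caused by the new criterion in case B of Table 8 can be verified numerically (a sketch of our own; the linear weight derivation w_j^AP = e_j/10, w_j^CU = 1 − e_j/10 is our assumption, consistent with the table's numbers):

```python
E_MAX = 10.0

def scores(baseline, alternatives):
    """Overall evaluations (baseline first) under the combined style,
    assuming the linear weight derivation w_AP = e/10, w_CU = 1 - e/10."""
    w_ap = [e / E_MAX for e in baseline]
    w_cu = [1.0 - w for w in w_ap]
    b = round(sum(w * e for w, e in zip(w_ap, baseline)), 1)
    return b, [round(sum(w * e for w, e in zip(w_cu, a)), 1)
               for a in alternatives]

# B-1: without Criterion 3, the highly satisfying baseline wins by far.
print(scores((8, 7), [(10, 8), (8, 7)]))           # (11.3, [4.4, 3.7])
# B-2: Criterion 3 introduced; the baseline lacks it (0) while Alternative 3
# offers it (8) and overtakes the baseline.
print(scores((8, 7, 0), [(10, 8, 0), (8, 7, 8)]))  # (11.3, [4.4, 11.7])
```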


Fig. 1 The popsicle hotline of the Magic Castle Hotel, LA, US (permission to share granted by the owner, the Magic Castle Hotel)

ceteris paribus, may induce leaving the baseline. Obviously, the introduction of a new criterion into the decision-making problem influences the level of satisfaction by the baseline. Under relatively high levels of satisfaction by the baseline, a high evaluation with respect to the new criterion is required for the alternatives to become attractive. It is therefore possible for an alternative originally evaluated similarly to the baseline (with respect to all the criteria) to become preferred to the baseline through the introduction of a new criterion. The lower the satisfaction by the baseline, the easier it is for the alternatives to become preferred over the baseline, even with relatively low evaluations of the newly introduced criterion. This effect offers one possible explanation of the functioning of advertising. A product or service can be made competitive by introducing a new feature that the competitors' product or service does not possess and convincing the customer of the value of this new feature. The new feature need not be a high-end innovation, a breakthrough, expensive, or require R&D. It can be as simple and cheap as, for example, the popsicle hotline at the Magic Castle Hotel, LA. Not intended to be a luxury hotel, the Magic Castle has for years been receiving the highest customer ratings. One reason is that, along with good enough accommodation and service, any customer can get free popsicles or other snacks on a silver plate by ringing the red phone near the pool (Fig. 1). While the popsicle hotline is a success among younger customers, adults enjoy the free unlimited laundry service. These simple tricks have made the hotel stand out among competitors, and the somewhat higher accommodation prices do not scare away clients. This and other cases of the power of simple but exciting features, as well as the underlying psychology, are elaborated in [11].


6 Conclusions

In this paper we have suggested a new formulation of the MCDM problem, which is a generalization of the standard formulation. Instead of considering the set of alternatives to be uniform (i.e. no alternative assumes a specific role or place at the beginning of the decision-making problem), we assume that there is one alternative which should be treated differently from the others. This alternative is called the baseline in this paper. We assume that the baseline is the alternative that currently occupies the "standing solution slot" for the decision-maker. If no such alternative exists, the absence of a solution (e.g. an alternative that is evaluated the worst with respect to all the criteria) can be considered to constitute the baseline. The baseline can be considered a benchmark for the determination of the weights of criteria. If we do not need or want to treat it differently from the other alternatives, then we obtain the standard formulation of the decision-making problem. If, however, we use the baseline as a reference point for the determination of the weights of criteria, several behavioral effects frequently present in real-life decision-making emerge. To our knowledge, this is the first generalized formulation of a decision-making problem explicitly dealing with a baseline (initial solution). We show its potential usefulness for behavioral and marketing research and for research in MCDM and the social sciences in general. In the second part of the paper we introduce the "appreciating possessed" and "craving unavailable" styles of determining the weights of criteria and discuss their possible real-life applications. Finally, we combine the two weights-determination styles into one decision-making problem and show how the cost of switching, the (un)satisfaction, the constant switching, and the new feature effects can be easily reflected in the proposed framework.
We do not propose the weights-determination styles to be normative; we present them in their simplest linear form. Even under such a simplification, the results are consistent with real-life decision-making, and the proposed framework seems to offer much-needed tools for the modelling of real-life decision-making in behavioral and social-science research. Obviously, for practical applications of this framework, behavioral research has to be conducted to find the actual weights-determination styles under the existence of a baseline. Nevertheless, we are convinced that even under the linear transformations assumed in this paper, the potential of the newly proposed formulation of the MCDM problem is evident. The proposed formulation of the decision-making problem and the possible role of the baseline in the determination of the weights of criteria (either as direct weights of criteria, or as modification coefficients for fixed weights of criteria expressed by the decision-maker) allow for the reflection of effects that can occur in real-life decision-making, such as constant switching between alternatives, reluctance to switch to a better alternative, or willingness to switch to a slightly worse alternative (depending on the satisfaction level). In essence, the problem formulation and weights determination introduced in this paper potentially provide a new approach for behavioral research, as well as a new method for the development of more


intelligent decision-support systems or more human-like artificial intelligence algorithms.

Acknowledgements The authors would like to acknowledge the support of the Ministry of Education, Youth and Sports of the Czech Republic, grant no. IGA_FF_2021_001, grant no. 190197 received from The Foundation for Economic Education, Finland, and the funding received from the Finnish Strategic Research Council, grant no. 313396/MFG40—Manufacturing 4.0. The authors would also like to thank Peter Jones for his valuable assistance with proofreading and Tomáš Talášek for technical help with the manuscript.

References

1. Asadabadi, M.R.: The stratified multi-criteria decision-making method. Knowl.-Based Syst. 162, 115–123 (2018). https://doi.org/10.1016/j.knosys.2018.07.002
2. Barnacle, R., Schmidt, C., Cuthbert, D.: Expertise and the PhD: between depth and a flat place. High. Educ. Quart. (2018)
3. Beauvois, J.L., Joule, R.V.: A Radical Dissonance Theory. Taylor and Francis, London (1996)
4. Bottomley, P.A., Doyle, J.R.: A comparison of three weight elicitation methods: good, better, and best. Omega 29(6), 553–560 (2001)
5. Cook, W.D., Kress, M.: Deriving weights from pairwise comparison ratio matrices: an axiomatic approach. Eur. J. Oper. Res. 37(3), 355–362 (1988)
6. de Almeida, A.T., de Almeida, J.A., Costa, A.P.C.S., et al.: A new method for elicitation of criteria weights in additive models: flexible and interactive tradeoff. Eur. J. Oper. Res. 250(1), 179–191 (2016)
7. Fechner, G.: Elements of Psychophysics, vol. I (1966)
8. Greco, S., Figueira, J., Ehrgott, M.: Multiple Criteria Decision Analysis. Springer (2005)
9. Gregan-Paxton, J., Cote, J.: How do investors make predictions? Insights from analogical reasoning research. J. Behav. Decis. Making 13(3), 307–327 (2000)
10. Hämäläinen, R.P., Luoma, J., Saarinen, E.: On the importance of behavioral operational research: the case of understanding and communicating about dynamic systems. Eur. J. Oper. Res. 228(3), 623–634 (2013)
11. Heath, C., Heath, D.: The Power of Moments: Why Certain Experiences Have Extraordinary Impact. Simon & Schuster, New York (2017)
12. Hoang, G., Stoklasa, J., Talášek, T.: First steps towards a lossless representation of questionnaire data and its aggregation in social science and marketing research. In: Slavickova, P., Talasek, T. (eds.) Proceedings of the International Scientific Conference Knowledge for Market Use 2018, Palacký University in Olomouc, pp. 112–118 (2018)
13. Hogarth, R.M., Einhorn, H.J.: Order effects in belief updating: the belief-adjustment model. Cognit. Psychol. 24(1), 1–55 (1992)
14. Hwang, C.L., Yoon, K.: Multiple Attribute Decision Making: Methods and Applications, A State of the Art Survey. Springer, Berlin, Heidelberg, New York (1981)
15. Jacowitz, K.E., Kahneman, D.: Measures of anchoring in estimation tasks. Person. Soc. Psychol. Bull. 21(11), 1161–1166 (1995)
16. Jandová, V., Krejčí, J., Stoklasa, J., et al.: Computing interval weights for incomplete pairwise-comparison matrices of large dimension: a weak-consistency-based approach. IEEE Trans. Fuzzy Syst. 25(6), 1714–1728 (2017)
17. Kahneman, D., Tversky, A.: Choices, values, and frames. Am. Psychol. 39(4), 341–350 (1984)
18. Kahneman, D., Tversky, A.: Prospect theory: an analysis of decision under risk. Econometrica 47(2), 263–292 (1979)
19. Kahneman, D., Klein, G.: Conditions for intuitive expertise: a failure to disagree. Am. Psychol. 64(6), 515–526 (2009). https://doi.org/10.1037/a0016755


20. Kao, C.: Weight determination for consistently ranking alternatives in multiple criteria decision analysis. Appl. Math. Model. 34(7), 1779–1787 (2010)
21. Kogut, T.: Choosing what I want or keeping what I should: the effect of decision strategy on choice consistency. Organ. Behav. Hum. Decis. Process. 116(1), 129–139 (2011)
22. Metfessel, M.: A proposal for quantitative reporting of comparative judgments. J. Psychol. 24(2), 229–235 (1947)
23. Mitra, G., Greenberg, H.J., Lootsma, F.A., et al.: Mathematical Models for Decision Support, F48 edn. Springer, Berlin, Heidelberg, New York, London, Paris, Tokyo (1988)
24. Posavac, S., Brakus, J., Jain, S., et al.: Selective assessment and positivity bias in environmental valuation. J. Exp. Psychol.-Appl. 12(1), 43–49 (2006). https://doi.org/10.1037/1076-898X.12.1.43
25. Rankin, C.H., Abrams, T., Barry, R.J., et al.: Habituation revisited: an updated and revised description of the behavioral characteristics of habituation. Neurobiol. Learn. Mem. 92(2), 135–138 (2009). https://doi.org/10.1016/j.nlm.2008.09.012
26. Saaty, T.L.: Fundamentals of Decision Making and Priority Theory with the Analytic Hierarchy Process. RWS Publications, Pittsburgh (2000)
27. Shiller, R.J.: Irrational exuberance. Philos. Publ. Policy Quart. 20(1), 18–23 (2000)
28. Thaler, R.: Toward a positive theory of consumer choice. J. Econ. Behav. Organ. 1(1), 39–60 (1980)
29. Trueblood, J.S., Pettibone, J.C.: The phantom decoy effect in perceptual decision making. J. Behav. Decis. Making 30(2), 157–167 (2017)
30. Tversky, A., Kahneman, D.: Judgment under uncertainty: heuristics and biases. Science 185(4157), 1124–1131 (1974)
31. Tversky, A., Kahneman, D.: Advances in prospect theory: cumulative representation of uncertainty. J. Risk Uncertainty 5(4), 297–323 (1992)
32. van der Sar, N.: Book review: Advances in Behavioral Finance, Thaler, R.H. (ed.), Russell Sage Foundation, New York, 1993, 597 pp. J. Behav. Decis. Making 10(4), 358–360 (1997). ISBN 0-87154-845-3, ISBN 0-87154-844-5 (pbk)
33. Wapnick, E.: How to Be Everything: A Guide for Those Who (Still) Don't Know What They Want to Be When They Grow Up. HarperCollins (2017)
34. Yeh, C., Yang, C.: How does overconfidence affect asset pricing, volatility, and volume? In: Advances in Computational Social Science, pp. 107–122. Springer (2014)
35. Yu, P.: Multiple Criteria Decision Making: Concepts, Techniques, and Extensions. Plenum Press, New York (1985)

Fuzzy Similarity and Entropy (FSAE) Feature Selection Revisited by Using Intra-class Entropy and a Normalized Scaling Factor

Christoph Lohrmann and Pasi Luukka

Abstract We revisit the fuzzy similarity and entropy (FSAE) filter method for supervised feature selection and propose the C-FSAE filter method, which only uses similarities of observations to their own class's ideal vector (intra-class) and deploys a normalized scaling factor that accounts for the distance between ideal vectors, normalized by the standard deviations of each feature in the classes. The same adjustments are implemented for the fuzzy entropy and similarity (FES) method, and the resulting version is termed the C-FES. Three simple artificial example cases showcase that the C-FSAE and C-FES result in intuitive feature rankings and consistently rank the features according to their relevance to the classification problem. On most of the seven medical real-world data sets in this study, the C-FSAE demonstrates validation accuracies competitive with ReliefF, Fisher score, Laplacian score, and symmetrical uncertainty, and performs at least as well as and often better than the FSAE. The test set accuracies confirm these competitive results: the C-FSAE achieved the highest overall test set accuracy on two data sets and, for more than half of the data sets, the highest test accuracy for at least one of the classifiers.

Keywords Feature selection · Intra-class · Similarity · Fuzzy entropy · Classification

C. Lohrmann (B) · P. Luukka
School of Business and Management, LUT University, Lappeenranta, Finland
e-mail: [email protected]
P. Luukka
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
P. Luukka and J. Stoklasa (eds.), Intelligent Systems and Applications in Business and Finance, Studies in Fuzziness and Soft Computing 415, https://doi.org/10.1007/978-3-030-93699-0_4

1 Introduction

In many machine learning applications, a large quantity of information and features are available or easily obtainable, but they are potentially of low quality [1] and the use of all of them may not always be beneficial [2]. In particular, this is the case if not all features are relevant for the studied phenomenon [3]. However, in


real-world problems it is often unknown which features are relevant [4, 5] and, thus, there may often be an incentive to include a great number of features [6, 7]. Unfortunately, including irrelevant or redundant features increases the dimensionality of a data set, resulting in higher computational cost and, in the context of classification, in lower classification accuracies and worse generalization [8, 9]. Hence, features should only be selected for inclusion in a classification model if they are relevant [10]. To address this problem, dimensionality reduction techniques can be applied to the data. These techniques are commonly divided into so-called feature selection and feature extraction (e.g. PCA) [11, 12]. The focus of this study is on feature selection, an approach that selects a subset of relevant features from the existing features in a data set [13, 14]. This is in contrast to feature extraction, which extracts new features from the existing ones and uses a subset of them [11, 12]. In the context of classification, supervised feature selection is commonly applied, since it uses the available class label information, whereas unsupervised feature selection assumes that the class labels are unknown [15]. Moreover, supervised feature selection methods can be divided into three forms: filter, wrapper, and embedded methods [2, 16]. Filter methods, which are covered in this study, are used as part of the pre-processing of a data set and, hence, do not incorporate a classifier to rank the features or select a feature subset [1, 17]. These methods are often computationally inexpensive, and the resulting feature subsets/feature rankings are classifier-independent [18, 19]. In the context of this work, we focus on the supervised filter methods for feature selection introduced by Lohrmann et al.
[20] and Luukka [21]: in particular, the fuzzy similarity and entropy (FSAE) feature selection [20] and the fuzzy entropy and similarity (FES) feature selection method [21], which was in previous works referred to as "Feature selection by Luukka (2011)" [20, 22]. The FES filter computes for each feature the similarity between all observations and the ideal vector of each class and subsequently deploys an entropy measure, such as the one by De Luca and Termini [23] or Parkash et al. [24], to determine how "informative" each feature is. The features that possess the highest entropy and are thus least informative are suggested for removal. The FSAE was proposed as an improvement to the FES and includes a scaling factor to account for the distance between class representatives [20]. The feature- and class-specific entropy values are multiplied by the feature- and class-specific scaling factors, which account for the distance between the ideal vectors of the classes, to adjust their level of informativity [20]; this is especially important when classes overlap to a large extent. In other words, the scaling factors allow, for each feature, the entropy values to be adjusted according to how far each class representative (ideal vector) is from the other classes' representatives. The FSAE filter using this feature- and class-specific scaling factor has demonstrated at least comparable but often better classification accuracies than the FES filter method [20].

In this paper, we suggest an adaptation of the FSAE and FES termed the class-wise FSAE (C-FSAE) and class-wise FES (C-FES) filters. Both the FSAE and the FES include the calculation of similarities of all observations with all class representatives, the so-called ideal vectors, and input these similarity values to entropy measures. However, this leads to problems when samples in classes overlap considerably for


a feature, since entropy values for low and high similarities are the same, eventually resulting in features being ranked as relevant even when the classes overlap considerably. This problem was remedied by introducing the FSAE with a feature- and class-specific scaling factor which scales the entropy, especially when samples in classes overlap to a large degree, to account for the reduced level of informativity of such features. In this study, we suggest overcoming this issue entirely by using, for each class, only observations belonging to that class to measure similarities with the class representative, the ideal vector. The objective of this change is to make the entropy of a class dependent only on that class's variation. Thus, low class- and feature-specific entropy values indicate that a class ideal vector represents the feature in that class well, whereas high values represent a large degree of variation within that class compared to the ideal vector representing it. In addition, for the proposed C-FSAE the scaling factor is also adjusted to not only measure the difference between the class ideal vectors but to normalize that difference by the variation of the classes for that feature.

This paper is structured as follows: in the next two sections, the class-wise fuzzy similarity and entropy (C-FSAE) (Sect. 2.1) and the class-wise fuzzy entropy and similarity (C-FES) filter method (Sect. 2.2) are introduced. Section 3.1 discusses three simple artificial test cases and the corresponding feature rankings of the two new filter methods compared to their original versions. Section 3.2 introduces the seven medical real-world data sets used in this study, whereas Sect. 4 presents the training procedure and Sect. 5 provides and analyzes the results. Finally, Sect. 6 presents the conclusion and future work.

2 Methods

2.1 Class-Wise Fuzzy Similarity and Entropy (C-FSAE) Feature Selection

Entropy measures are commonly used in supervised feature selection, including mutual information [25], the fuzzy similarity and entropy (FSAE) feature selection [20] and the fuzzy entropy and similarity (FES) feature selection [21]. Fuzzy entropy can be regarded as a "measure of the degree of fuzziness" [23] and was described by De Luca and Termini [23] as the average information contained in the data for decision-making, such as in the context of the classification of objects. For classification, meaning the use of the characteristics of observations to assign them to discrete classes [26], entropy values can indicate the informativity of features (= variables) for conducting the classification task. In particular, small entropy values signal regularities and structure in the data, whereas high entropy values indicate randomness [27]. Thus, fuzzy entropy can be deployed in order to determine the relevance of each of the features in a data set [21].


C. Lohrmann and P. Luukka

In this paper, which revisits the FSAE filter method, the entropy measure developed by De Luca and Termini [23] is applied. It can be formulated as follows [23]:

$$H(A) = -\sum_{j=1}^{n}\big[\mu_A(x_j)\log \mu_A(x_j) + (1-\mu_A(x_j))\log(1-\mu_A(x_j))\big] \qquad (1)$$

where $\mu_A(x_j) \in [0,1]$ is the membership degree of $x_j$ in the fuzzy set $A$; a membership value close to zero or one indicates that the set is informative, whereas a value close to one half (0.5) characterizes uncertainty [28], which can be considered to indicate randomness. Even though fuzzy entropy has been defined for membership degrees of fuzzy sets, it can be applied to any similar function characterizing uncertainty, e.g. to the similarity $s(x, y) = 1 - |x - y|$, where $x, y \in [0, 1]$ and $s \in [0, 1]$, analogous to $\mu_A(x_j) \in [0, 1]$ for fuzzy sets. This analogy is used in the step-by-step introduction of the proposed feature selection methods.

For the following description of the step-by-step process of the C-FSAE, let us define a data set $X$ with columns being features (= variables) $d = 1, \ldots, D$ and rows being observations $j = 1, \ldots, n$. The class labels, which indicate for each observation its crisp class membership, are denoted by $y$. The classes are denoted by $i = 1, \ldots, N$, where $N$ is the number of classes in the data set. All observations in $X$ that belong to class $i$ are denoted $X_i$, and the number of observations in that class is denoted by $n_i$. The process underlying the C-FSAE can be depicted in seven steps:

Step 1: Data normalization and data division: The data is normalized for each feature separately into the compact interval [0, 1] and subsequently divided (e.g. holdout method or k-fold cross-validation). The subsequent steps are conducted using only the training data in order to avoid the subset selection bias that can result in overly optimistic generalization errors [29, 30].

Step 2: Calculation of the ideal vectors: The ideal vector for a class $i$ is denoted $v_i = (v_{i,1}, v_{i,2}, \ldots, v_{i,D})$ and is supposed to represent class $i$ as well as possible [21]. The $d$-th element of the ideal vector $v_i$, denoted $v_{i,d}$, embodies the representative value of the $d$-th feature in class $i$ and can be defined using a generalized mean as

$$v_{i,d} = \left(\frac{1}{n_i}\sum_{x \in X_i} x_d^{\,m}\right)^{1/m} \qquad (2)$$

where $x_d$ is the value of variable $d$ in observation $x$. In the context of this work, we simply set the generalized mean parameter $m$ to 1, which corresponds to using the arithmetic mean for the calculation of the ideal vector value for each feature.


Step 3: Calculation of similarity values: The similarity $S$ of the $d$-th feature of observation $x_{j,d}$ of the training set with the $d$-th feature of the ideal vector of class $i$ using the Łukasiewicz structure [31] can be stated as:

$$S(x_{j,d}, v_{i,d}) = \sqrt[p]{\,1 - \big|x_{j,d}^{\,p} - v_{i,d}^{\,p}\big|\,} \qquad (3)$$

where the parameter $p$ from the generalized Łukasiewicz structure [32] is in this work set to 1. For each feature $d$ and each class $i$, the similarity with the ideal vector element $v_{i,d}$ is only computed for observations belonging to that class $i$ ($x \in X_i$). This is different from the FSAE, which incorporated for each feature the similarity of all observations with all ideal vectors, not just the similarity of each observation to the class it belongs to.

Step 4: Calculation of class- and feature-specific entropy values using similarities: Using the entropy measure by De Luca and Termini [23], the entropy value corresponding to each similarity value from the previous step is calculated and, subsequently, summed over all observations in that class:

$$H_{i,d} = \sum_{x_{j,d} \in X_i,\; j \in J_i} H\big(S(x_{j,d}, v_{i,d})\big) \qquad (4)$$

where $J_i$ is the index set of samples belonging to class $i$. The result is a summed entropy value that is specific to class $i$ and feature $d$. This step is conducted for each class $i$ and feature $d$.

Step 5: Calculation of class- and feature-specific scaling factors: As in Lohrmann et al. [20], the C-FSAE uses a class- and feature-specific scaling factor to account for the difference in the ideal vectors of different classes:

$$SF_{i,d} = \left(\frac{\sum_{k \neq i}\left(1 - \dfrac{|v_{i,d} - v_{k,d}|}{\sigma_{i,d} + \sigma_{k,d}}\right)^{l}}{N - 1}\right)^{1/l} \qquad (5)$$

The main adjustment relative to the scaling factor of the FSAE [20] is that the difference between two classes' ideal vector elements is normalized by the sum of the standard deviations of these classes for that feature. This change aims to account for the variation of the feature in the classes that are compared. In simple terms, two classes may appear far from one another in terms of their ideal vector elements but may actually not be that different when the variation within these classes is taken into account. The scaling factor of the FSAE only accounted for the difference between ideal vectors without considering the variation of a feature in the different classes.
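A direct transcription of Eq. (5) might look as follows (a sketch under our own naming, with the exponent $l$ set to 1 as an assumption; note that individual terms can become negative when two ideal vectors lie further apart than the summed class deviations, which is unproblematic for $l = 1$):

```python
import numpy as np

def scaling_factor(v, sigma, l=1.0):
    """C-FSAE scaling factors SF[i, d] (Eq. 5) from the (N x D) ideal
    vectors v and the (N x D) class standard deviations sigma."""
    v, sigma = np.asarray(v, float), np.asarray(sigma, float)
    N = v.shape[0]
    sf = np.empty_like(v)
    for i in range(N):
        # one normalized-difference term per other class k != i
        terms = np.array([1.0 - np.abs(v[i] - v[k]) / (sigma[i] + sigma[k])
                          for k in range(N) if k != i])
        sf[i] = (np.sum(terms ** l, axis=0) / (N - 1)) ** (1.0 / l)
    return sf

# Identical ideal vectors: no separation, so SF = 1 for that feature.
print(scaling_factor([[0.5], [0.5]], [[0.1], [0.1]]))
```

When the ideal vectors differ by exactly the summed deviations, the corresponding term, and with two classes the whole factor, drops to zero, scaling the feature's entropy down to its minimum.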


Step 6: Calculation of feature-specific scaled entropy values: The feature-specific scaled entropy values are calculated as the sum over classes of the products of the feature- and class-specific entropy values and the corresponding scaling factors:

$$SE_d = \sum_{i=1}^{N} H_{i,d} \cdot SF_{i,d} \qquad (6)$$

The outcome of this step is a scaled entropy value for each of the D features in the data set.

Step 7: Feature ranking: The scaled entropy values obtained from the previous step are sorted in increasing order, since smaller scaled entropy values indicate more informative features. A user-specified number of features with the lowest scaled entropy values can be selected and the remaining features discarded. Subsequently, the selected feature subset can be used together with a classifier on the test set to obtain an estimate of the generalization error.
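Putting Steps 2-7 together, the C-FSAE ranking can be sketched end-to-end (our own Python transcription with $p = l = m = 1$, not the authors' released MATLAB code; a small epsilon guards against zero class deviations):

```python
import numpy as np

def c_fsae_rank(X, y, p=1.0, l=1.0, m=1.0, eps=1e-12):
    """Rank features of normalized data X (n x D, values in [0, 1])
    with labels y by ascending scaled entropy SE_d (Eqs. 2-6)."""
    X, y = np.asarray(X, float), np.asarray(y)
    classes = np.unique(y)
    N, D = len(classes), X.shape[1]
    v, sigma, H = (np.zeros((N, D)) for _ in range(3))
    for i, c in enumerate(classes):
        Xi = X[y == c]
        v[i] = np.mean(Xi ** m, axis=0) ** (1.0 / m)           # Eq. (2)
        sigma[i] = np.std(Xi, axis=0)
        # class-wise Lukasiewicz similarity, own class only      Eq. (3)
        S = (1.0 - np.abs(Xi ** p - v[i] ** p)) ** (1.0 / p)
        S = np.clip(S, eps, 1 - eps)
        H[i] = -np.sum(S * np.log(S) + (1 - S) * np.log(1 - S), axis=0)
    SF = np.zeros((N, D))
    for i in range(N):                                          # Eq. (5)
        t = np.array([1 - np.abs(v[i] - v[k]) / (sigma[i] + sigma[k] + eps)
                      for k in range(N) if k != i])
        SF[i] = (np.sum(t ** l, axis=0) / (N - 1)) ** (1.0 / l)
    SE = np.sum(H * SF, axis=0)                                 # Eq. (6)
    return np.argsort(SE)                                       # Step 7
```

On data where one feature overlaps across classes while another separates them, the separating feature receives the smaller scaled entropy and is ranked first.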

2.2 Class-Wise Fuzzy Entropy and Similarity (C-FES) Feature Selection

The class-wise fuzzy entropy and similarity (C-FES) is a filter method proposed as an adjusted version of the FES, which was in previous works referred to as "Feature selection by Luukka (2011)" [20, 22]. Since the FSAE, and thus the C-FSAE, have their origin in the FES, the initial three steps are equivalent to those explained for the C-FSAE.

Step 1: Data normalization and data division: This step is equivalent to the C-FSAE.

Step 2: Calculation of the ideal vectors: This step is equivalent to the C-FSAE.

Step 3: Calculation of similarity values: This step is equivalent to the C-FSAE and only includes, for each feature, the calculation of similarities of observations with their own class's ideal vector. It is noteworthy that in the FES, the same as in the FSAE, the similarity of each observation with respect to each of the ideal vectors is determined, not just with its own class's ideal vector.

Step 4: Calculation of feature-specific entropy values: In contrast to the FSAE and C-FSAE, the C-FES does not use a scaling factor. Hence, using the entropy measure by De Luca and Termini [23], an entropy value corresponding to each similarity value from the previous step is calculated and, subsequently, summed over all observations:

$$H_d = \sum_{i=1}^{N} \sum_{x_{j,d} \in X_i,\; j \in J_i} H\big(S(x_{j,d}, v_{i,d})\big) \qquad (7)$$


The equation highlights that entropy values are only calculated for each observation with its corresponding class ideal vector and that the class-specific sums are subsequently added over all classes. The result is a summed entropy value for each feature d, so that after this step a summed entropy value has been computed for all D features.

Step 5: Feature ranking: As for the C-FSAE, the summed entropy values obtained from the previous step are ordered in increasing order, and a user-specified number of features with the lowest summed entropy can be selected.
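The C-FES therefore differs from the C-FSAE only in dropping the scaling factor; a corresponding sketch (again our own transcription, with $p = m = 1$):

```python
import numpy as np

def c_fes_rank(X, y, p=1.0, m=1.0, eps=1e-12):
    """Rank features by the class-wise summed entropy H_d (Eq. 7),
    smallest (most informative) first."""
    X, y = np.asarray(X, float), np.asarray(y)
    Hd = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xi = X[y == c]
        vi = np.mean(Xi ** m, axis=0) ** (1.0 / m)            # Eq. (2)
        # similarity of each observation with its OWN class's ideal vector
        S = (1.0 - np.abs(Xi ** p - vi ** p)) ** (1.0 / p)    # Eq. (3)
        S = np.clip(S, eps, 1 - eps)
        Hd += -np.sum(S * np.log(S) + (1 - S) * np.log(1 - S), axis=0)
    return np.argsort(Hd)
```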

3 Data

3.1 Artificial Example Data Sets

Three simple artificial example data sets are created to compare the feature rankings of the C-FSAE and C-FES against their original versions, FSAE and FES, and to highlight their differences. The examples, illustrated in Fig. 1, each include two features and have three to four classes. The three artificial examples have in common that the samples of all classes overlap completely, or at least to a large extent, in the first feature. In the first example, the first feature is irrelevant for the classification into the three classes. For the second and third example, this feature is less relevant than the second feature. In the third example, the variation within the classes differs for each feature. In all three examples the second feature is relevant: it linearly separates all classes in the first and second example and essentially linearly separates the classes in the third example, with very few misclassified observations. Thus, for all three examples the second feature should be preferred to the first feature and ranked first. To compare the C-FSAE, C-FES, FSAE and FES on these artificial examples, a 10-fold cross-validation is set up. During each iteration of the cross-validation, all four filter methods are used to rank the features in the training data, classifiers are subsequently trained on the training data using the single most important feature and,

Fig. 1 Artificial example data sets for feature selection (three scatter plots, Examples 1-3: Feature 2 versus Feature 1 for Classes 1-4)


Table 1 Test set accuracies on artificial example data sets

| Classifier & FS | Parameter | Ex. 1 Ranking | Ex. 1 Avg ± Std | Ex. 2 Ranking | Ex. 2 Avg ± Std | Ex. 3 Ranking | Ex. 3 Avg ± Std |
|---|---|---|---|---|---|---|---|
| KNN + FSAE | 10 | 2, 1 | 100±0 | 2, 1 | 100±0 | 1, 2 | 40.85±2.01 |
| KNN + C-FSAE | 10 | 2, 1 | 100±0 | 2, 1 | 100±0 | 2, 1 | 99.6±0.37 |
| KNN + FES | 10 | 1, 2 | 36.1±2.6 | 1, 2 | 78.77±2.9 | 1, 2 | 40.85±2.01 |
| KNN + C-FES | 10 | 2, 1 | 100±0 | 2, 1 | 100±0 | 2, 1 | 99.6±0.37 |
| DT + FSAE | 10 | 2, 1 | 100±0 | 2, 1 | 100±0 | 1, 2 | 39.55±2.58 |
| DT + C-FSAE | 10 | 2, 1 | 100±0 | 2, 1 | 100±0 | 2, 1 | 99.75±0.34 |
| DT + FES | 10 | 1, 2 | 34.93±2.44 | 1, 2 | 78.13±3.1 | 1, 2 | 39.55±2.58 |
| DT + C-FES | 10 | 2, 1 | 100±0 | 2, 1 | 100±0 | 2, 1 | 99.75±0.34 |
| Sim + FSAE | 1 | 2, 1 | 100±0 | 2, 1 | 100±0 | 1, 2 | 25.3±2.33 |
| Sim + C-FSAE | 1 | 2, 1 | 100±0 | 2, 1 | 100±0 | 2, 1 | 97.35±0.84 |
| Sim + FES | 1 | 1, 2 | 33.5±1.72 | 1, 2 | 79.27±2.69 | 1, 2 | 25.3±2.33 |
| Sim + C-FES | 1 | 2, 1 | 100±0 | 2, 1 | 100±0 | 2, 1 | 97.35±0.84 |
| SVM + FSAE | RBF | 2, 1 | 100±0 | 2, 1 | 100±0 | 1, 2 | 33.35±5.56 |
| SVM + C-FSAE | RBF | 2, 1 | 100±0 | 2, 1 | 100±0 | 2, 1 | 97.95±0.72 |
| SVM + FES | RBF | 1, 2 | 32.47±1.78 | 1, 2 | 79.13±2.91 | 1, 2 | 33.35±5.56 |
| SVM + C-FES | RBF | 2, 1 | 100±0 | 2, 1 | 100±0 | 2, 1 | 97.95±0.72 |

finally, the test set accuracy of the trained models is determined. The classifiers used for the comparison are the K-nearest neighbor classifier (with 10 neighbors), a decision tree (with a minimum leaf size of 10), a similarity classifier (with the p-parameter set to 1) and a support vector machine (with a radial basis function kernel). Results are displayed in Table 1 (the best results were highlighted in bold in the original).

The feature removal decision for the first artificial example shows that only the FES ranked the features incorrectly, leading to a significantly worse average test set accuracy using the single highest-ranked feature. This result is independent of the classifier used. Using only class-specific similarity values in the C-FES remedies the problem, since the variation of each class in the second feature is considerably lower than in the first feature, thus correctly ranking the second feature first. Since C-FSAE and FSAE account for the difference in the ideal vectors, both filter methods easily rank the first feature, with its almost identical ideal vector values, last. A similar situation is encountered for the second artificial example. FES suggests the removal of the second feature, which is capable of linearly separating the three classes. Even though, compared to the first example, there is less variation within each class for the first feature and more variation for the second feature, C-FES still ranks the second feature first. The C-FSAE and the FSAE account for the fact that the ideal vector values are further apart for the second feature, which contributes to reducing the entropy value for the second feature and ranking it first.


Looking at Fig. 1, it is apparent that the C-FSAE, which incorporates the standard deviation of the feature values in each class to normalize the differences between the ideal vectors, emphasizes the importance of the second feature more strongly than the FSAE, which only accounts for the differences between the ideal vectors. In the third example, both the FES and the FSAE suggest retaining the first feature, which overlaps for at least two classes and, for values in the range from 0.4 to 0.5, even for all classes (see Fig. 1, subplot 3). Only the C-FSAE and C-FES rank the second feature first, leading to significantly higher test set accuracies than using the first feature. For this example, using only the observations of the corresponding class for the calculation of similarities is crucial for ranking the features correctly. Thus, the third example illustrates that accounting for the difference between ideal vector values may not be sufficient to rank the features correctly using entropy, and that the class-wise calculation of similarity values contributes to the correct ranking in this example. Overall, C-FSAE and C-FES consistently rank the features according to their relevance, whereas the FSAE fails to do so in one of the examples and the FES in all of them.
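This intuition can be reproduced on Example-1-like data (the exact coordinates of the paper's examples are not given, so the class positions below are our assumption): the class-wise summed entropy of the overlapping first feature is far larger than that of the separable second feature, so the class-wise filters rank the second feature first.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
# Feature 1: fully overlapping for all three classes; Feature 2: separable
X = np.column_stack([
    rng.uniform(0.0, 1.0, 3 * n),
    np.concatenate([rng.uniform(c, c + 0.15, n) for c in (0.0, 0.4, 0.8)]),
])
y = np.repeat([0, 1, 2], n)

H = np.zeros(2)
for d in (0, 1):
    for c in (0, 1, 2):
        xs = X[y == c, d]
        # similarity with the class's own (arithmetic-mean) ideal vector
        s = np.clip(1.0 - np.abs(xs - xs.mean()), 1e-12, 1 - 1e-12)
        H[d] += float(-np.sum(s * np.log(s) + (1 - s) * np.log(1 - s)))

print(H)   # class-wise summed entropies; H[1] is clearly the smaller one
```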

3.2 Real-World Data Sets

The two modified feature selection methods are applied to seven real-world medical data sets obtained from the UCI Machine Learning Repository and the Kaggle Repository: Breast Cancer Wisconsin (Original) [33], Chronic Kidney Disease [34], Dermatology [35], Diabetic Retinopathy Debrecen [36], Heart Disease (Cleveland) [37], Horse Colic [38], and Pima Indians Diabetes [39]. Details about these data sets are displayed in Table 2. With the exception of the Dermatology data set, which has six unique classes, the data sets have binary class variables. The number of observations ranges from 303 to 1151 and the number of features from 8 to 34. The accuracy achieved by simply assigning all observations to the largest class in the data lies between 31.01% and 73.72%. In the Horse Colic data set, seven horses (= observations) had multiple lesions (two or three). In these cases, separate observations were created for each lesion type a horse possessed, so that each observation has only a single lesion. The class label for the Horse Colic data set represents whether a lesion is a "surgical lesion" or not.


Table 2 Summary of real-world medical data sets

| Name | Observations | Features | Classes | Missing values | Majority class (%) |
|---|---|---|---|---|---|
| Breast Cancer Wisconsin (Original) | 699 | 9 | 2 | Yes (16) | 65.01 |
| Chronic Kidney Disease | 400 | 24 | 2 | Yes (244) | 73.72 |
| Dermatology | 366 | 34 | 6 | Yes (8) | 31.01 |
| Diabetic Retinopathy Debrecen | 1151 | 19 | 2 | No | 53.08 |
| Heart Disease (Cleveland) | 303 | 13 | 2 | Yes (6) | 53.87 |
| Horse Colic | 379 | 23 | 2 | No | 63.59 |
| Pima Indians Diabetes | 768 | 8 | 2 | No | 65.10 |

4 Training Procedure

In addition to the FSAE and FES, the results of the C-FSAE and C-FES are benchmarked against five setups of common filter methods for supervised feature selection: (1) ReliefF with 10 nearest hits/misses [40, 41], (2) ReliefF with 70 nearest hits/misses and a decay parameter σ = 20 [42], (3) the Fisher score [43], (4) the Laplacian score [44], and (5) symmetrical uncertainty [45]. These nine feature ranking methods are compared with each other as well as against using no feature selection (= using the entire set of features). The models trained on the feature subsets in this study are a k-nearest neighbor classifier (KNN) [46], a binary decision tree for classification [47], a similarity classifier [32], and a support vector machine (SVM) [48]. The feature ranking and classification were implemented as part of a two-level external cross-validation procedure [49] to ensure that the feature subset selection and the parameter tuning are properly cross-validated and entirely independent of the test set. First, a 10-fold cross-validation split is conducted, with a single fold functioning in each iteration as the independent test set. Second, in each iteration the nine folds from the 10-fold split are split into 80% training data and 20% validation data using the holdout method. The training data is deployed for the feature ranking using the filter methods as well as for training all classifiers with all parameter setups (KNN and decision tree: 10, 5, 3, 1 neighbors/minimum leaf size; similarity classifier: 6, 4, 2, 1 as p-parameter; SVM: linear and radial basis kernel) and all feature subset sizes. The trained models are then applied to the validation data to select for each filter method and classifier combination the feature subset and


classifier parameters that result in the highest validation accuracy. Subsequently, the classifiers with the selected parameters and feature subsets are trained on the 9 folds (training and validation data). Finally, the test set accuracy of the trained model is determined for the independent test set. This procedure is completed for each iteration of the 10-fold cross-validation and the test set accuracies and the feature subset sizes for each filter method are averaged. Lastly, for each filter method the similarity of the selected feature subset from each iteration of the 10-fold cross-validation is measured using the “adjusted stability measure” (ASM) [50]. High ASM scores are desirable for a filter method and are achieved for feature subsets that intersect to a large extent and are small in size. The MATLAB code files for the four filter methods C-FSAE, C-FES, FSAE and FES are freely available at the GitHub repository [51].
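The two-level procedure described above can be sketched as follows (a numpy-only skeleton under our own names, with a pluggable feature ranker and classifier; the paper's MATLAB pipeline additionally tunes classifier parameters and records subset sizes and ASM scores):

```python
import numpy as np

def two_level_cv(X, y, rank_features, fit_predict, n_outer=10, seed=0):
    """Two-level external cross-validation skeleton.
    rank_features(X, y) -> feature indices, most informative first;
    fit_predict(Xtr, ytr, Xte) -> predicted labels for Xte.
    Returns the mean outer test accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_outer)
    accs = []
    for f in range(n_outer):
        test = folds[f]
        rest = np.concatenate([folds[g] for g in range(n_outer) if g != f])
        # inner holdout: 80 % training, 20 % validation
        cut = int(0.8 * len(rest))
        tr, va = rest[:cut], rest[cut:]
        order = rank_features(X[tr], y[tr])
        # pick the subset size with the best validation accuracy
        best_k, best_acc = 1, -1.0
        for k in range(1, X.shape[1] + 1):
            cols = order[:k]
            acc = np.mean(fit_predict(X[tr][:, cols], y[tr],
                                      X[va][:, cols]) == y[va])
            if acc > best_acc:
                best_k, best_acc = k, acc
        cols = order[:best_k]
        # retrain on training + validation data, evaluate on the test fold
        pred = fit_predict(X[rest][:, cols], y[rest], X[test][:, cols])
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))
```

Because the ranking and the subset-size choice both happen inside each outer fold, the reported test accuracy is free of the subset selection bias mentioned in Step 1.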

5 Results

The validation results for the "Breast Cancer Wisconsin (Original)" data set are presented in Fig. 2 and for the "Chronic Kidney Disease" data in Fig. 3. The results for "Breast Cancer Wisconsin (Original)" indicate that the C-FSAE and FSAE are very competitive with other well-known filter methods such as ReliefF, the Fisher score and the Laplacian score. For all classifiers, FES appears to perform worst, whereas C-FES clearly leads to higher validation accuracies for most feature subset sizes. It is apparent that the "Breast Cancer Wisconsin (Original)" data set contains many irrelevant or redundant features, since up to 8 out of 9 features can be removed while still maintaining an accuracy similar to that with the complete feature set. The results on the best-performing feature subset sizes (on the validation data) and the test accuracies for this data set are presented in Table 3 (the best results were highlighted in bold in the original). The average test set accuracies for all classifiers and feature selection methods are very similar, with most average accuracies lying between 96 and 97%. The average test accuracies support the impression obtained from the validation accuracies that several features can be removed without deteriorating the generalization performance. The highest mean test accuracy of 97.07% on this data set was achieved using the KNN classifier with the C-FSAE. In addition, C-FSAE generates for two of the classifiers, and overall, the most stable feature subsets (Appendix: Table 10).

The validation accuracies for the "Chronic Kidney Disease" data (Fig. 3) show that for most filter methods and classifiers the removal of up to about 22 out of 24 features leads to no deterioration of the classification accuracy. C-FES clearly outperforms the FES, especially when only few features remain. C-FSAE and FSAE show similar validation accuracies for the majority of feature subset sizes, which are competitive with the best-performing filter methods on this data set. Table 4 shows that with most filter methods test set accuracies of 98 to 100% are possible, with the KNN classifier with C-FSAE and on average 2.7 out of 24 features resulting in the highest test accuracy of 100%. For all classifiers, C-FSAE leads to at least as

Fig. 2 Validation accuracy for the Breast Cancer Wisconsin (Original) data set for all feature subset sizes by feature selection method (panels: K-nearest neighbor, Decision Tree, Similarity Classifier, Support Vector Machine)

Fig. 3 Validation accuracy for the Chronic Kidney Disease data set for all feature subset sizes by feature selection method (panels: K-nearest neighbor, Decision Tree, Similarity Classifier, Support Vector Machine)


Table 3 Test accuracy for the Breast Cancer Wisconsin data set for all classifiers by feature selection method

| Breast Cancer Wisconsin | Parameter | Min subset | Avg subset | Max subset | Accuracy (Avg ± Std) |
|---|---|---|---|---|---|
| KNN + No Feature Selection | 10 | 9 | 9 | 9 | 96.93±1.76 |
| KNN + ReliefF (10) | 10 | 2 | 5.1 | 9 | 95.9±3.09 |
| KNN + ReliefF (70, 20) | 10 | 2 | 5.3 | 9 | 95.9±3.09 |
| KNN + FSAE | 10 | 2 | 4.3 | 8 | 96.34±1.73 |
| KNN + C-FSAE | 5 | 2 | 4.8 | 8 | 97.07±1.69 |
| KNN + FES | 10 | 3 | 7.5 | 9 | 96.49±1.41 |
| KNN + C-FES | 5 | 4 | 7.5 | 9 | 96.34±1.43 |
| KNN + Fisher Score | 10 | 3 | 6.2 | 9 | 96.78±1.52 |
| KNN + Laplacian Score | 5 | 2 | 6.3 | 9 | 96.63±1.39 |
| KNN + Symmetrical Uncertainty | 10 | 2 | 4.6 | 9 | 96.49±1.23 |
| DT + No Feature Selection | 10 | 9 | 9 | 9 | 94.87±2 |
| DT + ReliefF (10) | 10 | 2 | 3.1 | 5 | 94.15±2.91 |
| DT + ReliefF (70, 20) | 5 | 2 | 3 | 5 | 94.59±2.9 |
| DT + FSAE | 10 | 2 | 2.9 | 8 | 94.59±3.5 |
| DT + C-FSAE | 1 | 2 | 3.7 | 8 | 95.17±2.94 |
| DT + FES | 5 | 3 | 5.4 | 9 | 95.61±1.55 |
| DT + C-FES | 5 | 4 | 6.3 | 9 | 94.14±2.77 |
| DT + Fisher Score | 5 | 2 | 3.8 | 8 | 93.86±3.13 |
| DT + Laplacian Score | 10 | 2 | 3.5 | 8 | 95.75±2.01 |
| DT + Symmetrical Uncertainty | 10 | 2 | 3 | 9 | 95.61±2.18 |
| Sim + No Feature Selection | 6 | 9 | 9 | 9 | 96.49±1.85 |
| Sim + ReliefF (10) | 1 | 3 | 4.6 | 7 | 94.29±2.78 |
| Sim + ReliefF (70, 20) | 1 | 3 | 3.9 | 7 | 94.29±2.95 |
| Sim + FSAE | 6 | 2 | 5.8 | 8 | 95.47±2.22 |
| Sim + C-FSAE | 6 | 3 | 6 | 8 | 95.03±2.5 |
| Sim + FES | 6 | 6 | 7.5 | 9 | 96.19±2.09 |
| Sim + C-FES | 6 | 5 | 7.5 | 9 | 96.34±1.85 |
| Sim + Fisher Score | 6 | 2 | 5.4 | 8 | 95.17±3.22 |
| Sim + Laplacian Score | 6 | 2 | 5.5 | 8 | 95.46±2.12 |
| Sim + Symmetrical Uncertainty | 6 | 2 | 5.9 | 9 | 95.46±2.12 |
| SVM + No Feature Selection | Linear | 9 | 9 | 9 | 96.92±1.47 |
| SVM + ReliefF (10) | Linear | 2 | 4.6 | 7 | 96.78±1.81 |
| SVM + ReliefF (70, 20) | Linear | 3 | 4.6 | 7 | 96.48±1.59 |
| SVM + FSAE | RBF | 2 | 5.6 | 9 | 96.05±1.95 |
| SVM + C-FSAE | RBF | 2 | 6.2 | 9 | 96.34±2.2 |
| SVM + FES | Linear | 4 | 7.8 | 9 | 96.93±1.28 |
| SVM + C-FES | Linear | 4 | 6.9 | 9 | 95.9±2.39 |
| SVM + Fisher Score | Linear | 5 | 5.9 | 8 | 96.63±1.21 |
| SVM + Laplacian Score | Linear | 2 | 6.4 | 8 | 96.63±1.56 |
| SVM + Symmetrical Uncertainty | RBF | 2 | 6.1 | 9 | 96.05±1.95 |


Table 4 Test accuracy for the Chronic Kidney Disease data set for all classifiers by feature selection method

| Chronic Kidney Disease | Parameter | Min subset | Avg subset | Max subset | Accuracy (Avg ± Std) |
|---|---|---|---|---|---|
| KNN + No Feature Selection | 1 | 24 | 24 | 24 | 98.04±3.16 |
| KNN + ReliefF (10) | 10 | 2 | 2 | 2 | 99.38±1.98 |
| KNN + ReliefF (70, 20) | 10 | 1 | 1.9 | 2 | 98.71±2.72 |
| KNN + FSAE | 10 | 2 | 2.2 | 3 | 99.38±1.98 |
| KNN + C-FSAE | 10 | 2 | 2.7 | 4 | 100±0 |
| KNN + FES | 1 | 3 | 7.5 | 12 | 98.71±2.72 |
| KNN + C-FES | 1 | 2 | 3.6 | 9 | 97.42±3.34 |
| KNN + Fisher Score | 10 | 1 | 1.3 | 2 | 99.33±2.11 |
| KNN + Laplacian Score | 10 | 2 | 2 | 2 | 99.38±1.98 |
| KNN + Symmetrical Uncertainty | 10 | 2 | 3.2 | 4 | 99.38±1.98 |
| DT + No Feature Selection | 10 | 24 | 24 | 24 | 99.33±2.11 |
| DT + ReliefF (10) | 10 | 1 | 1.8 | 2 | 99.33±2.11 |
| DT + ReliefF (70, 20) | 10 | 1 | 1.9 | 2 | 99.33±2.11 |
| DT + FSAE | 10 | 1 | 2.1 | 3 | 99.33±2.11 |
| DT + C-FSAE | 10 | 1 | 2.6 | 4 | 99.33±2.11 |
| DT + FES | 10 | 3 | 7.1 | 12 | 98.08±3.09 |
| DT + C-FES | 10 | 2 | 2.4 | 4 | 99.33±2.11 |
| DT + Fisher Score | 10 | 1 | 1.1 | 2 | 99.33±2.11 |
| DT + Laplacian Score | 10 | 1 | 1.9 | 2 | 99.33±2.11 |
| DT + Symmetrical Uncertainty | 10 | 1 | 3.1 | 4 | 99.33±2.11 |
| Sim + No Feature Selection | 6 | 24 | 24 | 24 | 100±0 |
| Sim + ReliefF (10) | 6 | 1 | 3.7 | 7 | 97.46±3.28 |
| Sim + ReliefF (70, 20) | 6 | 2 | 3.5 | 7 | 97.46±3.28 |
| Sim + FSAE | 6 | 1 | 3.8 | 9 | 97.46±3.28 |
| Sim + C-FSAE | 6 | 1 | 3.5 | 8 | 98.08±3.09 |
| Sim + FES | 6 | 3 | 6.8 | 11 | 98.08±3.09 |
| Sim + C-FES | 6 | 2 | 5.8 | 11 | 96.79±3.39 |
| Sim + Fisher Score | 6 | 1 | 3.1 | 7 | 96.83±6.1 |
| Sim + Laplacian Score | 6 | 1 | 3.4 | 9 | 97.46±3.28 |
| Sim + Symmetrical Uncertainty | 6 | 1 | 3.6 | 9 | 98.71±2.72 |
| SVM + No Feature Selection | Linear | 24 | 24 | 24 | 99.38±1.98 |
| SVM + ReliefF (10) | Linear | 1 | 2.5 | 4 | 98.08±3.09 |
| SVM + ReliefF (70, 20) | Linear | 1 | 2.4 | 4 | 98.08±3.09 |
| SVM + FSAE | Linear | 1 | 2.8 | 5 | 98.08±3.09 |
| SVM + C-FSAE | Linear | 1 | 3.3 | 8 | 98.71±2.72 |
| SVM + FES | Linear | 3 | 6.8 | 11 | 98.08±3.09 |
| SVM + C-FES | Linear | 2 | 3.8 | 9 | 97.42±3.34 |
| SVM + Fisher Score | RBF | 1 | 2.3 | 7 | 97.42±4.45 |
| SVM + Laplacian Score | Linear | 1 | 2.3 | 4 | 98.08±3.09 |
| SVM + Symmetrical Uncertainty | Linear | 1 | 3.6 | 9 | 98.71±2.72 |


accurate or more accurate test set results than the FSAE. In addition, it is noteworthy that C-FES results for all classifiers in considerably smaller average feature subsets than the FES, in some cases also resulting in higher average test set accuracies. With the sole exception of the FES, all feature selection methods lead to highly stable feature subsets (Appendix: Table 11).

The validation set accuracies for the "Dermatology" data set are displayed in Fig. 4 and for "Diabetic Retinopathy Debrecen" in Fig. 5. For the "Dermatology" data set the validation accuracies indicate that only up to about half of the features can be removed before the classification accuracies for all filter methods and classifiers start to deteriorate. It is apparent that for all classifiers the two ReliefF setups allow the most feature removals before the validation accuracies start to deteriorate. However, for 21 to 26 feature removals, most of the remaining filter methods result in better validation accuracies. C-FSAE clearly outperforms the original FSAE for most feature subset sizes. The results in Table 5 illustrate that most filter methods accomplish average test accuracies that are comparable to or slightly higher than using no feature selection. In most cases, fewer than 10 features were removed on average. Together with the SVM, the entropy-based filter methods C-FSAE, C-FES, FSAE and FES lead to the highest test set accuracies on this data set. C-FSAE generates the most stable feature subsets on this data set (Appendix: Table 12).

The validation results for "Diabetic Retinopathy Debrecen" in Fig. 5 highlight that the C-FSAE outperforms the FSAE, C-FES and FES for most feature subset sizes, but leads to an earlier decline in the validation accuracies than the remaining filter methods. The validation accuracies accomplished using the similarity classifier are for all filter methods clearly worse than those achievable with the other classifiers. Table 6 shows that for the KNN classifier and the decision tree the Fisher score clearly leads to the highest classification results, while it shows the least accurate results for the SVM. For all filter methods, the test set accuracies using the similarity classifier are several percentage points worse than with the other classifiers. The performances of the C-FSAE and FSAE as well as of the C-FES and FES are similar for all classifiers, with the exception of the similarity classifier. All filter methods appear moderately stable on this data set, with the FSAE overall generating the most similar feature subsets (Appendix: Table 13).

The validation set accuracies for the "Heart Disease (Cleveland)" data set are displayed in Fig. 6 and for the "Horse Colic" data set in Fig. 7. For the "Heart Disease (Cleveland)" data set, the validation accuracies of the C-FSAE are very competitive with all filter methods, resulting for all classifiers except the KNN classifier in consistently higher accuracies than the FSAE. For the KNN classifier, the FSAE appears slightly better for the initial feature removals, whereas the C-FSAE performs better when more features are removed. Besides that, the C-FES appears to perform slightly but consistently better than the FES. For all classifiers, the validation accuracy remains at the same level or even improves slightly when about 8 up to 10 out of the 13 features are removed. Table 7 illustrates that the C-FSAE often achieves similar results to the FSAE, but with fewer features on average. An exception is the decision tree, where the C-FSAE additionally leads to the highest test accuracy and is several percentage points more accurate than the FSAE. The highest average

[Figure] Fig. 4 Validation accuracy for the Dermatology data set for all feature subset sizes by feature selection method (panels: K-nearest neighbor, Decision Tree, Similarity Classifier, Support Vector Machine; x-axis: Feature Removals (0–30), y-axis: Accuracy (0–1); series: ReliefF (10), ReliefF (70,20), FSAE, C-FSAE, FES, C-FES, Fisher Score, Laplacian Score, Symmetrical Uncertainty)

Fuzzy Similarity and Entropy (FSAE) Feature Selection Revisited … 77

[Figure] Fig. 5 Validation accuracy for the Diabetic Retinopathy Debrecen data set for all feature subset sizes by feature selection method (panels: K-nearest neighbor, Decision Tree, Similarity Classifier, Support Vector Machine; x-axis: Feature Removals (0–15), y-axis: Accuracy; series: ReliefF (10), ReliefF (70,20), FSAE, C-FSAE, FES, C-FES, Fisher Score, Laplacian Score, Symmetrical Uncertainty)

78 C. Lohrmann and P. Luukka


Table 5 Test accuracy for the Dermatology data set for all classifiers by feature selection method

| Dermatology | Parameter | Min subset | Avg subset | Max subset | Accuracy (Avg ± Std) |
|---|---|---|---|---|---|
| KNN + No Feature Selection | 10 | 34 | 34 | 34 | 96.07±3.58 |
| KNN + ReliefF (10) | 10 | 18 | 25.8 | 34 | 96.64±3.45 |
| KNN + ReliefF (70, 20) | 10 | 19 | 27.4 | 34 | 96.36±3.26 |
| KNN + FSAE | 3 | 23 | 25.9 | 34 | 96.08±3.82 |
| KNN + C-FSAE | 1 | 24 | 28.2 | 31 | 96.64±2.6 |
| KNN + FES | 1 | 23 | 26.7 | 34 | 94.97±4.91 |
| KNN + C-FES | 3 | 24 | 27.5 | 34 | 96.91±3.88 |
| KNN + Fisher Score | 10 | 23 | 25.5 | 33 | 97.75±3.21 |
| KNN + Laplacian Score | 5 | 25 | 28.6 | 34 | 96.63±4.56 |
| KNN + Symmetrical Uncertainty | 3 | 23 | 27.4 | 32 | 94.41±3.72 |
| DT + No Feature Selection | 1 | 34 | 34 | 34 | 94.41±3.23 |
| DT + ReliefF (10) | 10 | 16 | 21.1 | 26 | 94.14±2.45 |
| DT + ReliefF (70, 20) | 10 | 17 | 20.6 | 25 | 94.14±2.45 |
| DT + FSAE | 3 | 22 | 22.9 | 23 | 94.41±3.22 |
| DT + C-FSAE | 3 | 24 | 26 | 29 | 93.85±2.89 |
| DT + FES | 3 | 23 | 23.4 | 24 | 94.41±3.22 |
| DT + C-FES | 3 | 24 | 24 | 24 | 94.68±2.47 |
| DT + Fisher Score | 3 | 23 | 24.4 | 25 | 94.41±2.64 |
| DT + Laplacian Score | 3 | 23 | 24.8 | 25 | 94.69±2.78 |
| DT + Symmetrical Uncertainty | 1 | 20 | 30.7 | 34 | 92.72±6.35 |
| Sim + No Feature Selection | 1 | 34 | 34 | 34 | 94.94±4.59 |
| Sim + ReliefF (10) | 1 | 18 | 22.9 | 34 | 95.22±4.45 |
| Sim + ReliefF (70, 20) | 1 | 18 | 22.9 | 32 | 95.21±5.04 |
| Sim + FSAE | 2 | 24 | 29.9 | 34 | 94.12±3.84 |
| Sim + C-FSAE | 1 | 25 | 28.8 | 33 | 94.68±4.07 |
| Sim + FES | 1 | 25 | 30.8 | 34 | 93.56±4.21 |
| Sim + C-FES | 1 | 24 | 29 | 34 | 95.25±3.74 |
| Sim + Fisher Score | 1 | 24 | 26 | 28 | 94.65±4.93 |
| Sim + Laplacian Score | 1 | 25 | 26.6 | 31 | 94.38±4.04 |
| Sim + Symmetrical Uncertainty | 1 | 24 | 30.4 | 34 | 92.44±5.89 |
| SVM + No Feature Selection | Linear | 34 | 34 | 34 | 98.32±1.95 |
| SVM + ReliefF (10) | Linear | 19 | 22.6 | 26 | 97.48±2.46 |
| SVM + ReliefF (70, 20) | Linear | 20 | 22.8 | 32 | 97.48±2.46 |
| SVM + FSAE | Linear | 22 | 24 | 34 | 98.33±1.95 |
| SVM + C-FSAE | Linear | 25 | 28.6 | 32 | 98.05±2.29 |
| SVM + FES | Linear | 21 | 25.9 | 34 | 98.88±1.44 |
| SVM + C-FES | Linear | 24 | 25 | 34 | 98.33±2.69 |
| SVM + Fisher Score | Linear | 22 | 26.8 | 34 | 97.77±2.55 |
| SVM + Laplacian Score | Linear | 24 | 27.8 | 31 | 97.77±2.87 |
| SVM + Symmetrical Uncertainty | Linear | 20 | 30.5 | 34 | 97.48±2.76 |


Table 6 Test accuracy for the Diabetic Retinopathy Debrecen data set for all classifiers by feature selection method

| Retinopathy Debrecen | Parameter | Min subset | Avg subset | Max subset | Accuracy (Avg ± Std) |
|---|---|---|---|---|---|
| KNN + No Feature Selection | 10 | 19 | 19 | 19 | 62.47±3.58 |
| KNN + ReliefF (10) | 10 | 2 | 6.5 | 12 | 64.99±4.5 |
| KNN + ReliefF (70, 20) | 10 | 2 | 7 | 14 | 64.9±4.27 |
| KNN + FSAE | 10 | 12 | 14.7 | 19 | 64.29±4.18 |
| KNN + C-FSAE | 10 | 10 | 13.9 | 16 | 65.94±4.01 |
| KNN + FES | 5 | 12 | 15.3 | 19 | 63.86±3.26 |
| KNN + C-FES | 5 | 12 | 15.5 | 19 | 63.77±3.38 |
| KNN + Fisher Score | 3 | 2 | 9.2 | 15 | 69.07±3.47 |
| KNN + Laplacian Score | 5 | 3 | 6.7 | 17 | 65.68±4.87 |
| KNN + Symmetrical Uncertainty | 5 | 6 | 9.7 | 15 | 67.25±4.85 |
| DT + No Feature Selection | 10 | 19 | 19 | 19 | 63.33±4.98 |
| DT + ReliefF (10) | 3 | 2 | 5.8 | 11 | 65.51±5.11 |
| DT + ReliefF (70, 20) | 10 | 4 | 7.5 | 14 | 65.42±5.88 |
| DT + FSAE | 10 | 11 | 14.5 | 18 | 62.11±6.13 |
| DT + C-FSAE | 1 | 9 | 12.6 | 15 | 61.77±5.23 |
| DT + FES | 10 | 11 | 14.4 | 18 | 60.55±3.86 |
| DT + C-FES | 10 | 11 | 14.1 | 18 | 61.68±5.19 |
| DT + Fisher Score | 10 | 3 | 4.9 | 11 | 69.33±4.69 |
| DT + Laplacian Score | 1 | 3 | 6.1 | 14 | 68.11±7 |
| DT + Symmetrical Uncertainty | 3 | 2 | 8.7 | 14 | 63.85±5.36 |
| Sim + No Feature Selection | 1 | 19 | 19 | 19 | 56.83±6.3 |
| Sim + ReliefF (10) | 6 | 1 | 1 | 1 | 59.86±5.89 |
| Sim + ReliefF (70, 20) | 6 | 1 | 1 | 1 | 59.86±5.89 |
| Sim + FSAE | 2 | 9 | 13 | 14 | 59.26±6.77 |
| Sim + C-FSAE | 1 | 1 | 8.1 | 14 | 54.91±4.52 |
| Sim + FES | 2 | 9 | 13.8 | 15 | 55.69±4.85 |
| Sim + C-FES | 2 | 2 | 11 | 15 | 56.22±4.28 |
| Sim + Fisher Score | 4 | 1 | 3.2 | 12 | 59.34±6.22 |
| Sim + Laplacian Score | 1 | 1 | 1.9 | 2 | 57.95±3.51 |
| Sim + Symmetrical Uncertainty | 1 | 5 | 6.1 | 16 | 60.21±6.29 |
| SVM + No Feature Selection | Linear | 19 | 19 | 19 | 68.63±4.56 |
| SVM + ReliefF (10) | Linear | 7 | 14.4 | 19 | 69.5±5.05 |
| SVM + ReliefF (70, 20) | Linear | 8 | 14.3 | 19 | 69.15±4.53 |
| SVM + FSAE | RBF | 15 | 16.8 | 18 | 68.81±4.18 |
| SVM + C-FSAE | RBF | 13 | 16.8 | 19 | 69.67±4.71 |
| SVM + FES | RBF | 14 | 16.1 | 18 | 69.68±3.99 |
| SVM + C-FES | RBF | 14 | 16.3 | 18 | 69.5±3.99 |
| SVM + Fisher Score | RBF | 10 | 16.2 | 19 | 67.41±6.47 |
| SVM + Laplacian Score | RBF | 9 | 13.4 | 19 | 69.94±4.13 |
| SVM + Symmetrical Uncertainty | RBF | 14 | 16.4 | 18 | 70.2±3.07 |

[Figure] Fig. 6 Validation accuracy for the Heart Disease (Cleveland) data set for all feature subset sizes by feature selection method (panels: K-nearest neighbor, Decision Tree, Similarity Classifier, Support Vector Machine; x-axis: Feature Removals (0–12), y-axis: Accuracy; series: ReliefF (10), ReliefF (70,20), FSAE, C-FSAE, FES, C-FES, Fisher Score, Laplacian Score, Symmetrical Uncertainty)


[Figure] Fig. 7 Validation accuracy for the Horse Colic data set for all feature subset sizes by feature selection method (panels: K-nearest neighbor, Decision Tree, Similarity Classifier, Support Vector Machine; x-axis: Feature Removals (0–20), y-axis: Accuracy; series: ReliefF (10), ReliefF (70,20), FSAE, C-FSAE, FES, C-FES, Fisher Score, Laplacian Score, Symmetrical Uncertainty)


Table 7 Test accuracy for the Heart Disease data set for all classifiers by feature selection method

| Heart Disease (Cleveland) | Parameter | Min subset | Avg subset | Max subset | Accuracy (Avg ± Std) |
|---|---|---|---|---|---|
| KNN + No Feature Selection | 10 | 13 | 13 | 13 | 80.85±6.46 |
| KNN + ReliefF (10) | 5 | 4 | 6.4 | 9 | 79.2±7.27 |
| KNN + ReliefF (70, 20) | 5 | 4 | 5.3 | 8 | 79.47±6.55 |
| KNN + FSAE | 5 | 6 | 9 | 12 | 81.15±3.97 |
| KNN + C-FSAE | 5 | 2 | 5.7 | 10 | 80.85±7.95 |
| KNN + FES | 10 | 8 | 10.1 | 13 | 81.18±8.13 |
| KNN + C-FES | 3 | 8 | 10 | 13 | 81.53±8.27 |
| KNN + Fisher Score | 10 | 3 | 6.4 | 10 | 82.22±8.27 |
| KNN + Laplacian Score | 10 | 2 | 5.5 | 8 | 79.84±5.32 |
| KNN + Symmetrical Uncertainty | 10 | 4 | 6.5 | 11 | 77.48±6.78 |
| DT + No Feature Selection | 10 | 13 | 13 | 13 | 78.54±11.26 |
| DT + ReliefF (10) | 10 | 4 | 5.5 | 7 | 81.53±9.62 |
| DT + ReliefF (70, 20) | 10 | 4 | 5.9 | 9 | 81.2±9.43 |
| DT + FSAE | 10 | 2 | 7.3 | 11 | 74.06±8.44 |
| DT + C-FSAE | 10 | 4 | 6.2 | 8 | 81.91±7.46 |
| DT + FES | 10 | 8 | 9.6 | 11 | 76.49±10.33 |
| DT + C-FES | 10 | 3 | 9.1 | 11 | 77.83±8.11 |
| DT + Fisher Score | 10 | 2 | 4.4 | 7 | 80.12±10.04 |
| DT + Laplacian Score | 10 | 5 | 5.9 | 8 | 79.23±9.15 |
| DT + Symmetrical Uncertainty | 10 | 3 | 5.1 | 9 | 80.87±8.22 |
| Sim + No Feature Selection | 1 | 13 | 13 | 13 | 82.53±3.97 |
| Sim + ReliefF (10) | 1 | 5 | 8.3 | 12 | 81.84±5.42 |
| Sim + ReliefF (70, 20) | 1 | 4 | 7.7 | 12 | 83.85±4.62 |
| Sim + FSAE | 1 | 6 | 8.8 | 10 | 82.86±5.93 |
| Sim + C-FSAE | 1 | 6 | 8.9 | 12 | 82.52±4.57 |
| Sim + FES | 1 | 5 | 9.5 | 13 | 79.48±8.72 |
| Sim + C-FES | 1 | 5 | 9.5 | 13 | 79.48±8.72 |
| Sim + Fisher Score | 1 | 4 | 5.7 | 7 | 84.22±6.02 |
| Sim + Laplacian Score | 1 | 5 | 7.8 | 12 | 82.2±5.58 |
| Sim + Symmetrical Uncertainty | 1 | 5 | 6 | 8 | 84.89±4.76 |
| SVM + No Feature Selection | Linear | 13 | 13 | 13 | 82.87±6.1 |
| SVM + ReliefF (10) | Linear | 5 | 6.1 | 8 | 83.2±5.86 |
| SVM + ReliefF (70, 20) | Linear | 4 | 5.9 | 8 | 83.18±4.12 |
| SVM + FSAE | Linear | 8 | 10.2 | 13 | 83.22±6.94 |
| SVM + C-FSAE | Linear | 5 | 7.2 | 12 | 83.54±6.32 |
| SVM + FES | Linear | 8 | 10.6 | 13 | 82.82±9.94 |
| SVM + C-FES | Linear | 9 | 10.5 | 13 | 82.47±9.46 |
| SVM + Fisher Score | Linear | 4 | 6.6 | 11 | 84.55±7.07 |
| SVM + Laplacian Score | Linear | 5 | 6.4 | 8 | 82.86±6.15 |
| SVM + Symmetrical Uncertainty | Linear | 4 | 7.1 | 11 | 82.52±5.61 |


test set accuracy is accomplished using the similarity classifier with symmetrical uncertainty. However, it is noteworthy that symmetrical uncertainty together with any other classifier leads to considerably worse test set accuracies, which are lower than those of the C-FSAE. All filter methods are moderately stable on this data set, with the C-FES overall generating slightly more stable feature subsets than all other filter methods (Appendix: Table 14). The validation accuracies for the “Horse Colic” data set (Fig. 7) show competitive results for the C-FSAE for all classifiers; it belongs to the approaches with the highest validation accuracies for all feature subset sizes. The C-FSAE appears to lead to more accurate validation results than the FSAE, which is most apparent for the decision tree classifier. It is noteworthy that symmetrical uncertainty performs comparatively poorly on this data set and suggests less accurate feature subsets after 13 to 14 feature removals. The C-FES and FES lead to comparable but clearly lower validation accuracies than the remaining approaches after about 10 features are removed. In terms of the test set accuracies (Table 8), ReliefF (70, 20) together with the SVM leads to the highest average test accuracy of 88.11%. However, the C-FSAE leads to the highest classification accuracy with the decision tree, which is the second highest accuracy on this data set overall, as well as to the highest test set accuracy for the KNN classifier. In addition, the C-FSAE leads to higher average test set accuracies than the FSAE for all classifiers, and the C-FES to at least as accurate or more accurate results than the FES. With the exception of the similarity classifier, the best test set accuracies per classifier are achieved with about half or fewer of all features.

All filter methods are only moderately stable on this data set, with the C-FES and FES clearly being the two most stable approaches (Appendix: Table 15). For the “Pima Indians Diabetes” data set (Fig. 8), the validation accuracies for the C-FSAE start to deteriorate after more than 5 out of 8 feature removals and fall below those of the best competing filter methods: ReliefF, the Laplacian score, symmetrical uncertainty and the Fisher score. Notwithstanding, the accuracies achieved by the C-FSAE are clearly higher for most feature subset sizes than those of the FSAE, C-FES and FES, which start to deteriorate even faster and more strongly. The test set accuracies (Table 9) indicate that the C-FSAE, FSAE, C-FES and FES still lead to competitive results compared to all other filter methods, with the best average result of 77.75% on this data set coming from the C-FES with the SVM classifier. The C-FES is for all classifiers at least as accurate as the FES. The C-FSAE leads for most classifiers to results comparable to the FSAE but is on average more than one percentage point more accurate for the decision tree. In terms of stability, all filter methods show only low to moderate stability, with the Fisher score being the most stable method for most classifiers and overall (Appendix: Table 16). Overall, the validation accuracies indicate that the C-FSAE leads to higher accuracies than the FSAE for the majority of the data sets and to at least similar accuracies for the remaining data sets. Compared to the C-FES and FES, it outperforms them on all data sets for the vast majority of feature subset sizes. The C-FSAE shows competitive results relative to the best performing filter methods on most data sets, with the exception of the Diabetic Retinopathy Debrecen and Pima Indians data sets, where it underperforms these filter methods after roughly one third and half of

Fuzzy Similarity and Entropy (FSAE) Feature Selection Revisited …

85

Table 8 Test accuracy for the Horse Colic data set for all classifiers by feature selection method

| Horse Colic | Parameter | Min subset | Avg subset | Max subset | Accuracy (Avg ± Std) |
|---|---|---|---|---|---|
| KNN + No Feature Selection | 10 | 23 | 23 | 23 | 86.29±4.59 |
| KNN + ReliefF (10) | 1 | 3 | 8.7 | 21 | 86.54±4.74 |
| KNN + ReliefF (70, 20) | 1 | 4 | 7 | 12 | 86.27±4.46 |
| KNN + FSAE | 3 | 8 | 11.5 | 22 | 85.74±4.02 |
| KNN + C-FSAE | 5 | 3 | 9.7 | 21 | 86.54±3.64 |
| KNN + FES | 10 | 7 | 17 | 23 | 82.85±6.93 |
| KNN + C-FES | 5 | 10 | 17.7 | 23 | 84.96±4.82 |
| KNN + Fisher Score | 3 | 2 | 8.8 | 22 | 86.54±5.49 |
| KNN + Laplacian Score | 5 | 2 | 5.2 | 15 | 86±5.17 |
| KNN + Symmetrical Uncertainty | 3 | 5 | 11.8 | 21 | 83.9±7.09 |
| DT + No Feature Selection | 5 | 23 | 23 | 23 | 86.25±6.65 |
| DT + ReliefF (10) | 5 | 3 | 8.8 | 21 | 86.81±4.3 |
| DT + ReliefF (70, 20) | 5 | 3 | 7.6 | 16 | 86.54±3.81 |
| DT + FSAE | 5 | 6 | 12.3 | 23 | 85.48±3.63 |
| DT + C-FSAE | 5 | 3 | 11.7 | 21 | 87.59±4.85 |
| DT + FES | 5 | 7 | 15.6 | 19 | 84.68±6.71 |
| DT + C-FES | 5 | 6 | 16 | 20 | 84.68±6.71 |
| DT + Fisher Score | 5 | 5 | 12 | 20 | 86.01±3.96 |
| DT + Laplacian Score | 5 | 2 | 9.7 | 19 | 86.27±4.67 |
| DT + Symmetrical Uncertainty | 5 | 4 | 11.3 | 19 | 86.02±7.75 |
| Sim + No Feature Selection | 2 | 23 | 23 | 23 | 81.53±3.29 |
| Sim + ReliefF (10) | 1 | 1 | 9.1 | 23 | 83.63±4.11 |
| Sim + ReliefF (70, 20) | 1 | 1 | 8.7 | 23 | 83.11±4.36 |
| Sim + FSAE | 1 | 8 | 13.2 | 23 | 83.11±4.86 |
| Sim + C-FSAE | 1 | 5 | 8.4 | 17 | 83.89±3.9 |
| Sim + FES | 1 | 14 | 17.7 | 22 | 83.63±4.84 |
| Sim + C-FES | 2 | 15 | 18.3 | 22 | 83.91±4.2 |
| Sim + Fisher Score | 1 | 4 | 9.1 | 18 | 83.9±3.2 |
| Sim + Laplacian Score | 2 | 1 | 9.3 | 19 | 82.58±4.02 |
| Sim + Symmetrical Uncertainty | 2 | 10 | 13.5 | 21 | 83.63±3.5 |
| SVM + No Feature Selection | Linear | 23 | 23 | 23 | 87.07±4.38 |
| SVM + ReliefF (10) | RBF | 3 | 8.9 | 21 | 88.11±6.04 |
| SVM + ReliefF (70, 20) | RBF | 3 | 10.9 | 22 | 87.06±4.78 |
| SVM + FSAE | Linear | 8 | 13.6 | 23 | 84.69±5.41 |
| SVM + C-FSAE | Linear | 6 | 13.9 | 22 | 85.75±5.14 |
| SVM + FES | RBF | 14 | 17.9 | 22 | 87.07±4.2 |
| SVM + C-FES | RBF | 14 | 19.7 | 22 | 87.33±4.08 |
| SVM + Fisher Score | Linear | 4 | 9.9 | 19 | 86.54±5.33 |
| SVM + Laplacian Score | RBF | 2 | 13.2 | 22 | 86.28±4.93 |
| SVM + Symmetrical Uncertainty | Linear | 11 | 15.1 | 22 | 86.81±3.72 |

[Figure] Fig. 8 Validation accuracy for the Pima Indians Diabetes data set for all feature subset sizes by feature selection method (panels: K-nearest neighbor, Decision Tree, Similarity Classifier, Support Vector Machine; x-axis: Feature Removals (0–8), y-axis: Accuracy; series: ReliefF (10), ReliefF (70,20), FSAE, C-FSAE, FES, C-FES, Fisher Score, Laplacian Score, Symmetrical Uncertainty)


Table 9 Test accuracy for the Pima Indians data set for all classifiers by feature selection method

| Pima Indians Diabetes | Parameter | Min subset | Avg subset | Max subset | Accuracy (Avg ± Std) |
|---|---|---|---|---|---|
| KNN + No Feature Selection | 10 | 8 | 8 | 8 | 73.58±5.53 |
| KNN + ReliefF (10) | 3 | 2 | 5.9 | 8 | 73.58±6.8 |
| KNN + ReliefF (70, 20) | 10 | 3 | 5.4 | 8 | 73.32±6.75 |
| KNN + FSAE | 10 | 5 | 6.3 | 8 | 73.45±5.47 |
| KNN + C-FSAE | 10 | 3 | 5.8 | 8 | 72.93±5.08 |
| KNN + FES | 3 | 5 | 6.7 | 8 | 73.19±5.64 |
| KNN + C-FES | 10 | 5 | 6.6 | 8 | 73.71±5.14 |
| KNN + Fisher Score | 10 | 3 | 4.3 | 6 | 74.75±4.49 |
| KNN + Laplacian Score | 10 | 3 | 4.6 | 8 | 74.75±5.89 |
| KNN + Symmetrical Uncertainty | 10 | 4 | 5.4 | 8 | 72.28±5.23 |
| DT + No Feature Selection | 10 | 8 | 8 | 8 | 72.66±5.16 |
| DT + ReliefF (10) | 10 | 1 | 5.8 | 8 | 73.96±4.47 |
| DT + ReliefF (70, 20) | 10 | 1 | 5.2 | 8 | 73.31±5.75 |
| DT + FSAE | 10 | 3 | 5.8 | 8 | 72.02±5.08 |
| DT + C-FSAE | 10 | 4 | 6.2 | 8 | 73.32±5.55 |
| DT + FES | 10 | 3 | 6.4 | 8 | 70.59±6.55 |
| DT + C-FES | 10 | 3 | 5.9 | 8 | 71.63±5.06 |
| DT + Fisher Score | 10 | 1 | 4.5 | 8 | 73.44±4.48 |
| DT + Laplacian Score | 10 | 1 | 5.6 | 8 | 73.96±3.59 |
| DT + Symmetrical Uncertainty | 10 | 1 | 4.5 | 8 | 75.01±5.26 |
| Sim + No Feature Selection | 6 | 8 | 8 | 8 | 74.62±6.64 |
| Sim + ReliefF (10) | 6 | 1 | 4.6 | 8 | 73.71±6.12 |
| Sim + ReliefF (70, 20) | 6 | 1 | 3.7 | 8 | 73.84±5.86 |
| Sim + FSAE | 1 | 5 | 6.3 | 8 | 74.23±6.6 |
| Sim + C-FSAE | 1 | 2 | 5.6 | 8 | 74.09±5.04 |
| Sim + FES | 1 | 5 | 6.6 | 8 | 74.88±5.53 |
| Sim + C-FES | 1 | 5 | 6.4 | 8 | 74.88±5.53 |
| Sim + Fisher Score | 1 | 1 | 4.3 | 7 | 73.84±6.22 |
| Sim + Laplacian Score | 1 | 1 | 5 | 8 | 73.57±4.57 |
| Sim + Symmetrical Uncertainty | 1 | 1 | 4.6 | 7 | 73.97±5.64 |
| SVM + No Feature Selection | Linear | 8 | 8 | 8 | 77.36±6.08 |
| SVM + ReliefF (10) | RBF | 1 | 5.9 | 8 | 76.71±5.55 |
| SVM + ReliefF (70, 20) | RBF | 3 | 6.5 | 8 | 77.36±5.3 |
| SVM + FSAE | RBF | 5 | 6 | 8 | 77.23±5.73 |
| SVM + C-FSAE | RBF | 3 | 6.3 | 8 | 77.36±4.82 |
| SVM + FES | RBF | 5 | 6.3 | 8 | 77.36±5.89 |
| SVM + C-FES | RBF | 5 | 6.4 | 8 | 77.75±5.8 |
| SVM + Fisher Score | Linear | 2 | 5.6 | 8 | 77.36±5.93 |
| SVM + Laplacian Score | RBF | 1 | 5.3 | 8 | 77.1±6.5 |
| SVM + Symmetrical Uncertainty | RBF | 3 | 5.9 | 8 | 76.58±4.92 |


the feature removals, respectively. In terms of the test set results, the C-FSAE returns the highest test set accuracy overall for two data sets, and for four data sets it results in the highest test set accuracy for at least one of the classifiers. It shows competitive test set accuracies for all data sets with the exception of the Pima Indians data set, for which the results are mixed. The test set accuracies of the C-FES are more competitive with those of the C-FSAE than the validation accuracies were, with the C-FES being the overall best filter method on the Pima Indians data set and also providing, for three data sets, the highest test accuracy for at least one of the classifiers.

6 Conclusion

In this study, we revisited the FSAE filter method for supervised feature selection and suggested the class-wise fuzzy similarity and entropy (C-FSAE) filter, which uses (1) the entropy of the similarities of observations to their own class ideal vector and (2) a scaling factor that accounts for the variation of a feature within each of the classes. The idea of using only the entropy of the similarities of observations to their own class ideal vector was also applied to the FES method, which was hence termed C-FES. On three simple artificial example cases, the C-FSAE and C-FES demonstrated the ability to rank the features according to their relevance to the classification task, whereas the FES failed to do so for all examples and the FSAE for one of the examples. In addition, all four of these filter methods were tested on seven medical real-world data sets and benchmarked against five setups of popular filter methods: the ReliefF algorithm with 10 nearest hits/misses as well as with 70 nearest hits/misses and a decay factor of 20, the Fisher score, the Laplacian score, and symmetrical uncertainty. The validation accuracies for each feature subset size provided an understanding of how the removal of features gradually affects the classification accuracy. The C-FSAE showed very competitive accuracies compared to the best performing filter methods for all feature subset sizes on most data sets. Moreover, the results demonstrated that the C-FSAE performs for most feature subset sizes at least as well as the FSAE and often even better. The C-FES displayed a similar behavior compared to the FES but, in contrast, did not show competitive results when a moderate to large share of the features was removed. This indicates its ability to rank irrelevant features well, but also highlights its difficulty in ranking relevant features according to their importance.
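The two components named above can be illustrated with a minimal sketch. This is not the authors' exact formulation: the Łukasiewicz-type similarity, the De Luca–Termini entropy, and the (1 + within-class standard deviation) scaling used here are illustrative assumptions; the point is only that, per feature, entropy is computed from similarities of observations to their own class ideal vector and then weighted by within-class variation, so a lower aggregated score marks a more relevant feature.

```python
import numpy as np

def de_luca_termini(mu, eps=1e-12):
    """De Luca-Termini fuzzy entropy of membership values mu in [0, 1]."""
    mu = np.clip(mu, eps, 1 - eps)
    return -np.sum(mu * np.log(mu) + (1 - mu) * np.log(1 - mu))

def c_fsae_scores(X, y, p=1.0):
    """Per-feature entropy scores; higher score = less relevant feature."""
    # Scale each feature to [0, 1] so values can be treated as memberships.
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        ideal = Xc.mean(axis=0)                        # class ideal vector
        sim = (1 - np.abs(Xc - ideal) ** p) ** (1 / p) # Lukasiewicz-type similarity
        spread = Xc.std(axis=0)                        # within-class variation
        for j in range(X.shape[1]):
            scores[j] += (1 + spread[j]) * de_luca_termini(sim[:, j])
    return scores
```

In this sketch, a feature whose values cluster tightly around its class ideal vectors yields similarities near 1, hence low entropy and a low score, while a noisy feature is penalized twice: through higher entropy and through the within-class spread factor.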
The test set accuracies confirm the competitive results of the C-FSAE, which led for two data sets to the highest test set accuracy overall and for more than half of the data sets provided the highest test accuracy for at least one of the classifiers. For future research, the C-FSAE filter can be extended to a multivariate approach that accounts for local neighborhoods, which may lead to a more accurate feature ranking.

Acknowledgements The authors would like to thank the Finnish Strategic Research Council for its support, grant number 313396/MFG40 Manufacturing 4.0.


Appendix

See Tables 10, 11, 12, 13, 14, 15 and 16 for the results of the adjusted stability measure for all real-world data sets.
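The stability values below compare the feature subsets selected across cross-validation folds. As a hedged sketch in the spirit of the adjusted stability measure of Lustgarten et al. [50], which the chapter cites, the overlap of two subsets can be corrected for the overlap expected by chance and averaged over all fold pairs; the exact formula used by the authors may differ, so treat this as illustrative.

```python
from itertools import combinations

def adjusted_stability(subsets, n):
    """Average chance-adjusted overlap over all pairs of feature subsets.

    subsets: list of iterables of selected feature indices (one per fold)
    n: total number of features in the data set
    """
    def pair(A, B):
        A, B = set(A), set(B)
        expected = len(A) * len(B) / n                       # chance overlap
        denom = min(len(A), len(B)) - max(0, len(A) + len(B) - n)
        return (len(A & B) - expected) / denom if denom else 0.0
    pairs = list(combinations(subsets, 2))
    return sum(pair(A, B) for A, B in pairs) / len(pairs)
```

Identical subsets in every fold give a value below 1 (because some overlap is expected by chance), and subsets that overlap less than chance give negative values, which matches the interpretation of low table entries as unstable selections.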

Table 10 Adjusted stability measure for the Breast Cancer Wisconsin (Original) data set

| Classifier | ReliefF (10) | ReliefF (70, 20) | FSAE | C-FSAE | FES | C-FES | Fisher | Laplacian | SU | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| KNN | 0.284 | 0.293 | 0.484 | 0.486 | 0.121 | 0.128 | 0.323 | 0.225 | 0.331 | 0.297 |
| DT | 0.560 | 0.567 | 0.567 | 0.525 | 0.402 | 0.202 | 0.507 | 0.553 | 0.570 | 0.495 |
| Sim | 0.480 | 0.528 | 0.533 | 0.553 | 0.230 | 0.225 | 0.521 | 0.511 | 0.165 | 0.416 |
| SVM | 0.474 | 0.504 | 0.388 | 0.435 | 0.077 | 0.180 | 0.551 | 0.595 | 0.121 | 0.369 |
| Overall | 0.435 | 0.460 | 0.469 | 0.485 | 0.199 | 0.196 | 0.451 | 0.441 | 0.283 | 0.394 |

Table 11 Adjusted stability measure for the Chronic Kidney Disease data set

| Classifier | ReliefF (10) | ReliefF (70, 20) | FSAE | C-FSAE | FES | C-FES | Fisher | Laplacian | SU | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| KNN | 0.917 | 0.917 | 0.738 | 0.802 | 0.545 | 0.778 | 0.936 | 0.917 | 0.844 | 0.821 |
| DT | 0.895 | 0.917 | 0.816 | 0.835 | 0.553 | 0.871 | 0.950 | 0.917 | 0.844 | 0.844 |
| Sim | 0.803 | 0.813 | 0.759 | 0.801 | 0.567 | 0.663 | 0.811 | 0.792 | 0.803 | 0.757 |
| SVM | 0.872 | 0.879 | 0.802 | 0.785 | 0.567 | 0.763 | 0.857 | 0.886 | 0.803 | 0.802 |
| Overall | 0.868 | 0.878 | 0.788 | 0.811 | 0.568 | 0.765 | 0.883 | 0.876 | 0.826 | 0.806 |

Table 12 Adjusted stability measure for the Dermatology data set

| Classifier | ReliefF (10) | ReliefF (70, 20) | FSAE | C-FSAE | FES | C-FES | Fisher | Laplacian | SU | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| KNN | 0.477 | 0.520 | 0.423 | 0.781 | 0.424 | 0.329 | 0.685 | 0.610 | 0.737 | 0.554 |
| DT | 0.544 | 0.554 | 0.671 | 0.718 | 0.680 | 0.706 | 0.693 | 0.704 | 0.224 | 0.611 |
| Sim | 0.439 | 0.572 | 0.641 | 0.776 | 0.522 | 0.597 | 0.742 | 0.748 | 0.406 | 0.605 |
| SVM | 0.570 | 0.591 | 0.536 | 0.783 | 0.417 | 0.565 | 0.573 | 0.751 | 0.156 | 0.549 |
| Overall | 0.514 | 0.559 | 0.552 | 0.764 | 0.500 | 0.541 | 0.673 | 0.702 | 0.365 | 0.580 |


Table 13 Adjusted stability measure for the Diabetic Retinopathy Debrecen data set

| Classifier | ReliefF (10) | ReliefF (70, 20) | FSAE | C-FSAE | FES | C-FES | Fisher | Laplacian | SU | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| KNN | 0.473 | 0.526 | 0.537 | 0.665 | 0.436 | 0.443 | 0.463 | 0.564 | 0.485 | 0.510 |
| DT | 0.526 | 0.474 | 0.684 | 0.596 | 0.675 | 0.668 | 0.675 | 0.596 | 0.496 | 0.599 |
| Sim | 0.947 | 0.947 | 0.644 | 0.485 | 0.682 | 0.517 | 0.735 | 0.895 | 0.642 | 0.722 |
| SVM | 0.355 | 0.467 | 0.802 | 0.254 | 0.791 | 0.802 | 0.312 | 0.346 | 0.768 | 0.544 |
| Overall | 0.477 | 0.487 | 0.655 | 0.457 | 0.636 | 0.589 | 0.438 | 0.507 | 0.496 | 0.594 |

Table 14 Adjusted stability measure for the Heart Disease (Cleveland) data set

| Classifier | ReliefF (10) | ReliefF (70, 20) | FSAE | C-FSAE | FES | C-FES | Fisher | Laplacian | SU | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| KNN | 0.485 | 0.516 | 0.590 | 0.491 | 0.523 | 0.533 | 0.475 | 0.491 | 0.457 | 0.507 |
| DT | 0.495 | 0.489 | 0.481 | 0.495 | 0.668 | 0.631 | 0.568 | 0.483 | 0.458 | 0.530 |
| Sim | 0.524 | 0.502 | 0.606 | 0.583 | 0.359 | 0.359 | 0.503 | 0.517 | 0.489 | 0.494 |
| SVM | 0.494 | 0.490 | 0.414 | 0.497 | 0.430 | 0.590 | 0.469 | 0.456 | 0.454 | 0.477 |
| Overall | 0.486 | 0.492 | 0.518 | 0.495 | 0.496 | 0.526 | 0.497 | 0.477 | 0.469 | 0.502 |

Table 15 Adjusted stability measure for the Horse Colic data set

| Classifier | ReliefF (10) | ReliefF (70, 20) | FSAE | C-FSAE | FES | C-FES | Fisher | Laplacian | SU | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| KNN | 0.542 | 0.605 | 0.484 | 0.495 | 0.472 | 0.517 | 0.530 | 0.673 | 0.458 | 0.531 |
| DT | 0.517 | 0.562 | 0.381 | 0.463 | 0.542 | 0.533 | 0.478 | 0.489 | 0.480 | 0.494 |
| Sim | 0.456 | 0.468 | 0.389 | 0.534 | 0.663 | 0.711 | 0.523 | 0.502 | 0.478 | 0.525 |
| SVM | 0.536 | 0.454 | 0.384 | 0.483 | 0.681 | 0.784 | 0.487 | 0.471 | 0.530 | 0.534 |
| Overall | 0.520 | 0.521 | 0.419 | 0.486 | 0.591 | 0.636 | 0.507 | 0.509 | 0.482 | 0.521 |

Table 16 Adjusted stability measure for the Pima Indians data set

| Classifier | ReliefF (10) | ReliefF (70, 20) | FSAE | C-FSAE | FES | C-FES | Fisher | Laplacian | SU | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| KNN | 0.297 | 0.182 | 0.267 | 0.375 | 0.219 | 0.317 | 0.447 | 0.386 | 0.352 | 0.316 |
| DT | 0.264 | 0.308 | 0.333 | 0.379 | 0.189 | 0.258 | 0.381 | 0.285 | 0.369 | 0.307 |
| Sim | 0.236 | 0.411 | 0.492 | 0.236 | 0.517 | 0.569 | 0.458 | 0.333 | 0.472 | 0.414 |
| SVM | 0.343 | 0.358 | 0.394 | 0.392 | 0.414 | 0.422 | 0.442 | 0.269 | 0.259 | 0.366 |
| Overall | 0.277 | 0.298 | 0.377 | 0.350 | 0.337 | 0.392 | 0.448 | 0.342 | 0.368 | 0.351 |


References

1. Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. Artif. Intell. 97, 245–271 (1997)
2. Chandrashekar, G., Sahin, F.: A survey on feature selection methods. Comput. Electr. Eng. 40, 16–28 (2014)
3. Luukka, P.: Similarity classifier using similarity measure derived from Yu's norms in classification of medical data sets. Comput. Biol. Med. 37(8), 1133–1140 (2007)
4. Almuallim, H., Dietterich, T.: Learning Boolean concepts in the presence of many irrelevant features. Artif. Intell. 69(1–2), 279–305 (1994)
5. Piramuthu, S.: Evaluating feature selection methods for learning in data mining applications. Eur. J. Oper. Res. 156(2), 483–494 (2004)
6. Almuallim, H., Dietterich, T.G.: Learning with many irrelevant features. Proc. Ninth Nat. Conf. Artif. Intell. 91, 547–552 (1991)
7. Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal. 1(1–4), 131–156 (1997)
8. Dessì, N., Pes, B.: Similarity of feature selection methods: an empirical study across data intensive classification tasks. Expert Syst. Appl. 42(10), 4632–4642 (2015)
9. Caruana, R., Freitag, D.: Greedy attribute selection. Int. Conf. Mach. Learn. 48, 28–36 (1994)
10. Luukka, P., Leppälampi, T.: Similarity classifier with generalized mean applied to medical data. Comput. Biol. Med. 36, 1026–1040 (2006)
11. Liu, H., Motoda, H.: Feature Extraction, Construction and Selection: A Data Mining Perspective (2001)
12. Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R.P., Tang, J., Liu, H.: Feature selection: a data perspective. ACM Comput. Surv. (CSUR) 50(6), 94 (2017)
13. Kittler, J., Mardia, K.V.: Statistical pattern recognition in image analysis. J. Appl. Stat. 21(1–2), 61–75 (1994)
14. Bins, J., Draper, B.A.: Feature selection from huge feature sets. In: Proceedings of the IEEE International Conference on Computer Vision (2001)
15. Liang, J., Yang, S., Winstanley, A.: Invariant optimal feature selection: a distance discriminant and feature ranking based solution. Pattern Recogn. 41, 1429–1439 (2008)
16. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 17(4), 491–502 (2005)
17. John, G.H., Kohavi, R., Pfleger, K.: Irrelevant features and the subset selection problem. Mach. Learn. Proc. 1994, 121–129 (1994)
18. Liu, H., Setiono, R.: A probabilistic approach to feature selection—a filter solution (1996)
19. Saeys, Y., Inza, I., Larranaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)
20. Lohrmann, C., Luukka, P., Jablonska-Sabuka, M., Kauranne, T.: A combination of fuzzy similarity measures and fuzzy entropy measures for supervised feature selection. Expert Syst. Appl. 110 (2018)
21. Luukka, P.: Feature selection using fuzzy entropy measures with similarity classifier. Expert Syst. Appl. 38, 4600–4607 (2011)
22. Lohrmann, C., Luukka, P.: Using clustering for supervised feature selection to detect relevant features. In: Nicosia, G.S.V., Pardalos, P., Umeton, R., Giuffrida, G. (eds.) The Fifth International Conference on Learning, Optimization and Data Science (LOD). Springer, Cham (2019)
23. De Luca, A., Termini, S.: A definition of a nonprobabilistic entropy in the setting of fuzzy sets theory. Inf. Control 20, 301–312 (1972)
24. Parkash, O., Sharma, P., Mahajan, R.: New measures of weighted fuzzy entropy and their applications for the study of maximum weighted fuzzy entropy principle. Inf. Sci. 178, 2389–2395 (2008)
25. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley (1991)
26. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer Science+Business Media, New York (2006)


27. Yao, Y.Y., Wong, S.K., Butz, C.J.: On information-theoretic measures of attribute importance. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 133–137 (1999)
28. Bandemer, H., Näther, W.: Fuzzy Data Analysis. Springer Science+Business Media, Dordrecht (1992)
29. Miller, A.: Subset Selection in Regression (2002)
30. Singhi, S.K., Liu, H.: Feature subset selection bias for classification learning. In: ACM International Conference Proceeding Series (2006)
31. Lukasiewicz, J.: Selected Work. Cambridge University Press, Cambridge (1970)
32. Luukka, P., Saastamoinen, K., Könönen, V.: A classifier based on the maximal fuzzy similarity in the generalized Lukasiewicz-structure. In: 10th IEEE International Conference on Fuzzy Systems (2001)
33. Wolberg, W.H.: Breast Cancer Wisconsin (Original) Data Set (1992)
34. Soundarapandian, P., Rubini, L.: Chronic Kidney Disease Data Set (2015)
35. Ilter, N., Guvenir, H.A.: Dermatology Data Set (1998)
36. Antal, B., Hajdu, A.: Diabetic Retinopathy Debrecen Data Set (2014)
37. Janosi, A., Steinbrunn, W., Pfisterer, M., Detrano, R.: Heart Disease Data Set (1988)
38. McLeish, M., Cecile, M.: Horse Colic Data Set (1989)
39. National Institute of Diabetes and Digestive and Kidney Disease: Pima Indians Diabetes (1990)
40. Kononenko, I.: Estimating attributes: analysis and extensions of Relief. In: De Raedt, L., Bergadano, F. (eds.) Machine Learning: ECML-94, pp. 171–182 (1994)
41. Kononenko, I., Simec, E., Robnik-Sikonja, M.: Overcoming the myopia of inductive learning algorithms with RELIEFF. Appl. Intell. 7, 39–55 (1997)
42. Robnik-Šikonja, M., Kononenko, I.: Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn. 53(1–2), 23–69 (2003)
43. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification (2012)
44. He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: Proceedings NIPS, pp. 507–514 (2005)
45. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C. The Art of Scientific Computing, 2nd edn. Cambridge University Press (2002)
46. Cover, T.M., Hart, P.E.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967)
47. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.: Classification and Regression Trees (1984)
48. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
49. Wood, I.A., Visscher, P.M., Mengersen, K.L.: Classification based upon gene expression data: bias and precision of error rates. Bioinformatics (2007)
50. Lustgarten, J.L., Gopalakrishnan, V., Visweswaran, S.: Measuring stability of feature selection in biomedical datasets. In: AMIA Annual Symposium Proceedings, pp. 406–410 (2009)
51. Lohrmann, C., Luukka, P.: Matlab files for C-FSAE, C-FES, FSAE and FES (2020)

A Region-Based Approach for Missing Value Imputation of Cooling Technologies for the Global Thermal Power Plant Fleet Using a Decision Tree Classifier

Alena Lohrmann, Christoph Lohrmann, and Pasi Luukka

Abstract The lack of information on the current water demand of individual thermal power plants is a problem for the planning of future energy systems, especially in regions with high water scarcity and an elevated power demand. This lack is linked to the limited availability of data on the type of cooling technology installed at these power plants. In this study, we propose a hybrid decision-tree-based classification model to impute the missing values of the cooling technology for individual power plants globally. The proposed model is cross-validated on the GlobalData database and benchmarked against several approaches for missing value imputation of the cooling technology of individual power plants found in the scientific literature. The decision tree model (with an average test set accuracy of 75.02%) outperforms all alternative approaches in terms of accuracy, often by a considerable margin. In addition, for 103 out of the 137 minor regions in this study, the hybrid model yields the highest test set accuracy of all approaches. In terms of accuracy, the proposed hybrid model thus appears to outperform more general models that are based on shares or the portfolio mix in a region/country. The proposed model can be replicated and used in future studies that have different data sources at their disposal.

Keywords Machine learning · Sustainability · Water-energy nexus · Power generation · Supervised learning

A. Lohrmann (B) · C. Lohrmann · P. Luukka
School of Business and Management, LUT University, Lappeenranta, Finland
e-mail: [email protected]

C. Lohrmann
e-mail: [email protected]

P. Luukka
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
P. Luukka and J. Stoklasa (eds.), Intelligent Systems and Applications in Business and Finance, Studies in Fuzziness and Soft Computing 415, https://doi.org/10.1007/978-3-030-93699-0_5


1 Introduction

Thermoelectric generation, which is currently the most common form of electricity generation in the world, requires a considerable amount of water for its operation. The dependency of energy generation technology on water availability is commonly called the water-energy nexus. This water is mainly used for cooling purposes and is usually withdrawn from the nearest water body (river, lake, or even sea or ocean). According to a study by the U.S. Geological Survey (USGS), thermoelectric generation in the United States is responsible for about 40% of the country's total freshwater demand [1]. Another example is the electricity sector of the European Union, which is on average responsible for approximately 55% of the total water withdrawal in the region [2]. The above-mentioned shares of water use represent rough estimates since, in general, no statistical data on the water demand of specific power plants is collected and reported. The scarcity of data on the cooling technologies implemented at specific power plants complicates the assessment of the current water use in the power generation sector. Alongside other tasks and applications, the availability of data related to the current water demand of individual power plants is crucial for the planning of future energy systems, especially in regions characterized by high water scarcity and elevated power demand. The choice of cooling technology for thermoelectric generation determines the amount of water extracted for cooling purposes [3]. Hence, the availability of cooling technology data enables a more accurate assessment of the water use of specific power plants, per selected time interval, compared to the situation when the type of cooling system is unknown. The problem of the limited availability of data on the type of cooling technology installed at individual power plants has been raised by many scholars. In particular, Spang et al. [4] reported that the World Electric Power Plants database [5], currently one of the most comprehensive power plant databases in the world, contains cooling technology information for only about 37% of the power plants in the database. In addition, Davies et al. [6] noted that not only do the currently available power plant databases lack cooling system data, but many widely used inventories of electricity generation (for example, [7] and [8]) also do not contain this information. To overcome this data constraint, previous studies assigned the missing cooling system information for individual power plants using either (a) literature-based shares of cooling technologies in the region [6], (b) the cooling portfolio mix of the generator-fuel type from known power plants in each specific country [4], (c) historical instalments of different types of cooling systems [9], or (d) the geographical location of specific power plants in relation to different water bodies: large rivers [9] and the ocean coastline [6]. Table 1 contains an overview of the existing approaches to overcome this constraint and assign cooling technologies to individual power plants. The approaches were selected based on (1) their novelty and (2) their wide application in the research field of the water-energy nexus. For instance, the method presented by Vassolo and Döll [9] (see Table 1) was
developed further by Flörke et al. [10] and became the methodological basis of the commonly used Global Water System Project (GWSP) Digital Water Atlas [11]. To address the above-mentioned concern of the limited availability of cooling technology data, this study aims: (1) to propose a new machine-learning-based approach of "filling the gaps", meaning conducting missing value imputation, in the cooling system data using a decision tree classifier, and (2) to assess the performance (test accuracy) of the new method and compare it with the performance of the existing approaches.

Table 1 Brief description of the existing approaches for missing value imputation for cooling technologies

Biesheuvel et al. [12]: 1. Thermal power plants located within 20 km of the coastline are assumed to use once-through cooling. 2. The cooling systems of other thermal power plants are randomly assigned based on the shares of cooling technologies in each specific region

Davies et al. [6]: 1. The cooling systems of all freshwater-cooled thermal power plants are randomly assigned based on the shares of cooling technologies in each specific region and each type of fuel. In total, 14 regions were considered. The shares of cooling technologies were calculated using the information collected from the literature (reports, surveys, datasets). It was ensured that the resulting water demand was the same as the literature-based water demand estimates. 2. Seawater cooling was assigned to a certain share of power plants using once-through cooling systems in each specific region

Vassolo and Döll [9]: 1. United States and Canada: The cooling systems of all thermal power plants are randomly assigned based on the shares of cooling technologies in each specific year (reported for the US and Canada). 2. All other countries: All thermal power plants are assigned to geographical "cells" for which the total electricity generation was plotted against the estimated river discharge (computed by the WaterGAP hydrology model [13]). Based on the results, the cooling systems are assigned to either once-through cooling or cooling towers

ECOFYS [14]: 1. All thermal power plants which are located nearest to the coastline are assumed to use once-through cooling. 2. All thermal power plants which are located next to a freshwater source (rivers, lakes, channels, etc.) are assumed to use cooling towers

Spang et al. [4]: The cooling systems of thermal power plants are randomly assigned based on the shares of cooling technologies in each specific country (by generator and fuel type)

Lohrmann et al. [15]: Hybrid approach. In the first step, cooling system identification is performed using aerial images. In the second step, in order to fill in the data gaps (so-called "aggregated capacities", which account for about 10% of the power plant data), the unknown thermal power plants are assigned the historically most common combination of generator type and cooling technology (in each specific region and each type of fuel). If this combination was not available for the specific region and the given type of fuel, then the globally most common combination of generator type and cooling technology (for the given type of fuel) was assigned
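Several of the approaches in Table 1 are distance-based heuristics. As an illustrative sketch (not from any of the cited studies; the function names, the coastline points, and the way the 20 km threshold is applied are assumptions for illustration), the coastline rule can be expressed with a great-circle distance check:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def impute_by_coastline(plant, coastline_points, threshold_km=20.0):
    """Assign once-through cooling if the plant lies within 20 km of the coast."""
    lat, lon = plant
    near_coast = any(haversine_km(lat, lon, clat, clon) <= threshold_km
                     for clat, clon in coastline_points)
    return "once-through" if near_coast else "unknown"
```

In a full implementation, plants left "unknown" would then be assigned randomly according to the regional shares of cooling technologies, as described in the table.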


The reasons behind the choice of the decision tree classifier for the cooling technology missing value imputation are threefold. First, decision trees have the advantage that they can be interpreted easily, since they provide a set of rules to explain their class assignment for observations [16–18]. Second, they are computationally efficient [17, 18] and are embedded feature selection methods, which means that they include the selection of relevant features in the model training [19]. Third, decision trees have already been applied successfully in a wide variety of contexts, from medicine [20] to remote sensing [21], machine vision [22] and bankruptcy prediction [23].

This study is structured as follows: First, we provide an overview of the power plant database that was selected for modeling and testing the proposed approach. The explanatory and dependent variables selected during this step ensure that the proposed model can be replicated in future studies. Secondly, the idea and concept behind the decision tree as a classification algorithm are described. Subsequently, it is considered which information on the location of each individual power plant (country or minor region) will be deployed for modeling. Next, a hybrid model based on two separate decision tree models, a minor region model and a major region model, is proposed. This hybrid model aims to overcome the limitations imposed by an insufficient number of observations for certain regions. Lastly, the accuracy of the developed hybrid model is compared with the accuracies of the alternative approaches presented in Table 1. This step is performed in order to additionally validate the results of the proposed model. For this purpose, each of the alternative approaches is modelled and tested on the same dataset. The results of the comparison are presented in Sect. 3 of this study.

2 Data and Methodology

2.1 Power Plant Data

The main source of power plant data for this study was the GlobalData dataset [24]. In general, the GlobalData dataset provides comprehensive information on over 136,000 registered power plants globally, encompassing over 170 pieces of information (although not all fields are present for every technology). This database was complemented and corrected by Farfan and Breyer [25] using information gathered from other datasets (e.g. [26, 27]). Subsequently, this dataset was filtered for thermal power plants (coal, gas, nuclear and oil) exceeding 50 MW. In the study by Lohrmann et al. [15], this subset was then further enhanced by adding information concerning the exact location of each power plant unit, the installed cooling system and the type of water used for cooling (sea- or freshwater). Of the 13,865 individual power units in the power plant dataset provided by Lohrmann et al. [15], 1079 are marked as "aggregated capacities". These power plants were added by Farfan and Breyer [25] to represent additional capacities to known power plants due to refurbishments, or separate power plants with no known location, name or capacity. These capacities may also include off-grid installations or privately-owned capacities that serve industrial complexes outside of grid control and are thus not individually reported or known to the grid. Since these "aggregated capacities" are not individually accounted for, they were filtered out for this study. The final data subset contains 12,786 observations (individual power units) located in 147 countries worldwide. Figure 1 depicts the exact location of the individual power units for the analysis and their installed cooling systems.

Fig. 1 Geographical distribution of cooling technologies utilized by individual power plants

For each observation (each specific power unit), the following information is available: (1) the type of fuel used, (2) the location (latitude and longitude, country, minor region and major region), (3) the active capacity, (4) the commission year, (5) the type of generation technology, (6) the type of installed cooling system, and (7) the type of water used for cooling. The analysis of the potential link between the active capacity and cooling technology at unit level might lead to incorrect results since cooling systems are typically designed for entire power plants and not for separate units. Thus, based on the available information regarding the location, the fuel type and the capacity of individual power units, these units were combined into 5,778 power plants and the actual size (capacity) of each power plant was calculated. These calculated total capacity values for each specific power plant were added as a new variable "Active capacity of power plant" to the data. The descriptive statistics of the variables (both technology-related and location-related) are provided in Table 2. Since the focus of this study is on the missing value imputation of the cooling system, the variable "Type of cooling" is listed in the table as the dependent variable.
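The aggregation of power units into power plants described above can be sketched as follows; the tuple fields and example records are illustrative and do not reflect the actual GlobalData schema:

```python
from collections import defaultdict

# Hypothetical unit records: (latitude, longitude, fuel, unit_capacity_mw)
units = [
    (19.5, -99.1, "gas", 200.0),
    (19.5, -99.1, "gas", 350.0),   # same site and fuel -> same plant
    (60.2, 25.0, "coal", 500.0),
]

# Group units sharing location and fuel into one plant and sum their capacities
plant_capacity = defaultdict(float)
for lat, lon, fuel, cap in units:
    plant_capacity[(lat, lon, fuel)] += cap

# Attach the plant-level total back to each unit as a new variable
units_with_plant_cap = [
    (lat, lon, fuel, cap, plant_capacity[(lat, lon, fuel)])
    for lat, lon, fuel, cap in units
]
```

The first two units share a location and fuel type, so both receive a plant-level capacity of 550 MW as their "Active capacity of power plant" value.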
The dependent variable has five unique cooling types: dry, inlet, once-through, pond and tower.

Table 2 Descriptive statistics of the variables included in the study

Dependent variable: Type of cooling (categorical: dry, inlet, once-through, pond, tower)

Independent variables, technology-related:
1. Type of fuel (categorical: coal, gas, nuclear, oil)
2. Active capacity of power unit (continuous: min=50.0, Q1=98.1, median=198.0, Q3=386, max=2910)
3. Active capacity of power plant (continuous: min=50.0, Q1=261.0, median=650.0, Q3=1350.0, max=7965)
4. Commission year (continuous: min=1923, Q1=1976, median=1994, Q3=2005, max=2015)
5. Type of generator (categorical: generic, combined cycle, steam, subcritical, supercritical, IGCC)
6. Type of water used for cooling (categorical, binary: freshwater or seawater)

Independent variables, location-related:
7. Country (categorical: 147 countries)
8. Major region (categorical: 9 major regions)
9. Minor region (categorical: 137 minor regions)

As shown in Table 2, three variables related to the location of each power unit were selected for the analysis: identifiers for the country, major region and minor region. The link between these three variables is discussed in Sect. 2.3 and shown in Table 3 of the Appendix. In general, the independent variables for the study are variables that are typically included in the existing (commercial and free) power plant databases. Hence, the proposed model for the cooling system missing value imputation can be replicated and applied in future studies, which might use different data sources. As shown in Fig. 1, the power plant units in the analysis are not evenly distributed across the Earth's surface. As shown by Lohrmann et al. [15], about 33.4% of the global thermal power fleet is located within 20 km of the ocean coastline and about 55.5% is positioned within 5 km of the main global lakes and rivers. Besides that, the number of observations (individual power units available for the analysis) varies across regions. Figure 2 depicts the number of power plants in each specific minor region and the shares of cooling technologies observed in each of them. The highest number of observations is available for the major region "North America", where the number of observations reaches 681 per minor region. The fewest observations are found in the major region "Africa": more than 46% of its minor regions have fewer than 10 observations. On average, 93 observations per minor region are available for the analysis. Six minor regions ('PAM', 'KH', 'INN', 'AF', 'SOMDJ' and 'CSA'; see Table 3 of the Appendix for more information concerning the countries included in these regions) have fewer than five observations.
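The per-region observation counts discussed above can be reproduced with a simple tally; the region codes and records below are illustrative:

```python
from collections import Counter

# Hypothetical (minor_region, cooling_type) pairs for individual power units
observations = [
    ("MX-S", "tower"), ("MX-S", "once-through"), ("MX-S", "dry"),
    ("PAM", "tower"), ("PAM", "tower"),
]

counts = Counter(region for region, _ in observations)

# Regions below the K = 5 threshold cannot support their own minor region model
too_small = sorted(r for r, n in counts.items() if n < 5)
```

In the full dataset, regions flagged in this way are exactly those for which the hybrid model later falls back to the major region model.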

Fig. 2 Number of observations (power units) in each specific minor region and the shares of cooling technologies observed. Minor regions are grouped into nine major regions



2.2 Decision Tree Classifier

Assigning a cooling technology to specific power plants can be regarded as a classification problem, which describes a problem where the characteristics of observations in the data can be used to assign them to discrete classes [16]. In the context of this work, the observations are the individual power plants, the characteristics are the feature (= variable) values that are known about a power plant (e.g. sea- or freshwater, active capacity, year of commissioning, etc.), and the discrete classes are the five distinct types of cooling technology that any power plant in this data set can take. The focus in this research will be on decision trees for the classification of each power plant's cooling technology (target feature). Decision trees are a class of machine learning techniques that can address classification and regression problems. The objective of a classification tree is to divide and subdivide the feature space [18, 28]. The underlying idea for classification is that after partitioning the space into disjoint, meaning non-overlapping, sets, each observation can be classified by determining into which partition it belongs and assigning it to the corresponding class [28]. A decision tree can be constructed using decision tree inducers, meaning algorithms that automatically set up a decision tree for a data set based on some objective function, such as the generalization error of the tree [18]. Popular decision tree inducers are CART [29], CN2 [30] and C4.5 [31], which induce the trees top-down, meaning starting from the root of the tree and recursively subdividing the space ("divide and conquer") until the tree leaves are reached [18]. Hence, the algorithm creates the tree starting from the root, with branches extending to the leaves, where each path to a leaf represents a set of rules (discrete functions) for the features that predict the class of each observation [18].
A simple example of a (binary) decision tree in the context of cooling technology classification is illustrated in Fig. 3 for the power plants in the region "Mexico South" (MX-S). The example was selected since it only uses two features to achieve a test set accuracy of about 100%, thus allowing easy visualization of the feature space while showing intuitive and accurate results. Figure 3 displays the decision tree, starting from the root node and extending from the top down to the bottom, where the leaves are located. Each split represents a rule. The first split is conducted for the categorical variable representing the type of water, freshwater (0) or seawater (1), leading to all freshwater-cooled power plants being classified as tower cooled. On the right side, the tree extends further, applying a second rule (discrete function) to those power plants that are seawater cooled (first rule). Those power plants are classified as dry cooled in case their active capacity per unit is below 126 MW and as once-through cooled in case this capacity is 126 MW or more. Overall, this decision tree illustrates how the recursive subdivision of the feature space can be represented by following the path set out by the rules and, eventually, ending up at a leaf that is associated with a class value [16, 18]. Figure 3 illustrates (using shading) how the rules of the decision tree subdivide the two-dimensional feature space in this example. In addition, it shows that the cooling technologies of all power plants, as indicated by the color of the points (power plants),

A Region-Based Approach for Missing Value Imputation …

(a) Decision tree

101

(b) Data subdivision

Fig. 3 Decision Tree for the classification of the cooling technology of power plants (Region: “Mexico South [MX-S]). The variable “Type of water” is binary and the variation on the x-axis (b) is based on a small level of random jitter added to reduce the overlap of the points on the x-axis

are all located in the decision region (subspace) that correctly corresponds to their actual cooling technology. For the description of how a decision tree is grown, let us assume a data set X with n observations and D features, as well as a corresponding vector of class labels Y with k unique, discrete classes. Our goal is to divide the space of X into k disjoint subspaces so that each observation is assigned to one of the k classes based on the subspace it is in [28]. The growing of a decision tree can be formulated as depicted in Fig. 4. In this research, a decision tree using the CART algorithm will be deployed, which applies the Gini index, a univariate impurity measure, to conduct the splits of the feature space (splitting criterion) [28]. The term "univariate" means that the splits are conducted only according to the value of a single feature and not using multiple features [18]. The Gini index is applied to the probability distribution p = (p_1, p_2, ..., p_k) of the class labels Y in a specific subspace, which can have k discrete classes. In this context, this represents the shares of cooling technologies at a node. Using these shares, the Gini index can be defined as [16, 18]:

G = 1 − ∑_{i ∈ 1,...,k} p_i²    (1)

where the squared shares of each class, denoted p_i², are summed and subtracted from one. Thus, if there were only one class in a subspace, hence taking a share of one, this would lead to a Gini index of zero, indicating the maximum purity for the subspace [18]. In contrast, the minimum purity (= maximum impurity) is reached when the subspace contains equal proportions of all classes, leading to the highest Gini index. Thus, a low Gini index is desirable since it reflects a subspace with a high share for one or comparably few classes [16].
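As a minimal sketch of Eq. (1) (the study itself uses MATLAB; plain Python is used here for illustration):

```python
def gini(shares):
    """Gini index G = 1 - sum(p_i^2) for the class shares p_i at a node."""
    return 1.0 - sum(p * p for p in shares)
```

For example, a pure node with a single class has gini([1.0]) = 0, while a node with five equally represented classes reaches the maximum impurity of 1 - 5 * 0.2² = 0.8.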


Fig. 4 Pseudo-code for growing a decision tree (modified from [18])
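The pseudo-code of Fig. 4 is not reproduced here. The following is a minimal, illustrative Python sketch of top-down tree induction with a univariate Gini splitting criterion and a minimum-leaf-size stopping rule; it is not the CART implementation used in the study (no pruning or cross-validation), and all names are illustrative:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Find the univariate threshold split with the lowest weighted Gini index."""
    best = None  # (weighted impurity, feature index, threshold)
    for d in range(len(X[0])):
        for t in sorted({row[d] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[d] <= t]
            right = [y[i] for i, row in enumerate(X) if row[d] > t]
            if not left or not right:
                continue
            g = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(y)
            if best is None or g < best[0]:
                best = (g, d, t)
    return best

def grow_tree(X, y, min_leaf=1):
    """Recursively grow a binary tree: dict nodes, class-label leaves."""
    if len(set(y)) == 1:          # pure node -> leaf
        return y[0]
    split = best_split(X, y)
    if split is None:             # no valid split -> majority leaf
        return Counter(y).most_common(1)[0][0]
    _, d, t = split
    li = [i for i, row in enumerate(X) if row[d] <= t]
    ri = [i for i, row in enumerate(X) if row[d] > t]
    if len(li) < min_leaf or len(ri) < min_leaf:
        return Counter(y).most_common(1)[0][0]  # stopping rule: leaf too small
    return {"feature": d, "threshold": t,
            "left": grow_tree([X[i] for i in li], [y[i] for i in li], min_leaf),
            "right": grow_tree([X[i] for i in ri], [y[i] for i in ri], min_leaf)}

def predict(tree, x):
    """Follow the rules from the root to a leaf and return its class."""
    while isinstance(tree, dict):
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree
```

On a toy version of the "Mexico South" example (features: type of water, unit capacity), this sketch recovers the two rules shown in Fig. 3.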

In this study, the parameter "minimum leaf size" of the decision tree, meaning the minimum number of observations that need to be at a (leaf) node, is optimized via cross-validation during the model training. Thus, the stopping criterion is that no split can be conducted if it would lead to a number of observations in any of the leaves that is below the specified minimum leaf size [18]. For many tree inducers, the growing of a tree is succeeded by a tree pruning phase, which aims to remove branches of the tree that do not help to improve the generalization error considerably, thus avoiding overfitting the tree [18]. In CART, a 5-fold cross-validation is deployed to estimate the generalization error of the tree [28]. Using cross-validation is one key characteristic of the CART algorithm since it replaces the problem of pruning the trained tree [17]. In this study, the programming software MATLAB and the Statistics and Machine Learning Toolbox are used for the training of the decision tree and all analyses conducted.

2.3 Different Approaches to Account for the Geographical Location of Power Plants

In this study, a model based on decision trees is deployed to assign the cooling technology to specific power plants. The classification is conducted for power plants
for each geographical location separately in order to be able to capture local/regional idiosyncrasies that a global model may not capture. As shown in Table 1, the location of power plants was accounted for differently in various studies. In particular, in the studies of Biesheuvel et al. [12] and Davies et al. [6], the missing value imputation for the cooling technology was conducted at the region level. In contrast to that, Vassolo and Döll [9] and Spang et al. [4] assigned cooling technologies based on the country-specific location of each individual power plant. As mentioned previously, the power plant units in this analysis are geographically located in 147 countries worldwide. However, from another viewpoint, if the division of the world into 145 minor and 9 major regions is applied (as was done in [15]), then the above-mentioned individual power units are located in 137 minor and 9 major regions. The eight minor regions with zero observations were not included in this study. The difference between the "minor region-" and "country-" based approaches can be emphasized by highlighting the most common cooling technology in each individual region and each individual country. Figure 5 shows this difference in the most common cooling technologies (based on the number of observations, for all fuel and generation types) in the 147 countries (Fig. 5a) and in the 137 minor regions (Fig. 5b). It is evident from the figure that the two different perspectives can lead to different results, especially in cases when (1) a single country comprises several minor regions (e.g. Australia, Brazil, Japan, Russia, the United States), and (2) a single minor region includes several countries (e.g. "AR-NE", "CAU", "SW", "WS"; see Table 3 of the Appendix for more information concerning the countries included in these regions). The question arises from which perspective (country or minor region) the missing cooling technology values can be imputed more accurately.
In addition, the selected geographical perspective should (1) ensure a good representation of the geographic and climatic conditions of the individual power plants' locations, and (2) contain enough observations for the decision tree classifier. For this reason, a decision tree model was trained separately for (1) the 147 countries and (2) the 137 minor regions. Sect. 3.1 of this study presents the results of this comparison. Since the minor region model has shown a higher average test accuracy than the country model, it was selected as the perspective of the decision tree-based model in this study.

Fig. 5 Distribution of major cooling technologies (based on the number of power plants of all fuel and generation types) in 147 countries (a) and in 137 minor regions (b). Grey color indicates that no power plants were recorded for that specific region/country. Black lines represent the borders between countries/regions

2.4 Overcoming the Problem of Data Scarcity

A challenge associated with building a classifier for each minor region is that there is a limited number of observations for some minor regions, which severely limits the possibility to properly train and validate a model learned from these very few observations. In contrast to that, a model trained for each of the nine major regions will on average have a substantially higher number of observations for model training and validation and, thus, will not face this challenge. However, major region

models will not be able to capture idiosyncrasies of the much smaller minor regions contained in them, leading to lower classification accuracies overall. It is apparent from this explanation that using minor region models for minor regions with sufficient observations is desirable, whereas using a major region model is preferable for minor regions that do not have sufficient observations to train a robust minor region model. As a consequence, in this study a hybrid decision tree-based model is set up that selects for each minor region whether a minor region or a major region model should be applied for the classification of the cooling technology of the power plants in that minor region. The model is referred to as hybrid since it applies one of two models (minor region or major region model) depending on the specific minor region at hand. A simple flowchart for the hybrid model is presented in Fig. 6. Even though the hybrid model conducts the missing value imputation for the power plants in each minor region, the model starts from the observations of the major regions in order to set up the major region model first. The first step is the
division of the data for a major region into the training and test data sets for the K-fold cross-validation, where K in this study is set to 5. This is a common choice for K and also restricts the minimum number of observations required within a region to be able to conduct the cross-validation to only 5. Since the major region model should be trained on data containing a representative share of the cooling technologies (classes) as well as of the power plants in the minor regions, a stratified random sampling with respect to the class label as well as to the minor region is conducted. This ensures that in the data used to train the model (training data set) and the data used to test the estimated generalization on future power plants that were not used in setting up the model (test data set), the shares of observations of these two features (class label, minor region) are similar. Conducting the data split in this way also ensures that the major region model is trained on observations from the comparably smaller minor regions. The only exception exists for those minor regions that have fewer than five observations, which are not included in the data split and for which all observations are assigned to the test set (see Appendix Fig. 10 for details). Subsequently, a decision tree model is trained on the training data, including a parameter selection for the minimum leaf size of the decision tree, and the corresponding validation accuracy is recorded. The next steps are run over all minor regions contained in the current major region for which a major region model was just trained. In particular, for each minor region the decision is made whether there are sufficient observations (power plants) in this minor region in order to train the corresponding minor region model (the threshold being K = 5 observations). If there are enough observations, then those observations from the training data belonging to that minor region are used to train the minor region model. The model training also includes a parameter selection, and the validation accuracy for the minor region model is recorded. At this point, for the current minor region the validation accuracies of both the minor region model and the major region model are known. Thus, a decision is made as to which of the two models is likely to generalize better. In particular, if the major region model's validation accuracy exceeds the minor region model's validation accuracy by at least 10 percentage points, the major region model will be selected, since it is assumed that it will generalize better even though it is trained on multiple minor regions and is, thus, more general. Otherwise, the minor region model is selected. Lastly, the test set accuracy of the selected model on the independent test set for the current minor region is calculated. This step is repeated for all minor regions in the current major region, and then this procedure is repeated for each remaining major region.

Fig. 6 Flowchart of the hybrid model for cooling technology classification
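The model-selection step of the hybrid approach (choosing between the minor and the major region model) can be sketched as follows; the function name is illustrative, while the 10-percentage-point margin is taken from the text:

```python
def select_model(minor_val_acc, major_val_acc, margin=0.10):
    """Choose between the minor and major region model.

    The major region model is selected only if its validation accuracy
    exceeds the minor region model's by at least `margin` (10 percentage
    points), or if no minor region model could be trained at all.
    """
    if minor_val_acc is None:  # fewer than K = 5 observations in the region
        return "major"
    if major_val_acc - minor_val_acc >= margin:
        return "major"
    return "minor"
```

The asymmetric margin biases the choice toward the more specific minor region model, which is preferred unless the major region model is clearly better on validation data.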

2.5 Validation of Results

After the development of the hybrid model and obtaining the classification results, it is crucial to investigate whether the performance of the proposed hybrid model exceeds that of the already existing approaches presented in Table 1 for missing value imputation of the cooling technology. The accuracies of these “alternative” approaches were computed using the same dataset and 5-fold cross-validation, where the model was trained on the training data set and its generalization ability tested on the test data set. It should be noted that the approach presented by Vassolo and Döll [9] was modelled for Canada and the United States only. This limitation was caused by the fact that it was not feasible to replicate the results of the WaterGAP hydrology model [13] for this study. In addition, the approach developed by Spang et al. [4] was computed for the minor regions featured in the dataset (not for countries, as in their study). This was done to ensure the compatibility of all results (in particular, test accuracies). The authors acknowledge that the above-mentioned changes in methodology might have influenced the estimated performance of these approaches.
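The 5-fold evaluation used for this benchmarking can be sketched with the standard library alone (illustrative only: `kfold_indices`, `cross_validated_accuracy` and the toy labels are hypothetical stand-ins, not the study's data or models):

```python
# Minimal sketch of 5-fold cross-validated accuracy for an imputation rule.
import random

def kfold_indices(n, k=5, seed=0):
    """Split indices 0..n-1 into k shuffled, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_accuracy(labels, predict, k=5):
    """Average test-fold accuracy of a rule `predict(train_labels)` that
    returns one imputed class for every test observation."""
    folds = kfold_indices(len(labels), k)
    accs = []
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        guess = predict([labels[j] for j in train])
        accs.append(sum(labels[j] == guess for j in test) / len(test))
    return sum(accs) / k

# Example rule: impute the most common cooling type seen in the training folds.
labels = ["wet tower"] * 7 + ["once-through"] * 3
majority = lambda train: max(set(train), key=train.count)
print(cross_validated_accuracy(labels, majority))
```

Each alternative approach is plugged in as a different `predict` rule, so all approaches are scored on identical train/test splits.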

A Region-Based Approach for Missing Value Imputation …

Fig. 7 Average test accuracy for the country model (a) and the minor region model (b), aggregated by nine global major regions

3 Results

3.1 Country Model Versus Minor Region Model

As mentioned in Sect. 2.3, the first step was the selection of the perspective for the classification (minor region or country). After the 5-fold cross-validation for this step was completed, five test set accuracies for each of the 137 minor regions and 147 countries were determined, and the average test set accuracies were calculated. The average test accuracy of the minor region model was estimated at 72.82%, whereas the average test accuracy of the country model was 60.63%. These averages also include zero values, which were obtained when the classification model could not be trained due to an insufficient number of observations (power units) for the respective minor region or country in the dataset. To enable a visual comparison of the results, they were aggregated into nine major regions (see Table 3 of the Appendix for the countries and minor regions included in each major region). The results are presented in Fig. 7. The figure shows the min-max intervals and the median values of the average test accuracies estimated for the minor regions/countries that constitute each of the nine major regions. In general, the country model shows a very good generalization ability: in all major regions (except for India SAARC) there are cases where the test accuracy of


the model reaches 100%. Here, however, it is crucial to mention that for 34 of the 147 countries in the data set (about 23%), the country model either has an insufficient number of observations (fewer than 5) to train the decision tree classifier under 5-fold cross-validation or does not generalize well (the average accuracy was zero). Almost half of these countries (16) are located in Africa, which is why the median test accuracy for the Africa major region is 0% (Fig. 7a). In contrast to the country model, the minor region model showed 0% accuracy for just 7 minor regions (each of which had five or fewer observations). Thus, combining several countries with few observations into minor regions has the advantage that there are fewer cases with too few observations for model training than under the country-based approach. As shown in Fig. 7, the minor region model demonstrated smaller min-max intervals of the estimated average accuracies than the country model. For the minor region model, the min-max intervals tend to be located in the upper half of the figure (in many cases not reaching 100%), while for the country model the accuracies typically range from 0% to 100%. This implies that even if the region model does not necessarily guarantee a higher performance than the country model, it is less likely to produce low test accuracies (40% and lower). Thus, the main limitation of the country model is the insufficient number of observations for many countries for training the decision tree classifier. This limitation is solely related to the dataset used in this study. In general, the country model has shown a very good performance as well: if the countries with an estimated test accuracy of 0% are excluded, the average test accuracy reaches 78.88%.
Therefore, the country model also appears suitable for use in future studies when (1) there are enough observations for model training (at least 5 observations for the 5-fold cross-validation), and/or (2) the division of the world into minor regions is not feasible (or the study needs to be conducted at the country level). In summary, due to the above-mentioned limitations of the country model, the minor region model was selected as the basis for the development of the hybrid classification model.
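The two summaries used in this subsection, the per-major-region min/median/max intervals of Fig. 7 and the average that excludes the 0% countries, can be sketched as follows (illustrative accuracy values; only the region names follow the paper):

```python
# Sketch of the aggregation behind Fig. 7 and of the zero-excluding average.
# Zero accuracies mark regions whose model could not be trained or did not
# generalize.
from statistics import median

def interval_by_major_region(acc, to_major):
    """Min, median and max of the average test accuracies per major region."""
    grouped = {}
    for region, a in acc.items():
        grouped.setdefault(to_major[region], []).append(a)
    return {m: (min(v), median(v), max(v)) for m, v in grouped.items()}

def mean_excluding_zeros(acc):
    """Average accuracy over models that could actually be trained."""
    nonzero = [a for a in acc.values() if a > 0]
    return sum(nonzero) / len(nonzero)

acc = {"NO": 1.00, "DK": 0.88, "DZ": 0.743, "SOMDJ": 0.0}
to_major = {"NO": "Europe", "DK": "Europe", "DZ": "MENA", "SOMDJ": "Africa"}
print(interval_by_major_region(acc, to_major)["Europe"])
print(mean_excluding_zeros(acc))
```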

3.2 Performance of Hybrid Model

The next step of the study was to evaluate the performance of the hybrid model. The results show that the addition of the “major regions” component to the classification model has improved the overall performance of the approach: the average test accuracy increased from 72.82% to 75.02%. The difference in performance rests on two aspects: first, the hybrid model enables the classification of observations in minor regions with too few observations using the corresponding major region model; second, the major region model was in a few cases (e.g. LK/Sri Lanka) also applied when it was likely to generalize better in a minor region than the corresponding minor region model.
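At prediction time, this two-component behaviour amounts to a simple dispatch. A minimal sketch (hypothetical structure: `minor_models` and `major_models` map region names to trained classifiers, stubbed here as plain functions):

```python
# Sketch of the hybrid model's prediction-time dispatch (not the authors'
# implementation; classifiers are stubbed as lambdas).

def hybrid_predict(plant, minor_models, major_models, use_major):
    """Predict the cooling technology for `plant`.

    Falls back to the plant's major region model when no minor region model
    exists (too few training observations) or when the major region model
    was selected during validation (`use_major`).
    """
    minor, major = plant["minor_region"], plant["major_region"]
    if minor not in minor_models or use_major.get(minor, False):
        return major_models[major](plant)
    return minor_models[minor](plant)

minor_models = {"LK": lambda p: "once-through"}
major_models = {"India_SAARC": lambda p: "wet tower"}
use_major = {"LK": True}   # e.g. the LK/Sri Lanka case mentioned above
plant = {"minor_region": "LK", "major_region": "India_SAARC"}
print(hybrid_predict(plant, minor_models, major_models, use_major))  # wet tower
```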


Figure 8 provides a comparison of the existing approaches for missing value imputation of the cooling technology based on the estimated average test set accuracy. As depicted in the figure, the hybrid model was estimated to perform, on average, better than all other approaches selected for this study. Both components of the hybrid model, the minor region model and the major region model, are plotted separately in the figure. The values of the test set accuracy for these models were obtained using the same training and test data sets. Unsurprisingly, the test set accuracy of the major region model is lower than that of the minor region model, since the larger-scale model is less capable of capturing the idiosyncrasies of the selection of the cooling type for power plants in the corresponding minor region. As shown in Fig. 8, three of the selected approaches [6, 9, 14] demonstrate an average test set accuracy of about 35%. Such low performance can be explained by the fact that studies like Davies et al. [6] and ECOFYS [14] typically focus on the assessment of country- and region-level water demand. Hence, these approaches might be effective for large-scale water demand calculations but perform poorly for missing value imputation of the cooling technology of individual power plants. At the same time, for the missing value imputation of the cooling technology of power plants not currently in the data set, using a classifier that tries to identify the patterns underlying the choice of the cooling technology appears more plausible and is likely more accurate in the future than using a portfolio mix, which may change over time. It is crucial to mention that the approach presented by Vassolo and Döll [9] was computed for Canada and the United States only; thus, the results are the estimated
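The contrast drawn above, a share-based portfolio-mix imputation versus a classifier over plant characteristics, can be illustrated with toy data (all names and rules hypothetical; the paper's classifier is a decision tree, stubbed here by a single rule):

```python
# Sketch contrasting a share-based ("portfolio mix") imputation with a
# pattern-based one (toy data; labels are cooling types).
from collections import Counter

def portfolio_mix_impute(region_labels):
    """Impute the region's most common cooling type for every unknown plant."""
    return Counter(region_labels).most_common(1)[0][0]

def pattern_based_impute(plant, rules):
    """Impute from plant characteristics via simple rules (a stand-in for
    the decision tree): here, fuel type alone decides."""
    return rules.get(plant["fuel"], "wet tower")

region_labels = ["wet tower", "wet tower", "once-through"]
rules = {"nuclear": "once-through"}
print(portfolio_mix_impute(region_labels))               # wet tower
print(pattern_based_impute({"fuel": "nuclear"}, rules))  # once-through
```

The portfolio mix assigns every unknown plant the same label regardless of its characteristics, which is why such approaches can work for aggregate water demand yet fail for individual plants.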

Fig. 8 Comparison of the existing approaches for the cooling technology missing value imputation—average global test accuracy

Fig. 9 Comparison of the existing approaches for the cooling technology missing value imputation, per minor region. The circles represent the average test set accuracy obtained for each approach for each specific minor region



average test set accuracies for only these two countries. The authors of this study acknowledge that the approach based on the estimated river discharge, which Vassolo and Döll [9] implemented for all other countries, might have a considerably higher accuracy than the values presented for Canada and the United States. A suitable approach for missing value imputation of the cooling system should demonstrate high performance not only at the global level but, optimally, also for each specific minor region. Hence, another way to evaluate the proposed hybrid model is to compare it with the alternative approaches for each minor region separately. Figure 9 illustrates the results of this comparison. Here, the values of the expected performance of the hybrid model are plotted against the estimated average test accuracies of the alternative approaches for each specific minor region. In addition, the numerical values of the average test set accuracy as well as their standard deviations are provided in Table 4 of the Appendix. As seen from the figure, in the majority of cases (103 out of 137 minor regions) the hybrid model performs at least as well as or better than any of the alternative approaches. As mentioned previously, the main driving force behind the development of the hybrid approach was the low performance of the minor region model in cases with a small number of observations, or the inability to train a model for the same reason. The results demonstrate that, compared to the alternative approaches, the proposed hybrid model achieves considerably higher performance in regions with few (1–5) observations. Except for the minor region SOMDJ, the addition of the “major region component” to the initial minor region models improved the performance of the classifier from 0% to, on average, 51.67% for the regions with a low number of observations.
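The per-region comparison behind the 103-out-of-137 count can be sketched as follows (illustrative accuracies, not the values of Table 4):

```python
# Sketch of counting the minor regions where the hybrid model is at least as
# accurate as every alternative approach (toy numbers).

def count_wins(hybrid_acc, alternative_accs):
    """Number of regions where the hybrid model matches or beats the best
    alternative approach."""
    return sum(
        hybrid_acc[r] >= max(accs[r] for accs in alternative_accs)
        for r in hybrid_acc
    )

hybrid = {"NO": 1.00, "BLT": 0.714, "SOMDJ": 0.0}
alt_a = {"NO": 1.00, "BLT": 0.514, "SOMDJ": 0.0}
alt_b = {"NO": 0.20, "BLT": 0.618, "SOMDJ": 0.0}
print(count_wins(hybrid, [alt_a, alt_b]))  # 3
```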

4 Conclusion

This work focuses on the imputation of missing values for the cooling technology of individual power plants. The proposed hybrid model combines two classifiers: the first considers the location of power plants in nine major regions; the second takes into account their position with regard to 145 minor regions. Whereas the division of the world into nine major regions is commonly accepted and widely used, the assignment of power plants to 145 minor regions might appear challenging. However, as shown in this study, a different geographical division, such as a division into countries, performs comparably well in terms of classification accuracy. Therefore, the hybrid approach can also be applied in cases when a different territorial division is preferred. The model was trained and tested on a data subset derived from the GlobalData database [24]. It considers characteristics of power plants such as the type of fuel and generator, the active capacity, the commission year and the type of water used for cooling. These characteristics are commonly known/available in the existing (commercial and free) power plant databases. Thus, the proposed hybrid


model can be replicated and used in other studies that have different data sources at their disposal. The hybrid model, based on a minor region decision tree model and a major region decision tree model, was benchmarked against several approaches found in the scientific literature for the assignment of the cooling technology to specific power plants for which the cooling technology is unknown. In particular, the hybrid model was compared to the approaches by Biesheuvel et al. [12], Davies et al. [6], Vassolo and Döll [9] and ECOFYS [14], as well as a decision tree model for each minor region and one for each major region. The hybrid model yields a mean classification accuracy of 75.02%, which is higher than the performance of all alternative approaches covered in this study. In addition, for 103 out of 137 minor regions, the hybrid model yields the highest test set accuracy of all approaches in this study. It is apparent that machine-learning-based models, especially the hybrid model, but also the single minor region or major region decision trees, outperform the more general models on this data set, which are based on shares or the portfolio mix in a region/country.

Acknowledgements The authors would like to thank the Kone Foundation, the Finnish Academy of Science and Letters and the Finnish Strategic Research Council (grant number 313396/MFG40 Manufacturing 4.0) for their support.

Appendix

See Fig. 10 for the extended flowchart of the proposed model. See also Table 3 for information concerning minor regions, countries and major regions, and Table 4 for the classification accuracies of the minor region models.


Fig. 10 Extended flowchart of the hybrid model for cooling technology classification


Table 3 Minor regions, countries and major regions

Minor region name | Countries (contained in minor region) | Major region
'NO' | Norway | Europe
'DK' | Denmark | Europe
'SE' | Sweden | Europe
'FI' | Finland | Europe
'BLT' | Baltic: Estonia, Latvia, Lithuania | Europe
'PL' | Poland | Europe
'IBE' | Iberia: Portugal, Spain, Gibraltar | Europe
'FR' | France, Monaco, Andorra | Europe
'BNL' | Belgium, Netherlands, Luxembourg | Europe
'BRI' | Ireland, Great Britain, Isle of Man, Guernsey, Jersey | Europe
'DE' | Germany | Europe
'CRS' | Czech Republic, Slovakia | Europe
'AUH' | Austria, Hungary | Europe
'BKN-W' | Balkan-West: Slovenia, Croatia, Bosnia & Herzegovina, Serbia, Kosovo, Montenegro, Macedonia, Albania | Europe
'BKN-E' | Balkan-East: Romania, Bulgaria, Greece | Europe
'IT' | Italy, San Marino, Vatican, Malta | Europe
'CH' | Switzerland, Liechtenstein | Europe
'TR' | Turkey, Cyprus | Europe
'UA' | Ukraine, Moldova | Europe
'DZ' | Algeria | MENA
'BHQ' | Bahrain, Qatar | MENA
'EG' | Egypt | MENA
'IR' | Iran | MENA
'IQ' | Iraq | MENA
'IL' | Israel | MENA
'JWG' | Jordan (incl. West Bank & Gaza Strip = State of Palestine) | MENA
'KW' | Kuwait | MENA
'LB' | Lebanon | MENA
'LY' | Libya | MENA
'MA' | Morocco | MENA
'OM' | Oman | MENA
'SA' | Saudi Arabia | MENA
'TN' | Tunisia | MENA
'AE' | United Arab Emirates | MENA
'YE' | Yemen | MENA
'SY' | Syria | MENA
'RU-NW' | Russia North-West | Eurasia
'RU-C' | Russia Center | Eurasia
'RU-S' | Russia South | Eurasia
'RU-V' | Russia Volga region | Eurasia
'RU-U' | Russia Urals | Eurasia
'RU-SI' | Russia Siberia | Eurasia
'RU-FE' | Russia Far East | Eurasia
'BY' | Belarus | Eurasia
'CAU' | Armenia, Azerbaijan, Georgia | Eurasia
'KZ' | Kazakhstan | Eurasia
'PAM' | Tajikistan, Kyrgyzstan | Eurasia
'UZ' | Uzbekistan | Eurasia
'TM' | Turkmenistan | Eurasia
'JP-E' | Japan East | NE Asia
'JP-W' | Japan West | NE Asia
'KR' | South Korea | NE Asia
'KP' | North Korea | NE Asia
'CN-NE' | China North-East | NE Asia
'CN-N' | China North | NE Asia
'CN-E' | China East | NE Asia
'CN-C' | China Central | NE Asia
'CN-S' | China South | NE Asia
'CN-NW' | China North-West | NE Asia
'CN-XU' | China Uighur region | NE Asia
'MN' | Mongolia | NE Asia
'NZ' | New Zealand | SE Asia
'AU-E' | Australia East | SE Asia
'AU-W' | Australia West | SE Asia
'ID-SU' | Indonesia Sumatra | SE Asia
'ID-JV+TL' | Indonesia Java, Timor-Leste | SE Asia
'ID-KL-SW' | Indonesia East | SE Asia
'MY-W+SG' | Malaysia West, Singapore | SE Asia
'MY-E+BN' | Malaysia East, Brunei | SE Asia
'PH' | Philippines | SE Asia
'MM' | Myanmar | SE Asia
'TH' | Thailand | SE Asia
'VN' | Vietnam | SE Asia
'KH' | Cambodia | SE Asia
'IN-E' | India East | India_SAARC
'IN-CE' | India Central-East | India_SAARC
'IN-W' | India West | India_SAARC
'IN-CW' | India Central-West | India_SAARC
'IN-N' | India North | India_SAARC
'IN-NW' | India North-West | India_SAARC
'IN-UP' | India Uttar Pradesh | India_SAARC
'IN-S' | India South | India_SAARC
'IN-CS' | India South Central | India_SAARC
'IN-NE' | India North-East | India_SAARC
'BD' | Bangladesh | India_SAARC
'PK-S' | Pakistan South | India_SAARC
'PK-N' | Pakistan North | India_SAARC
'AF' | Afghanistan | India_SAARC
'LK' | Sri Lanka | India_SAARC
'WW' | Senegal, Gambia, Cape Verde Islands, Guinea Bissau, Guinea, Sierra Leone, Liberia, Mali, Mauritania, Western Sahara | Africa
'WS' | Ghana, Cote D’Ivoire, Benin, Burkina Faso (Upper Volta), Togo | Africa
'NIG-S' | Nigeria South | Africa
'NIG-N' | Nigeria North | Africa
'SER' | Sudan, Eritrea | Africa
'SOMDJ' | Djibouti, Somalia | Africa
'KENUG' | Kenya, Uganda | Africa
'TZRB' | Rwanda, Burundi, Tanzania | Africa
'CAR' | Central African Republic, Cameroon, Equatorial Guinea, Sao Tome and Principe, Congo, Republic of (Brazzaville), Gabon | Africa
'SW' | Angola, Namibia, Botswana | Africa
'ZAFLS' | Republic of South Africa, Lesotho | Africa
'SE' | Malawi, Mozambique, Zambia, Zimbabwe, Swaziland | Africa
'IOCE' | Comoros Islands, Mauritius, Mayotte, Madagascar, Seychelles | Africa
'CAM' | Panama, Costa Rica, Nicaragua, Honduras, El Salvador, Guatemala and Belize | N America
'CO' | Colombia | S America
'VE' | Venezuela, Guyana, French Guiana, Suriname | S America
'EC' | Ecuador | S America
'PE' | Peru | S America
'CSA' | Bolivia, Paraguay | S America
'BR-S' | Brazil South | S America
'BR-SP' | Brazil São Paulo | S America
'BR-SE' | Brazil Southeast | S America
'BR-N' | Brazil North | S America
'BR-NE' | Brazil Northeast | S America
'AR-NE' | Argentina Northeast, Uruguay | S America
'AR-E' | Argentina East | S America
'AR-W' | Argentina West | S America
'CL' | Chile | S America
'CA-W' | Canada West | N America
'CA-E' | Canada East | N America
'US-NENY' | US New England & New York | N America
'US-MA' | US Mid-Atlantics | N America
'US-CAR' | US Carolinas | N America
'US-S' | US Southern | N America
'US-TVA' | US TVA | N America
'US-MW' | US Midwest | N America
'US-C' | US Central | N America
'US-TX' | US Texas | N America
'US-SW' | US Southwest | N America
'US-NW' | US Northwest | N America
'US-CA' | US California | N America
'US-AK' | US Alaska | N America
'US-HI' | US Hawaii | N America
'US-GU' | US Gulf | N America
'MX-NW' | Mexico Northwest | N America
'MX-N' | Mexico North | N America
'MX-C' | Mexico Center | N America
'MX-S' | Mexico South | N America

Table 4 Test set accuracies by minor region

Minor region name | Minor region model | Major region model | Hybrid model | Biesheuvel et al. (2016) | Davies et al. (2013) | Vassolo and Döll (2005) | ECOFYS (2014) | Spang et al. (2014) | Lohrmann et al. (2019) | Average
'NO' | 100±0 | 60±54.8 | 100±0 | 100±0 | 20±44.7 | – | 100±0 | 100±0 | 100±0 | 85.0
'DK' | 88±11 | 58±14.8 | 88±11 | 88±11 | 25±16.6 | – | 88±11 | 88±17.9 | 84±16.7 | 75.9
'SE' | 100±0 | 78.6±21.4 | 100±0 | 100±0 | 25.2±14.9 | – | 100±0 | 100±0 | 100±0 | 88.0
'FI' | 100±0 | 63.9±8.3 | 100±0 | 100±0 | 26.9±9.4 | – | 100±0 | 97.8±5 | 100±0 | 86.1
'BLT' | 71.4±11.9 | 82.1±6.6 | 71.4±11.9 | 51.4±9.4 | 28.6±11.9 | – | 61.8±7.6 | 40.7±12.8 | 61.8±7.6 | 58.7
'PL' | 79.1±8.9 | 79.2±12.7 | 79.1±8.9 | 63.3±5.2 | 59±8.8 | – | 20.9±1.7 | 64.8±8.3 | 74.8±0.4 | 65.0
'IBE' | 64.7±4.2 | 62.8±6.4 | 64.7±4.2 | 52.3±8.7 | 56.2±8.2 | – | 28.8±1.2 | 57.5±10.9 | 70.6±4.3 | 57.2
'FR' | 80.8±6.1 | 82.3±4.4 | 80.8±6.1 | 61.5±8.2 | 38.2±7.4 | – | 57.7±0 | 56.9±13.4 | 60±12.3 | 64.8
'BNL' | 79.3±5 | 45.2±8.9 | 79.3±5 | 61.4±8.3 | 36.8±9.3 | – | 67±2.8 | 51±15.9 | 70.8±3.7 | 61.4
'BRI' | 64.6±4.8 | 64.6±3.5 | 64.6±4.8 | 51.5±4.2 | 37.9±11.7 | – | 43.7±0.5 | 42.2±4 | 51.5±3.6 | 52.6
'DE' | 68.1±3 | 64.4±5.1 | 68.1±3 | 55.2±7.1 | 57±5.7 | – | 24.8±1 | 54.8±6.8 | 66.7±1.3 | 57.4
'CRS' | 86.8±2.5 | 80.6±8.4 | 86.8±2.5 | 72.3±9.1 | 58.5±13.2 | – | 4.1±2.3 | 81.6±4.5 | 86.8±6.6 | 69.7
'AUH' | 45.6±17.7 | 38.9±11.1 | 45.6±17.7 | 36.1±21 | 31.1±9.3 | – | 47.8±5 | 36.4±12.1 | 50±5.6 | 41.4
'BKN-W' | 81.3±4 | 65.4±15.9 | 81.3±4 | 51.5±6.4 | 56±15.9 | – | 43.7±3.5 | 49.9±7.9 | 64.2±11.2 | 61.7
'BKN-E' | 87.5±3 | 82.6±7 | 87.5±3 | 70.7±13.8 | 57.2±3.1 | – | 19.5±2 | 79.9±9.2 | 88.2±3.9 | 71.6
'IT' | 62.9±7.7 | 61.3±9.2 | 62.9±7.7 | 48.8±6.4 | 34.9±4.7 | – | 51.3±1.1 | 49.6±5.8 | 49.6±2.3 | 52.6
'CH' | 70±44.7 | 30±44.7 | 30±44.7 | 30±44.7 | 10±22.4 | – | 70±44.7 | 50±50 | 50±50 | 42.5
'TR' | 75.2±6.6 | 65.5±5.5 | 75.2±6.6 | 48.3±8.8 | 26.9±6.2 | – | 22.1±1.9 | 47.6±9.6 | 63.4±5.8 | 53.0
'UA' | 91.6±6.7 | 78.1±9.6 | 91.6±6.7 | 40.4±12.7 | 32.2±8.4 | – | 59±2.2 | 51.3±10.3 | 62.2±4.8 | 63.3
'DZ' | 74.3±13 | 62.9±16.3 | 74.3±13 | 61.4±10.8 | 12.9±6 | – | 38.6±3.9 | 41.4±18.5 | 48.6±3.2 | 51.8
'BHQ' | 72.9±14.5 | 64±15.8 | 72.9±14.5 | 40.4±4.6 | 31.3±13 | – | 40.4±4.6 | 40.4±9.1 | 36.4±7.4 | 49.9
'EG' | 78.3±9.5 | 70.7±9.8 | 78.3±9.5 | 61.5±2.4 | 14.5±7.3 | – | 70.9±4 | 58.4±9.9 | 70.9±4 | 62.9
'IR' | 76.8±5.6 | 73±4.8 | 76.8±5.6 | 33.1±3.5 | 21.5±6.2 | – | 7.3±1.2 | 40.4±6.1 | 41.6±4.1 | 46.3
'IQ' | 73.2±11.2 | 60.9±12.7 | 71.4±15 | 44.7±9.2 | 34.1±15.2 | – | 58.9±4.6 | 44.5±8.1 | 60.8±7.7 | 56.1
'IL' | 67.3±6.8 | 69.2±12 | 67.3±6.8 | 69.1±7.2 | 36.7±13 | – | 65.6±5.2 | 64.1±16.2 | 56.7±18.3 | 62.0
'JWG' | 92±11 | 59±24.8 | 92±11 | 59±39.4 | 0±0 | – | 0±0 | 92±11 | 92±11 | 60.8
'KW' | 96.2±5.2 | 90.5±9.6 | 96.2±5.2 | 92.5±4.2 | 58.7±9.3 | – | 92.5±4.2 | 83.1±12.5 | 92.5±4.2 | 87.8
'LB' | 100±0 | 53.3±38 | 100±0 | 13.3±29.8 | 0±0 | – | 0±0 | 100±0 | 100±0 | 58.3
'LY' | 87.7±4.7 | 92.7±11.9 | 87.7±4.7 | 72.1±6.2 | 10.5±7.3 | – | 68.5±4.1 | 68.8±17.5 | 64.8±6.6 | 69.1
'MA' | 60±14.1 | 80±24.5 | 64±21.9 | 72±11 | 36±16.7 | – | 64±8.9 | 40±24.5 | 64±8.9 | 60.0
'OM' | 41±4.3 | 55.2±24.4 | 55.2±24.4 | 29±19.8 | 18.1±13.2 | – | 14.8±1.1 | 28.6±22.6 | 47.1±6.4 | 36.1
'SA' | 93.1±3 | 90.7±5.3 | 93.1±3 | 50.4±2.8 | 21.9±5.9 | – | 31.1±0.8 | 35.9±7.9 | 50.9±3.3 | 58.4
'TN' | 66.7±10.2 | 43.3±9.1 | 50±11.8 | 68.3±20.7 | 6.7±14.9 | – | 61.7±11.2 | 45±16.2 | 75±25 | 52.1
'AE' | 72.6±10.3 | 59.3±6.9 | 72.6±10.3 | 75.5±6 | 49.1±15 | – | 75.5±6 | 65.2±6.3 | 75.5±6 | 68.2
'YE' | 80±27.4 | 40±41.8 | 80±27.4 | 30±44.7 | 10±22.4 | – | 20±27.4 | 80±27.4 | 80±27.4 | 52.5
'SY' | 88.1±6.7 | 72.4±24.8 | 88.1±6.7 | 55.2±17 | 41.4±7.8 | – | 29.5±2.1 | 53.3±10.3 | 59±11 | 60.9
'RU-NW' | 79.3±5.3 | 72.5±7.2 | 79.3±5.3 | 43.8±10 | 46.7±5 | – | 46±1.4 | 58.8±11.5 | 70.1±7.5 | 62.1
'RU-C' | 75.8±6.4 | 76.3±5.3 | 75.8±6.4 | 49.4±9.2 | 53.3±7.2 | – | 36±1.7 | 48.4±3.6 | 59.7±5.1 | 59.3
'RU-S' | 98.2±4.1 | 81.1±12.6 | 98.2±4.1 | 64.9±11.5 | 72.5±10.8 | – | 14.7±4.7 | 61.3±6.7 | 81.6±6.1 | 71.6
'RU-V' | 84.3±3.9 | 77.7±3.9 | 84.3±3.9 | 61.9±6.6 | 53.3±3.7 | – | 21.8±1.7 | 59.4±9.3 | 74.6±1.6 | 64.7
'RU-U' | 76.8±5.2 | 79.2±11.5 | 76.8±5.2 | 50.4±8.3 | 44.4±10.8 | – | 64.8±1.8 | 56±9.8 | 64.8±3.3 | 64.2
'RU-SI' | 78.2±3.8 | 75.4±8.5 | 78.2±3.8 | 47.6±7.6 | 47±7.5 | – | 52.4±0.9 | 47.7±8.8 | 57.1±5.5 | 60.4
'RU-FE' | 58.3±23 | 60.6±14.5 | 63.1±23.8 | 53.6±16.3 | 34.7±7.1 | – | 53.9±8.2 | 44.4±15.2 | 61.1±9.4 | 53.7
'BY' | 41.7±21.2 | 73.3±3.7 | 56.7±25.3 | 25±30.6 | 35±28.5 | – | 31.7±10.9 | 36.7±21.7 | 58.3±11.8 | 44.8
'CAU' | 85.6±5.1 | 85.1±12.5 | 85.6±5.1 | 33.8±13.2 | 39.1±11.8 | – | 37.8±7.3 | 43.8±8.2 | 49.8±12.3 | 57.6
'KZ' | 66.1±6 | 49.1±10.4 | 66.1±6 | 37.6±11.2 | 52.3±20.1 | – | 22.1±5.1 | 47.3±10.3 | 62.6±5.8 | 50.4


Major region model

50±35.4 90±10.5 36.7±21.7 86.2±1.9 88.9±2.2 88.7±3 84.8±17.5 78.8±8 77.6±7 80.5±5.6 78.1±4.1 82.3±9.1 75.7±4.9 77.3±15.3 86.7±29.8 46.7±44.7 80.9±7.5 73.1±17.7 78±22.8 83.5±6.5 60±54.8

74.4±9

Minor region name Minor region model

0±0 84.6±5.4 73.3±25.3 89.8±4.6 92.9±3.5 84.7±9 100±0 81.6±12.7 79.3±3 85.6±2.4 82.7±7.7 87.4±4.3 74.7±5.9 72.7±18.3 100±0 80±18.3 83.1±6.2 73.6±19.3 88.7±17.6 81.5±2.5 100±0

87.3±5.9

‘PAM’ ‘UZ’ ‘TM’ ‘JP-E’ ‘JP-W’ ‘KR’ ‘KP’ ‘CN-NE’ ‘CN-N’ ‘CN-E’ ‘CN-C’ ‘CN-S’ ‘CN-NW’ ‘CN-XU’ ‘MN’ ‘NZ’ ‘AU-E’ ‘AU-W’ ‘ID-SU’ ‘ID-JV+TL’ ‘ID-KLSW’ ‘MYW+SG’

Table 4 (continued)

87.3±5.9

50±35.4 84.6±5.4 73.3±25.3 89.8±4.6 92.9±3.5 84.7±9 100±0 81.6±12.7 79.3±3 85.6±2.4 82.7±7.7 87.4±4.3 74.7±5.9 72.7±18.3 100±0 80±18.3 83.1±6.2 73.6±19.3 88.7±17.6 81.5±2.5 100±0

Hybrid model

60.6±2.6

0±0 54.3±18.1 56.7±25.3 84±1 88.1±1.3 83.3±2.4 100±0 55.7±6.9 62.4±5.9 61.4±2.9 65±6.1 65.4±3 65.1±4.8 60.7±20.9 100±0 46.7±38 34.6±8.1 35.1±11.4 66±34.4 84.6±3.4 83.3±15.6 43.6±8.3

0±0 54.3±18.1 26.7±25.3 76.9±2.5 78.3±6.5 42.7±13.8 16.2±16.7 53±13.2 47.8±2.7 34.7±3.9 52.2±11.7 39.9±4.7 51.4±5.4 58±14.8 56.7±25.3 53.3±38 23.2±6.1 20.4±7.1 58±4.5 47.2±9.8 68.3±10.9

Biesheuvel Davies et et al. (2016) al. (2013)



– – – – – – – – – – – – – – – – – – 60.6±2.6

0±0 35.7±4 10±22.4 84±1 88.1±1.3 85.3±1.8 100±0 19.5±2.4 9.9±0.6 65.1±0 21.1±0.2 57.4±0.8 1±2.1 4±8.9 0±0 0±0 23.3±2 10.2±0.5 50.7±13 83.6±3.9 83.3±15.6

Vassolo and ECOFYS Döll (2005) (2014)

62.9±10.1

0±0 56.8±13.2 73.3±25.3 77.3±5.5 80.2±5 80.7±6.8 100±0 60.2±10.7 57.4±4.9 54.9±5 65±5.3 57.4±5.3 73±8.2 68.7±11.9 100±0 53.3±38 47.9±9.3 61.1±5.3 63.3±30.9 77.5±7.2 95±11.2 59.6±5.8

0±0 59.3±9.2 73.3±25.3 84±1 88.1±1.3 85.3±1.8 100±0 76.9±5.6 74.8±0.6 64.2±1.5 79.7±1.2 57.4±0.8 78.7±2.3 80.7±1.5 100±0 80±18.3 50.7±6 71.3±8.7 74±13.4 83.6±3.9 100±0

67.0

12.5 65.0 52.9 84.0 87.2 79.4 87.6 63.4 61.1 66.5 65.8 66.8 61.8 61.8 80.4 55.0 53.3 52.3 70.9 77.9 86.3

Spang et al. Lohrmann Average (2014) et al. (2019)


Major region model

44.7±24.2

88.5±8 50±50 80.3±10 72.7±12.9 100±0 85.3±7.7 92.1±11.4 85±4.8 83.9±11.3 10±22.4 90.9±9.6 88.8±8.1 75.7±10 83.5±7.7 50±35.4 44±14.9 77.5±19.4 73.8±11.4 60±54.8 70±29.8 50±35.4 38.3±16.2

Minor region name Minor region model

70.7±14.8

79.3±8 90±22.4 82.2±5.6 74.5±13.5 0±0 90.2±4.6 86.8±8.9 94.2±3.7 91.1±5.9 0±0 85.7±4.5 91.6±9.2 77±19.3 85.9±3.2 60±22.4 60.4±14.1 91.8±12.6 91.4±7.8 0±0 50±16.7 60±22.4 36.7±24.7

‘MYE+BN’ ‘PH’ ‘MM’ ‘TH’ ‘VN’ ‘KH’ ‘IN-E’ ‘IN-CE’ ‘IN-W’ ‘IN-CW’ ‘IN-N’ ‘IN-NW’ ‘IN-UP’ ‘IN-S’ ‘IN-CS’ ‘IN-NE’ ‘BD’ ‘PK-S’ ‘PK-N’ ‘AF’ ‘LK’ ‘WW’ ‘WS’

Table 4 (continued)

79.3±8 90±22.4 82.2±5.6 74.5±13.5 100±0 90.2±4.6 86.8±8.9 94.2±3.7 91.1±5.9 10±22.4 85.7±4.5 91.6±9.2 77±19.3 85.9±3.2 50±35.4 50.6±11.3 91.8±12.6 91.4±7.8 60±54.8 70±29.8 50±35.4 38.3±16.2

59.3±6

Hybrid model

83.3±7.2 40±41.8 50.6±14.1 72.7±6.4 0±0 70.4±8.4 76.1±9.9 58.3±9.3 77.4±7 0±0 76.5±10 66.4±12.8 43.4±6.5 69.4±16.3 50±35.4 31.8±8.1 75.4±10.1 82.4±6.2 0±0 26.7±14.9 60±22.4 0±0

34±10.1 54.7±17.4 20±27.4 43.5±12.9 69.1±5 0±0 50.5±16.1 32.9±26 48±8.6 47.9±8.2 0±0 39.3±16 45.1±13.2 29.1±10.2 52.9±4.2 40±41.8 35.6±20.5 45.7±34.1 44.3±3.2 0±0 36.7±24.7 20±27.4 43.3±14.9

18±11.9

Biesheuvel Davies et et al. (2016) al. (2013)

– – – – – – – – – – – – – – – – – – – –

– 83.3±7.2 0±0 12.9±2.6 83.6±4.1 0±0 10.7±2.3 5±6.8 8.3±0 13.7±2.1 0±0 7.1±2.6 22.6±3.4 13.2±2.5 2.4±3.2 0±0 39.7±1.8 38.9±6.2 0±0 0±0 26.7±14.9 60±22.4 0±0

34±10.1

Vassolo and ECOFYS Döll (2005) (2014)

66.2±7.9 80±27.4 74.4±8.9 65.5±14.9 0±0 77.9±7.2 70.4±4.7 81.7±10.5 78.3±7.1 0±0 80.7±8 55±14.2 49.4±12.4 81.2±12.1 60±41.8 35.1±12.6 61.1±15.6 79±9.3 0±0 53.3±38 50±50 36.7±7.5

60.7±26.1 71.8±14.7 90±22.4 84.2±1.8 83.6±4.1 0±0 85.2±3.8 81.4±5.9 89.2±2.3 87.9±2.8 0±0 87.8±2.6 74.7±3.6 66.4±10.5 92.9±2.6 30±27.4 49.4±14.7 75.4±10.1 91.4±7.8 0±0 56.7±14.9 30±27.4 61.7±28.6

64±15.2

75.8 57.5 63.8 74.5 25.0 70.1 66.4 69.9 71.4 2.5 69.2 67.0 53.9 69.3 42.5 43.3 69.7 69.2 15.0 48.8 47.5 31.9

48.2

Spang et al. Lohrmann Average (2014) et al. (2019)


78.4±13.9 40±41.8 70±18.3 0±0 56.7±25.3 50±35.4 60±54.8 56.7±25.3 97.4±3.9 93.3±14.9 70±44.7 59±4.2 55.2±19.9 68.5±15.7 35±9.1 56±30.7 30±27.4 55±18.3 20±44.7 56.7±39.7 51.7±31.4 58.7±25 65±26.6

‘NIG-S’ ‘NIG-N’ ‘SER’ ‘SOMDJ’ ‘KENUG’ ‘TZRB’ ‘CAR’ ‘SW’ ‘ZAFLS’ ‘SE’ ‘IOCE’ ‘CAM’ ‘CO’ ‘VE’ ‘EC’ ‘PE’ ‘CSA’ ‘BR-S’ ‘BR-SP’ ‘BR-SE’ ‘BR-N’ ‘BR-NE’ ‘AR-NE’

80.9±16.4 80±27.4 93.3±14.9 0±0 93.3±14.9 30±27.4 0±0 16.7±23.6 95.6±5.3 100±0 70±27.4 58.4±3.7 61.4±14 77.3±11.2 55±38.9 57±13 0±0 86.7±18.3 70±44.7 60±18.1 35±26.6 68.7±8.7 95±11.2

Major region model

Minor region name Minor region model

Table 4 (continued)

78.7±15.8 80±27.4 93.3±14.9 0±0 93.3±14.9 50±35.4 60±54.8 56.7±25.3 95.6±5.3 93.3±14.9 50±35.4 58.4±3.7 61.4±14 77.3±11.2 55±38.9 57±13 30±27.4 86.7±18.3 70±44.7 60±18.1 38.3±37.1 68.7±8.7 95±11.2

Hybrid model 42.9±12.1 50±50 83.3±23.6 0±0 43.3±25.3 0±0 40±54.8 46.7±36.1 51.9±7.5 50±37.3 0±0 55.3±3.1 35.7±18.6 41.8±11.6 60±9.1 27±13 0±0 45±29.8 60±54.8 45±16.2 18.3±17.1 10±14.9 95±11.2

14.7±15 20±27.4 76.7±22.4 0±0 10±22.4 0±0 0±0 26.7±25.3 57±9.3 46.7±13.9 70±27.4 29.1±6.7 25.2±12.2 12.8±4.9 11.7±16.2 8±17.9 0±0 58.3±16.7 0±0 46.7±20.9 25±14.4 17.3±1.5 11.7±16.2

Biesheuvel Davies et et al. (2016) al. (2013) – – – – – – – – – – – 0±0 – – – – – – – – – –

44.7±3.5 0±0 6.7±14.9 0±0 0±0 0±0 40±54.8 10±22.4 2.6±2.4 0±0 0±0 54.1±2.1 39±10.2 27.8±3.2 40±9.1 22±2.7 0±0 16.7±15.6 30±44.7 46.7±20.9 5±11.2 3.3±7.5 95±11.2

Vassolo and ECOFYS Döll (2005) (2014) 45.1±15.5 70±27.4 86.7±18.3 0±0 73.3±25.3 100±0 20±44.7 20±27.4 78.9±8.4 93.3±14.9 70±44.7 41±3.9 64.8±12.5 38.3±13.2 35±9.1 59±23 0±0 60±30.8 30±44.7 66.7±10.2 51.7±20.7 76±25.1 75±14.4

44.7±3.5 80±27.4 93.3±14.9 0±0 93.3±14.9 100±0 0±0 26.7±25.3 88.5±4.2 100±0 80±44.7 53.4±2.8 41.9±8.4 49.4±5.6 55±18.3 70±9.4 0±0 78.3±21.7 70±44.7 83.3±15.6 43.3±14.9 80±21.7 95±11.2

53.8 52.5 75.4 0.0 57.9 41.3 27.5 32.5 70.9 72.1 51.3 45.4 48.1 49.2 43.3 44.5 7.5 60.8 43.8 58.1 33.5 47.8 78.3

Spang et al. Lohrmann Average (2014) et al. (2019)


65.9±9.5

68.5±3.6

72.3±14.2

62.3±7.8

76.4±7.7

69.6±4.3

73.3±7.5

78.6±4.6

78.5±8.3

76.4±11.7

68.3±2.3

62.8±6.9

70.9±5.7

83.6±0.4

72.6±4.3

72.3±5.8

73.3±25.3

54±18.5

78.3±8

96.7±7.5

57.5±14.3

72±6.7

100±0

‘CL’

‘CA-W’

‘CA-E’

‘US-NENY’

‘US-MA’

‘US-CAR’

‘US-S’

‘US-TVA’

‘US-MW’

‘US-C’

‘US-TX’

‘US-SW’

‘US-NW’

‘US-CA’

‘US-AK’

‘US-HI’

‘US-GU’

‘MX-NW’

‘MX-N’

‘MX-C’

‘MX-S’

75±21.3

76±30.7

62.5±35.4

74.7±25.6

71.1±5.4

74±21.6

43.3±43.5

72.3±5.4

63.7±5.2

68.1±11

71.9±6.6

64.6±14.8

67.3±2.1

81.8±6.8

73.5±4.8

75.1±7.1

60.8±12.7

47.7±16.8

68±14.2

45.7±20.6

100±0

72±6.7

57.5±14.3

96.7±7.5

78.3±8

54±18.5

73.3±25.3

72.3±5.8

72.6±4.3

83.6±0.4

70.9±5.7

62.8±6.9

68.3±2.3

76.4±11.7

78.5±8.3

78.6±4.6

73.3±7.5

69.6±4.3

76.4±7.7

62.3±7.8

72.3±14.2

67.1±14.9

76.7±8

58±16

42.5±6.8

72±7.3

46.1±8.2

69±8.2

0±0

43.5±7.6

49±3

77.8±7.3

45.1±5.2

42.7±6.3

26.4±5.6

22.9±7.9

30.5±4.3

29.5±8.2

31.1±6.6

37±5.2

62.8±9.1

27.6±10.7

56.5±5.3

18.1±13.2

65.4±13.8

70±17.1

‘AR-W’

78.9±14.4

78.9±14.4

‘AR-E’

78.9±10.3

Biesheuvel et al. (2016)

Minor region name Minor region Major region Hybrid model model model

Table 4 (continued)

37.2±19.9

32±15.7

57.5±6.8

29.3±11.9

48.9±9.9

50±19

20±44.7

40.6±7.7

43.2±5.9

58.1±10.1

50.8±4.5

38.4±5.4

33.3±3

32.5±6.3

34.8±3.4

31.2±6.4

36.7±2.2

37.1±4.7

53.2±17.4

36.4±16.4

18.7±8

39.5±8.5

13.6±10.1









42.8±15.4

37±22.5

16.7±23.6

38.1±10

40.2±8.9

39.2±9.5

46.3±7

32.4±8.1

35.7±2.7

45.8±10.6

37.3±6.5

36.4±12.7

39.9±4.7

34.9±7.8

43.1±12

34.5±24.2







Davies et al. Vassolo and (2013) Döll (2005)

53.9±9.1

24±2.2

0±0

72±7.3

19.4±0

69±8.2

0±0

23.3±1.6

2±1.1

0.8±1.9

28.8±0.3

14±1.8

29.7±0.4

33.1±1.8

25.8±0.6

19.1±1.3

31.5±0.4

37.5±1.2

73.7±5.1

34.8±3.7

49.3±4

21.4±8.9

70.7±9.5

ECOFYS (2014)

55.8±9

67±21.1

52.5±22.4

75.3±8

60.6±6.9

47±19.9

90±22.4

40.1±5.8

50.5±4.3

76.2±10.1

51.1±3.6

44.5±4

40.5±2.8

41.2±12.7

48.2±5.8

43.4±4.1

44.1±5.5

42±11.3

61.3±7.9

39.3±13.9

53.6±6.2

30.5±20.4

48.9±13.3

Spang et al. (2014)

56.1±13.3

34±23

62.5±8.8

81.3±14.3

66.7±2

69±8.2

100±0

52±3.3

65.7±1.5

82.8±1.6

63.3±1.4

62.2±3.2

43.8±3.8

43.9±6.7

40.9±4.7

48±6.2

45.7±5.3

46.5±4.5

66.9±9.3

30.4±5.9

50.7±6.4

42.4±5.9

67.9±11.2

69.3

54.4

49.1

74.8

56.9

58.1

46.3

50.5

51.1

63.4

55.5

47.2

45.9

50.5

49.8

48.9

49.3

48.9

63.8

41.7

55.2

41.8

62.9

Lohrmann et Average al. (2019)



References

1. Kenny, J.F., Barber, N.L., Hutson, S.S., Linsey, K.S., Lovelace, J.K., Maupin, M.A.: Estimated Use of Water in the United States in 2005. U.S. Geological Survey Circular 1344 (2009)
2. Eurostat: Water use in industry (2014)
3. Macknick, J., Newmark, R., Heath, G., Hallett, K.C.: Operational water consumption and withdrawal factors for electricity generating technologies: a review of existing literature. Environ. Res. Lett. 7(4) (2012)
4. Spang, E.S., Moomaw, W.R., Gallagher, K.S., Kirshen, P.H., Marks, D.H.: The water consumption of energy production: an international comparison. Environ. Res. Lett. 9(10) (2014)
5. Platts: World Electric Power Plants Database (2010)
6. Davies, E.G., Kyle, P., Edmonds, J.A.: An integrated assessment of global and regional water demands for electricity generation to 2095. Adv. Water Res. 52, 296–313 (2013)
7. International Energy Agency (IEA): Energy balances of OECD countries 1960–2008. Technical report, International Energy Agency, Paris (2010)
8. Energy Information Administration (EIA): International energy annual 2006, World electricity installed capacity. Technical report, US Energy Information Administration, Washington (2008)
9. Vassolo, S., Döll, P.: Global-scale gridded estimates of thermoelectric power and manufacturing water use. Water Resour. Res. 41(4), 1–11 (2005)
10. Flörke, M., Kynast, E., Bärlund, I., Eisner, S., Wimmer, F., Alcamo, J.: Domestic and industrial water uses of the past 60 years as a mirror of socio-economic development: a global simulation study. Global Environ. Change 23(1), 144–156 (2013)
11. Global Water System Project (GWSP): Digital Water Atlas (2018)
12. Biesheuvel, A., Witteveen+Bos, Cheng, I., Liu, X., Greenpeace International: Methods and results report: modelling global water demand for coal based power generation. Technical report, Witteveen+Bos & Greenpeace International (2016)
13. Alcamo, J., Flörke, M., Märker, M.: Future long-term changes in global water resources driven by socio-economic and climatic changes. Hydrol. Sci. J. 52, 247–275 (2007)
14. ECOFYS: Pilot project on availability, use and sustainability of water production of nuclear and fossil energy. Geo-localised inventory of water use in cooling processes, assessment of vulnerability and of water use management measures. Technical report, European Commission, Directorate General Environment, Utrecht, Netherlands (2014)
15. Lohrmann, A., Farfan, J., Caldera, U., Lohrmann, C., Breyer, C.: Global scenarios for significant water use reduction in thermal power plants based on cooling water demand estimation using satellite imagery. Nat. Energy 4(12), 1040–1048 (2019)
16. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer Science+Business Media, New York (2006)
17. Gelfand, S.B., Ravishankar, C.S., Delp, E.J.: An iterative growing and pruning algorithm for classification tree design. IEEE Trans. Pattern Anal. Mach. Intell. (1991)
18. Maimon, O., Rokach, L.: Decision trees. In: Data Mining and Knowledge Discovery Handbook, pp. 165–192 (2005)
19. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
20. Podgorelec, V., Kokol, P., Stiglic, B., Rozman, I.: Decision trees: an overview and their use in medicine. J. Med. Syst. 26(5), 445–463 (2002)
21. Ghulam, A., Porton, I., Freeman, K.: Detecting subcanopy invasive plant species in tropical rainforest by integrating optical and microwave (InSAR/PolInSAR) remote sensing data, and a decision tree algorithm. ISPRS J. Photogram. Remote Sens. 88, 174–192 (2014)
22. Yan, S., Adegbule, A., Kibbey, T.C.G.: A boosted decision tree approach to shadow detection in scanning electron microscope (SEM) images for machine vision applications. Ultramicroscopy 197, 122–128 (2019)
23. Syed Nor, S.H., Ismail, S., Yap, B.W.: Personal bankruptcy prediction using decision tree model. J. Econ. Finance Adm. Sci. 24, 157–170 (2019)

A Region-Based Approach for Missing Value Imputation …

125

24. GlobalData Ltd, GlobalData Power (2014) 25. Farfan, J., Breyer, C.: Structural changes of global power generation capacity towards sustainability and the risk of stranded investments supported by a sustainability indicator (2017) 26. Platts, World Electric Power Plants Database (2016) 27. IRENA, Renewable Energy Capacity Statistics 2015 (2015) 28. Loh, W.-Y.: Classification and regression trees. Wiley Interdiscip. Rev.: Data Mining and Knowl. Disc. 1(1), 14–23 (2011) 29. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and regression trees (1984) 30. Clark, P., Niblett, T.: The CN2 induction algorithm. Mach. Learn. 3(4), 261–283 (1989) 31. Quinlan, J.R.: C4.5: Programs for Machine Learning (1992)

A Neural Network Based Multi-class Trading Strategy for the S&P 500 Index

Leo Soukko, Christoph Lohrmann, and Pasi Luukka

Abstract Most studies investigating machine learning methods such as neural networks and their ability to forecast stock markets deploy a binary target (buy, sell). However, trading strategies implemented based on the prediction of a binary target can be vulnerable to misclassification, and their overall return may also be more susceptible to transaction costs. This research work used the well-known S&P 500 index and neural networks to compare three different targets (a binary case and two ternary cases: a ±0.5% threshold and a ±1% threshold) that represent different cut-offs for the buy and sell classes. In addition, different thresholds for the class probabilities are used that reflect the confidence the neural network has in a prediction, in order to decide when a buy or sell decision is made in the trading strategy. The experiments including transaction costs indicated that increasing the confidence threshold on a binary model increases the returns gained. Moreover, stricter cut-offs for the target classes tended to decrease the confidence thresholds required to obtain the best performance of the strategy. Keywords Neural networks · S&P 500 index · Trading strategy · Multi-class problem · Thresholds

L. Soukko (B) · C. Lohrmann · P. Luukka
School of Business and Management, LUT University, Lappeenranta, Finland
C. Lohrmann e-mail: [email protected]
P. Luukka e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
P. Luukka and J. Stoklasa (eds.), Intelligent Systems and Applications in Business and Finance, Studies in Fuzziness and Soft Computing 415, https://doi.org/10.1007/978-3-030-93699-0_6

1 Introduction

Being able to forecast financial markets is of interest for investors, stakeholders, researchers, and governments alike [1]. Especially investors use such forecasts as decision support for making investments [2], by identifying opportunities and challenges in a market [3] and deriving trading strategies [4–6]. Many scientific studies


have focused on American stock indices [7–14], but there are also some studies that cover markets in different countries, such as [4, 15, 16], or that exclusively look at one selected market, e.g., China [17], Taiwan [18], Korea [19, 20], and Turkey [21]. A large share of these studies approaches forecasting of the stock market as a classification problem (e.g., [10, 14, 20]). In particular, neural networks are a common choice in this context [2, 10, 15, 18, 21], but other machine learning and statistical methods such as logistic regression [16, 21], linear discriminant analysis [4, 17], support vector machines [18, 22], and decision trees or random forests [14, 18] are also commonly found in the literature. The vast majority of classification studies use a binary class label that only reflects whether the market is expected to provide an investor with a positive or negative return [14]. However, using such a simple cut-off between positive and negative returns does not account for the expected magnitude of returns, which makes a trading strategy based on such predictions more vulnerable to misclassification [14]. In addition, crucial aspects for the implementation of trading strategies are transaction costs and other factors such as the bid-ask spread, which impact the return a trader can actually earn with a transaction. However, there is only limited research into multi-class predictions for stock markets (e.g., [14]). Thus, this research work will use the well-known S&P 500 index and compare three different target class label types that represent different cut-off points for the classes. In particular, a simple binary strategy (buy, sell) is compared with two threshold-based strategies that create three classes (buy, neutral, sell) using a ±0.5% return threshold and a ±1% return threshold to identify the buy and sell classes.
Moreover, trading strategies for daily returns (intraday, open-to-open, close-to-close) will be implemented based on the neural network predictions, using ETFs that represent long and short exposure to the S&P 500. Additionally, different confidence thresholds for the class probabilities predicted by the neural network are tested to measure the impact of a higher confidence of the class predictions on the performance. The paper is structured as follows: first, the data including the explanatory and target variables are presented; secondly, the feature selection methods that are used to remove irrelevant and redundant features from the data set are described. Subsequently, the results of the feature selection are presented, and the neural network and the corresponding model selection are discussed. Lastly, the classification results and the performances of the trading strategies are discussed, and concluding remarks are presented.
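The confidence-threshold decision rule described above can be sketched as follows. This is an illustrative sketch with an assumed class ordering and an example threshold value, not the authors' implementation:

```python
import numpy as np

def trading_signal(class_probs, classes=("sell", "neutral", "buy"),
                   threshold=0.6):
    # Act only when the predicted probability of the "buy" or "sell"
    # class exceeds the confidence threshold; otherwise stay neutral.
    p = np.asarray(class_probs)
    best = int(np.argmax(p))
    if classes[best] in ("buy", "sell") and p[best] >= threshold:
        return classes[best]
    return "neutral"  # no trade if the network is not confident enough
```

Raising `threshold` trades less often but only on high-confidence predictions, which is the effect studied in the experiments.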

2 Data

2.1 Data Gathering and Feature Engineering

The target for this study is the S&P 500, a well-known American stock market index containing the shares of the 500 largest companies publicly listed in the US. Similar to many previous studies, daily returns will be used [8, 10, 17, 20]. The time period covered in this study is from 06.09.2006 to 09.03.2020, which


is successively divided into 70% training data (from 06.09.2006 to 18.02.2016), 15% validation data (19.02.2016 to 27.02.2018) and 15% test data (28.02.2018 to 09.03.2020) for the trading simulation. Four different returns are used to create the targets: intraday (meaning open-to-close), overnight (meaning close-to-open), open-to-open and close-to-close returns. Initially, three classification targets were created for each of these returns. The first target is binary (sell or buy), while the second and third targets are multi-class. The three multi-class labels are sell, neutral (no action) and buy, and the corresponding return thresholds to create these classes are (−0.5%, 0.5%) in the second case and (−1%, 1%) in the third case. On the one hand, these thresholds are supposed to reduce the possibility of incorrect sell/buy actions and, thus, improve the average return associated with correct predictions. On the other hand, the thresholds are used to discover whether predicting extreme cases is more profitable than predicting a simple binary outcome. In addition, using thresholds may indicate whether distinguishing extreme returns (positive and negative) is easier for the learning algorithms than simply dividing the classes into all positive and all negative returns. The explanatory variables and the target variable of this study are listed in Table 1, where the variables are sorted into market indices (light green), volatility indices (blue), commodities (yellow), exchange rates (purple), technical indicators (orange) and treasury yields and yield spreads (dark green). Overall, 89 variables are initially selected, which is similar to previous studies that commonly used between 40 and 150 variables [8, 10, 14, 21]. Multiple stock market indices are included in this study to provide information about recent economic changes in selected global markets.
These stock market indices are the Hang Seng Index (HSI) [C], Nasdaq [USA], STOXX 50 [EU], Deutscher Aktienindex (DAX) [GER], Dow Jones Industrial Average (DJIA) [USA], Nikkei 225 [JAP], Shanghai Stock Exchange (SSE) [C] and the target index S&P 500 [USA]. The volatility index (VIX) is included to provide an estimate for the expected short-term volatility of the S&P 500 index. It reflects fear in the market as it tends to increase when stock markets fall and are characterized by uncertainty [23]. A gold ETF is incorporated in this study as it can indicate future inflation [24] and be a protection against the impact of uncertainty in the market, since precious metals perform best when stock markets are characterized by high volatility [25]. Another commodity-related ETF included in this study covers crude oil prices, as they can have a considerable impact on the economy and financial markets [26, 27]. Apart from stock market indices and commodity ETFs, multiple exchange rates of the USD to the currencies of some of the main trading partners of the USA are included in this study, since these impact the attractiveness of American products abroad as well as influence the prices of imports for American companies and consumers. In particular, the selected exchange rates are USD/CAD, USD/CNY, USD/JPY, USD/GBP and USD/EUR. Previous studies have included a similar selection of these USD exchange rates [8, 10, 14] and, for instance, Lohrmann and Luukka [14] found that exchange rates were among the most relevant variables in their study on the S&P 500. For each of the stock market indices, commodity ETFs and exchange rates covered in this study, four different variables are calculated: (1) intraday return, (2) overnight


Table 1 Dependent and explanatory variables

Explanatory variables:
- Hang Seng Index, Deutscher Aktienindex (DAX), Shanghai Stock Exchange Index (SSE), Dow Jones Industrial Average Index (DJIA), Nasdaq, Nikkei 225, STOXX 50, VIX: Intraday Return; Overnight Return; Difference of Intraday Return and Overnight Return; Range
- iPath S&P GSCI Crude Oil Total Return Index, SPDR Gold Shares: Intraday Return; Overnight Return; Difference of Intraday Return and Overnight Return; Range
- USD/GBP, USD/JPY, USD/CAD, USD/EUR, USD/CNY: Intraday Return; Overnight Return; Difference of Intraday Return and Overnight Return; Range
- US Treasury Yield (30 years), US Treasury Yield (10 years): Intraday Return; Overnight Return; Difference of Intraday Return and Overnight Return; Range
- US Treasury Yield (13 weeks): Intraday Momentum (1); Overnight Momentum (1); Difference of Intraday Momentum (1) and Overnight Momentum (1); Range
- Treasury Yield Spread: Treasury Yield Spread (30 years–10 years); Treasury Yield Spread (30 years–13 weeks); Treasury Yield Spread (10 years–13 weeks)
- S&P 500: Intraday Return; Overnight Return; Difference of Intraday Return and Overnight Return; Range; Stochastic %K (14); Discrete Stochastic %K (Momentum 1); Discrete Stochastic %K (Signal, 3); MACD (12–26); Discrete MACD (Momentum 1); Discrete MACD (Signal, 9); RSI (14); Discrete RSI (without no-action); Discrete RSI (with no-action)
- Weekdays: Categorical (from 0 to 6)

Dependent variable (target):
- S&P 500: Binary Target (Sell, Buy); ±0.5% Threshold Target (Sell, Neutral, Buy); ±1% Threshold Target (Sell, Neutral, Buy) [for each return class]

return, (3) the difference between these two returns, and (4) the previous day's range (high price − low price). Additional variables include US treasuries with maturities of 13 weeks, ten years and 30 years and the term spreads between them. Spreads are included as they can be useful for indicating economic changes [28]. For the 13-week treasury bills, differences are used instead of returns to avoid the large outliers that can occur for this variable. Weekdays are included as predictors to evaluate whether different weekdays provide meaningful information for forecasting the S&P 500. One known possible effect of weekdays is the so-called weekend effect, according to which stock prices are on average lower on Mondays [29]. In addition to the intraday return, the overnight return, the difference between these two and the daily range (the explanatory variables included for each stock market index in this study), nine technical indicators or discretized versions of technical indicators are generated using the S&P 500. These technical indicators are the Stochastic %K, the relative strength index (RSI) and the moving average convergence divergence (MACD), and two discretized versions are created from each of the indicators. The selection of technical indicators is similar to that in several previous studies [8, 14, 18, 21]. All time series included in this study were obtained free of charge from Yahoo Finance.
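The four variables derived per instrument from daily OHLC data can be sketched as follows. This is an illustrative pandas sketch with hypothetical column and function names, not the authors' code:

```python
import pandas as pd

def engineer_features(ohlc: pd.DataFrame) -> pd.DataFrame:
    # ohlc has columns 'Open', 'High', 'Low', 'Close', one row per day
    out = pd.DataFrame(index=ohlc.index)
    out["intraday"] = ohlc["Close"] / ohlc["Open"] - 1             # open-to-close
    out["overnight"] = ohlc["Open"] / ohlc["Close"].shift(1) - 1   # close-to-open
    out["diff"] = out["intraday"] - out["overnight"]
    out["range"] = (ohlc["High"] - ohlc["Low"]).shift(1)           # previous day's range
    return out
```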

2.2 Missing Values

US treasuries had over 700 missing values due to non-business days, and these missing values were simply dropped. The number of missing values for the other variables was comparably small (up to 28 missing values); these were imputed using linear interpolation. Since some of the time-series variables had different observational


dates, combining the variables into the data set created additional missing values, since not all variables were observed on each observational date. Rows in which the missing value occurred for the dependent variable were removed, since any imputation of the dependent variable should be avoided. Missing values for the variables calculated from stock indices were filled with zeros, because these dates could be holidays and imputing these variables with zero appears suitable. Missing values for the treasury yields were linearly interpolated, as they do not fluctuate in the same manner as stock markets. The four data sets created this way, one for each target return type (intraday, overnight, open-to-open, close-to-close), each have 3399 observations (intraday, overnight, close-to-close) or 3398 (open-to-open), with 89 explanatory variables and three different dependent variables. The reason for the small difference in observations is that the RSI was calculated based on the open-to-close return, which required shifting the data for the open-to-open target by one day.
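The imputation logic described above can be sketched as follows. The column names and the helper function are placeholders of ours, not from the original data set:

```python
import pandas as pd

def clean(df, target_col, index_cols, yield_cols):
    # Never impute the target: drop rows with a missing dependent variable
    df = df.dropna(subset=[target_col]).copy()
    # Index-derived variables: missing dates are likely holidays -> zero return
    df[index_cols] = df[index_cols].fillna(0.0)
    # Treasury yields move smoothly -> linear interpolation
    df[yield_cols] = df[yield_cols].interpolate()
    return df
```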

2.3 Target Variables

This study considers four separate S&P 500 return-based variables to construct the class labels: the intraday (OC), the overnight (CO), the open-to-open (OO) and the close-to-close (CC) return. The behaviour of these returns in the training (red), validation (green) and forecast (blue) data is illustrated in Fig. 1. It is noteworthy that the returns in 2008 are generally more volatile due to the financial crisis than during the remaining time period (with the exception of the overnight return (CO)). Thus, the spikes in the returns are considerably higher in the training than in the validation and forecast periods. However, given that the training period is comparably large (70% of the observations) and covers different market environments, the fact that extraordinary conditions are included in the model training is regarded as an advantage

Fig. 1 Returns of the four S&P 500 return-based target variables


for obtaining a robust model. Since during the training period the overnight returns (CO) are mostly small compared to the intraday returns (OC), it is unsurprising that the intraday return is very similar to the close-to-close and open-to-open returns. Only in periods where the overnight returns fluctuate considerably, as during the financial crisis as well as during the validation and forecast periods, are the open-to-open and close-to-close returns noticeably different from the intraday return. It is remarkable that even during the financial crisis the overnight returns are much lower in magnitude than the corresponding intraday returns. This changes considerably during the validation and forecast periods, where the magnitude of the overnight returns increases even during periods of less intraday volatility. In the training data, the correlations among the OC, CC and OO returns were between 0.98 and 0.995, while in the validation data these correlations fell between 0.88 and 0.95. In the test data, where the overnight returns fluctuate the most, the correlations were between 0.76 and 0.89. These returns thus clearly differ from each other in the validation and test data sets. Each of these four return-based variables is converted into the three sets of target class labels by applying different thresholds. For the binary dependent variable, which only differentiates the “Buy” and “Sell” classes, all positive returns are associated with the “Buy” class and all negative or zero returns are assigned to the “Sell” class. For the second target variable, threshold values for the return of ±0.5% were applied, where the “Buy” class is associated with a return higher than 0.5%, the “Sell” class with a return less than −0.5% and the “Neutral” class with a return of smaller magnitude (between −0.5% and 0.5%). The same logic is applied with the stronger thresholds (±1%) for the third target variable. The distributions of the target variables are displayed in Figs. 2, 3 and 4.
The bars are grouped by target type (binary, middle threshold and extreme threshold) and the first bar in each group represents the classes based on the overnight return (CO), the second bar the intraday return (OC), the third bar the open-to-open return (OO) and the fourth bar the close-to-close return (CC). In the training data the binary targets appear to be evenly distributed and the reason for a higher number of “Sell” observations in the overnight target is that it has more zero-returns, which

Fig. 2 Returns of the four S&P 500 return-based variables (training data)


Fig. 3 Returns of the four S&P 500 return-based variables (validation data)

Fig. 4 Returns of the four S&P 500 return-based variables (test data)

are in general included in the “Sell” class. In the validation and test data sets there seems to be a bit more variation between the different return targets. For the middle threshold (±0.5%), the neutral class is prevailing, especially for the overnight return (CO), which reflects the low magnitude of the returns in this class. For the intraday (OC), open-to-open (OO) and close-to-close (CC) returns, the neutral class in the training data is, with a share of roughly 50%, still the largest class. In the validation and test data sets there are bigger differences between the return classes, and in the validation data there are far fewer non-neutral observations than in the training and test data sets. For the extreme threshold (±1%), the situation is similar to the middle threshold (±0.5%) in that the neutral class is also prevailing, in particular for the overnight return (CO). In the training data, the “Buy” and “Sell” classes for the intraday (OC), open-to-open (OO) and close-to-close (CC) returns still include relatively many observations, considering that the return is plus/minus one percent for a single trading day. In the validation data there are only few non-neutral observations, but in the test data the share of non-neutral observations is again larger. Since the overnight returns (CO) are lacking non-neutral class observations when using the


middle and extreme threshold, they are not considered useful for model training and creating trading strategies. Thus, they are discarded.
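The conversion of daily returns into the three target types, together with the successive 70/15/15 split described in Sect. 2.1, can be sketched as follows. This is our own illustrative code, not the authors' implementation:

```python
import numpy as np

def make_targets(returns, threshold=0.0):
    # threshold=0.0 -> binary target; positive returns are "buy",
    # zero or negative returns are "sell"
    returns = np.asarray(returns)
    if threshold == 0.0:
        return np.where(returns > 0, "buy", "sell")
    # ternary target: +/- threshold (e.g. 0.005 or 0.01) cuts the classes
    labels = np.full(returns.shape, "neutral", dtype=object)
    labels[returns > threshold] = "buy"
    labels[returns < -threshold] = "sell"
    return labels

def chronological_split(X, train=0.7, val=0.15):
    # successive split that preserves time order (no shuffling)
    n = len(X)
    i, j = int(n * train), int(n * (train + val))
    return X[:i], X[i:j], X[j:]
```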

3 Feature Selection

As the dimensionality of data in many machine learning applications is growing, the importance of focusing on features (= variables) that contain relevant information has increased as well [30, 31]. Including irrelevant or redundant features increases the number of dimensions of the data set, which is linked to the “curse of dimensionality”. This phenomenon reflects that data becomes sparser as its dimension increases, making model fitting on such data more susceptible to overfitting [32]. Hence, focusing only on those features that are highly predictive, and thus relevant, is crucial [33]. Decreasing the number of features in a data set is referred to as dimensionality reduction, which can be implemented using feature selection and feature extraction [32, 34]. While feature selection refers to selecting a subset of the original features [35–39], feature extraction (e.g., principal component analysis) creates new features using some form of transformation of the original features [34, 40]. The reduction in the number of features associated with feature selection reduces the computational cost (storage and speed), can improve the generalization ability of learning algorithms trained on the reduced data set, and supports the interpretability and the ability to visualize the results [38, 41, 42]. Depending on the context of the learning task, feature selection can be supervised, semi-supervised or unsupervised [31]. In this paper, supervised feature selection is implemented in the context of classification (supervised learning). Supervised feature selection is in general divided into three types: filter methods (including feature ranking), wrapper methods, and embedded methods [37, 38, 43]. Filter methods are used as pre-processing, focus on the general characteristics of the data to determine the relevance of features, and are independent of any learning algorithm [36, 37, 44].
In contrast to that, wrapper and embedded methods both incorporate a learning algorithm for feature selection. The wrapper method does so by using the learning algorithm only as the evaluation criterion for feature subsets (e.g., via the classification accuracy) [45–47], whereas the embedded method incorporates the selection of features as part of the model training of the learning algorithm (e.g., a classifier) [32, 41, 48]. In particular, the feature ranking methods information gain and Pearson correlation, as well as the feature ranking obtained from a random forest (embedded feature selection), are used in this study. The feature selection methods were only applied to the training data in order to avoid overfitting the test set results to the feature subset. The training data for this step did not have to be normalized, since these feature selection methods are scale invariant (e.g., the correlation scales by the standard deviations of the features). It is noteworthy that in the context of stock market predictions, many studies use some form of dimensionality reduction, such as feature selection, to reduce the number of variables for model training [8–10, 14, 21]. Notably, Tufecki [21] and Liu et al. [9] found that using feature selection to reduce the number of input variables also increased the accuracy of the prediction model.

3.1 Information Gain

Information gain originates from information theory and is a filter method that provides an ordered ranking (= feature ranking) of the features in a data set [49]. It deploys entropy and conditional entropy as measures of impurity. The entropy of the classes $c = (c_1, c_2, \ldots, c_k)$ is computed as

$$H(c) = -\sum_{i=1}^{k} p(c_i)\log_2 p(c_i) \qquad (1)$$

where $p(c_i)$ is the probability of class $i$, and $k$ is the number of classes [50, 51]. The conditional entropy of the classes given a feature's values $t = (t_1, t_2, \ldots, t_n)$ is [32, 51]

$$H(c \mid t) = -\sum_{j=1}^{n}\sum_{i=1}^{k} p(t_j)\, p(c_i \mid t_j)\log_2 p(c_i \mid t_j) \qquad (2)$$

where $p(t_j)$ is the probability that $t$ takes the value $t_j$, $n$ is the number of discrete values of feature $t$, and $p(c_i \mid t_j)$ is the conditional probability that observations with feature $t$ taking the value $t_j$ belong to class $c_i$ [51]. The information gain $IG(c; t)$ is then calculated as the difference between the (unconditional) entropy $H(c)$ and the conditional entropy $H(c \mid t)$ as [51]

$$IG(c; t) = -\sum_{i=1}^{k} p(c_i)\log_2 p(c_i) + \sum_{j=1}^{n}\sum_{i=1}^{k} p(t_j)\, p(c_i \mid t_j)\log_2 p(c_i \mid t_j) \qquad (3)$$

Thus, the information gain represents how much the entropy of the class labels (dependent variable) decreases with the knowledge of the value feature $t$ takes. It is noteworthy that the information gain is often used in decision tree classifiers as the selection criterion for the feature to conduct a split on [52]. It is apparent from Eqs. (1)–(3) that feature $t$ is assumed to be discrete and not continuous. To use the information gain with continuous features, the feature's values can be binned, meaning that multiple bins are created using different cut-off points to discretize the feature. Using such a binning approach, two different information gains are computed for continuous variables in this study. The first uses four discrete bins with three cut-offs at the quartiles to create the partitions $[Min, Q_1)$, $[Q_1, Q_2)$, $[Q_2, Q_3)$ and $[Q_3, Max]$, where $Q_1, Q_2, Q_3$ are the quartiles of the distribution. The second approach uses 20 discrete bins with 5-percentile steps, $[Min, P_5)$, $[P_5, P_{10})$, $\ldots$, $[P_{95}, Max]$, where $P_5, P_{10}, \ldots, P_{95}$ are the percentiles used. Quantiles are used instead of dividing the variables into equally distant bins because equally distant bins are vulnerable to outliers. However, the equal-distance binning approach had to be used for all four variables created from the 13-week treasury bill, as well as the USD/CNY range, the ten-year US treasury yield range and the 30-year US treasury yield range, because some of the quantiles that were supposed to serve as cut-offs in these time series had identical values (due to very small variation). For a given number of bins $B$, the equal-distance binning approach uses a step size of $S = (Max - Min)/B$ so that the bins are $[Min, Min + S)$, $[Min + S, Min + 2S)$, $\ldots$, $[Min + (B-1)S, Max]$.
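Eqs. (1)–(3) with quantile binning can be sketched in Python as follows. This is an illustrative implementation (our own naming), not the authors' code:

```python
import numpy as np
import pandas as pd

def entropy(labels):
    # Shannon entropy of an integer label array, Eq. (1)
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels, n_bins=4):
    # Discretize the continuous feature into quantile bins, then compute
    # IG(c; t) = H(c) - H(c|t), Eqs. (1)-(3)
    bins = pd.qcut(feature, q=n_bins, labels=False, duplicates="drop")
    h_c = entropy(labels)
    h_c_given_t = 0.0
    for b in np.unique(bins):
        mask = bins == b
        # p(t_j) * H(c | t = t_j), accumulated over all bins
        h_c_given_t += mask.mean() * entropy(labels[mask])
    return h_c - h_c_given_t
```

For the equal-distance fallback mentioned above, `pd.cut(feature, bins=n_bins, labels=False)` could replace the `pd.qcut` call.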

3.2 Pearson Correlation

The Pearson correlation coefficient is a simple filter method for feature ranking that measures the linear association between two features to detect linear dependency between variables [43, 53]. In the context of supervised feature selection, it is usually measured between an explanatory variable and the class labels (dependent variable). The formula for the Pearson correlation of two variables $x$ and $y$ can be stated as [53]

$$R(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \qquad (4)$$

where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$, and $n$ is the number of observations. The numerator represents the covariance between $x$ and $y$, which measures to what extent the variables move in the same direction from their means. The denominator contains the product of the variables' standard deviations and is used to scale the covariance in the numerator to the interval $[-1, 1]$.
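Ranking features by the absolute value of Eq. (4) against the class labels can be sketched as follows (an illustrative sketch with our own naming):

```python
import numpy as np

def pearson_rank(X, y):
    # X is (n_samples, n_features), y is (n_samples,); returns the feature
    # order by decreasing |R| and the correlations themselves, Eq. (4)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc                                   # numerator of Eq. (4)
    denom = np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    r = cov / denom
    return np.argsort(-np.abs(r)), r
```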

3.3 Random Forest Feature Importance A random forest [54] is an ensemble classifier. In particular, it consists of multiple decision tree classifiers as base learners, who are trained on different (random) partitions of the features [55]. Each decision tree generates a separate vote for the class label, which are aggregated to assign each observation to one class [55, 56]. The decision tree classifier starts at the root node and then generates consecutive if–then rules using selected features to generate partitions of the feature space until the socalled leaves of the tree are reached [55, 57] that each represent a certain class label. The rule at each so-called node is part of the set of rules that form one of the paths

138

L. Soukko et al.

from the root node to a leaf (terminal node) that represent the predicted class for the observations following that path [57]. These rules are generated using some information metric, such as information gain [52], to conduct the split at a node into new child nodes using the feature that is “best” for a split according to this metric. A decision tree calculates the information metric of each split by comparing how much the weighted impurity decreases in this step after the split is done. The information metric calculated in this study is entropy (Eq. 1) but other measures do exist. The formula for calculating the change in the entropy caused by a binary split is:     Hti ,n j = p n j ∗ Hn j (c) − p n le f t( j,ti )   ∗ Hnle f t ( j,ti ) (c) − p n right( j,ti ) ∗ Hnright ( j,ti ) (c)

(5)

  where p n j is the probability of being at node j, and Hn j (c) is the entropy for the classes at node n j . The two child nodes created by the binary split at node n j are denoted n le f t( j,ti ) and n right( j,ti ) . The notation reflects that these child nodes are obtained when the node n j is split using the value feature value ti of the t-th feature. The entropy values of the classes at the child nodes created via the split are Hnle f t ( j,ti ) (c) and Hnright ( j,ti ) (c). For each feature t and tree, the change in the entropy H ti ,n j of all splits that feature was used in is then summed and divided by the sum of the change in entropy of all splits. This value represents the relative importance of that feature for all splits in the decision tree and is within [0, 1]. For the random forest this relative feature importance of a feature in each tree is then averaged over all decision trees and, once again, normalized by dividing it by the sum of all averaged relative feature importance values of all features. In this study, the random forest feature importance, which for a feature t can be denoted R F I t , is additionally adjusted for correlation. The reason for this adjustment is that highly correlated features (which may be redundant) may all end up with small importance values since decision trees may for a split almost randomly select among these redundant or close to redundant features—making each of them appear less important within the decision tree overall. However, removing redundant variables is clearly not desirable in case each of these features by themselves is relevant. To avoid removing features that are redundant but by themselves relevant, the correlationadjusted random forest importance for a feature t, denoted c R F I t , is calculated as the random forest feature importance multiplied by the sum of the absolute correlation values of feature t with all other features.  |R(x, t)| (6) cRF It = RF It ∗ x∈X,x=t

where R(x, t) is the Pearson correlation of feature t with feature x, and the sum runs over the absolute correlations of feature t with all other features contained in the data, with the exception of t itself.
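As an illustration, Eq. (6) can be sketched in a few lines of NumPy (the function name and example data are our own, not from the study's code):

```python
import numpy as np

def corr_adjusted_importance(rfi, X):
    """Sketch of Eq. (6): multiply each random forest importance RFI_t by the
    sum of feature t's absolute Pearson correlations with all other features."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # |R(x, t)| for every feature pair
    np.fill_diagonal(corr, 0.0)                  # exclude x = t from the sum
    return rfi * corr.sum(axis=0)

# Two perfectly correlated columns and one weakly correlated column:
X = np.array([[1.0, 1.0, 0.0], [2.0, 2.0, 1.0], [3.0, 3.0, 0.0], [4.0, 4.0, 1.0]])
crfi = corr_adjusted_importance(np.array([0.1, 0.1, 0.1]), X)
```

With equal base importances, the two redundant columns receive a larger adjusted importance than the weakly correlated one, which is exactly the compensation effect described above.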

A Neural Network Based Multi-class Trading …

139

The random forest is implemented using Python's sklearn package with 250 decision trees, a minimum sample size per leaf equal to five percent of the training data, and information gain as the splitting criterion.
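The entropy decrease of Eq. (5) can be sketched in plain Python (a simplified version in which node probabilities are taken relative to the parent node; the helper names are ours):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def entropy_decrease(parent, left, right):
    """Sketch of Eq. (5): entropy of the parent node minus the
    sample-fraction-weighted entropies of the two child nodes."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A split that separates the classes perfectly yields the maximum decrease (1 bit for a balanced two-class parent), while a split that leaves both children with the parent's class mixture yields a decrease of zero.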

3.4 Feature Importance Results

The feature importance results for the information gain and the correlation-adjusted random forest importance are displayed in the Appendix in Tables 5, 6, 7 and 8. For the intraday return (OC) (see Table 5 in the appendix), the results for the three target classes have in common that the Nasdaq (CO) overnight return is the most important feature according to both information measures. It is also noteworthy that two returns of the STOXX 50 (CO and CO-OC) are consistently within the top 5 features for all target variables. The same holds for the VIX volatility index, for which either the overnight return or the range is within the top 5. Interestingly, the features ranked highest for the prediction of the intraday returns are the overnight returns, which seems intuitive since they represent the most recent information available at the time of the prediction. For the different target variables (binary, ±0.5 and ±1%), there are some interesting differences. Whereas the top 10 features for the simple binary target include two features representing treasury bonds, those for the ±0.5 and ±1% targets instead include technical indicators. This may highlight that technical indicators are relevant for predicting returns that are larger in magnitude.

For the open-to-open return (OO) (see Table 6 in the appendix), the results are very similar. Nasdaq (CO) appears to be the most important feature, and the STOXX 50 (CO and CO-OC) as well as the VIX (CO and Range) generally appear within the top 5 features. Once again, only the binary target includes treasury features, whereas for the two threshold-based targets technical indicators (MACD, Stochastic %K) are included in the top 10 features.

For the close-to-close return (CC) (see Table 7 in the appendix), the results differ considerably from those of the intraday and open-to-open (OO) returns. In particular, none of the variables seemed to be useful in predicting the close-to-close binary return target: the information gain values were about ten times smaller for the close-to-close binary target than for the open-to-close and open-to-open binary targets. Moreover, the rankings for information gain (IG) and the correlation-adjusted random forest importance (cRFI) seem to be quite different. In contrast, for the two threshold-based classes, technical indicators (Stochastic %K, RSI and MACD) as well as the VIX now form the top 5 features. The STOXX 50, the domestic indices (S&P 500, DJIA) as well as exchange rates (especially USD/CAD) also appear repeatedly within the top 10. But as for the binary target, the threshold-based classes also have fewer relevant variables than the same targets formed from OO and OC returns.

For the overnight return (CO), only the binary target was investigated, since the threshold-based targets lead to extremely imbalanced classes (see Figs. 2, 3 and 4). The feature importance scores for the overnight returns (see Table 8 in the appendix) indicate that spreads between treasury yields (long to moderate and long to short duration, e.g., 30y–10y, 30y–13w)


are the most important features. Both IG and cRFI agree that domestic indices (S&P 500, DJIA, Nasdaq) also belong to the top 10 and even top 5 features.

Overall, from the perspective of the American S&P 500 index, the domestic stock index Nasdaq, the foreign index STOXX 50 as well as the VIX volatility index seem to be the most relevant variables for the intraday, open-to-open and close-to-close returns. Moreover, it is noteworthy that for each return type, the most recent information prior to the time window of that return type is the most common feature type (e.g., CO before OO; OC before CC). Commodities seem to play a minor or no role in the top 10 features (likely contained in the top 10 only due to randomness). For the threshold-based targets, technical indicators appeared to be relevant, whereas for the binary target (including for the overnight return) treasury yields and spreads were relevant. It is noteworthy that for the binary target of the overnight return (CO) only treasury spreads were relevant, whereas for the binary target of the intraday and open-to-open return only the treasury yields themselves were of importance. Finally, it is remarkable that IG and cRFI appear to have similar rankings, especially for the intraday and open-to-open return, but show many disagreements for the close-to-close returns, indicating a potential difficulty of ranking features for this return type. In particular, the feature importance values, especially for the binary target of the close-to-close return, were much lower than for the intraday and open-to-open returns. Therefore, the decision was made to focus on the prediction of the remaining three return time series (intraday, open-to-open, overnight) and discard the close-to-close return targets. For the final selection of features for each of the three return types, Pearson correlation is used.
In particular, the cRFI is the correlation-adjusted random forest importance, which uses Pearson correlation to give higher weights to redundant features, since the random forest feature importance may understate their importance even if they are relevant by themselves. However, of the features that are highly relevant according to IG and cRFI, some may be highly correlated with each other, which means that they provide no or limited additional information for making predictions while increasing the computational complexity of the model training. Of those variables that were among the most important ones according to IG and cRFI, only three pairs of features have correlations exceeding 0.9. These are the treasury yield 30y–13w spread and the treasury yield 10y–13w spread (0.98), the S&P 500 range and the DJIA range (0.97), as well as the treasury yield 10y and the treasury yield 30y (0.92). Thus, the DJIA range, the treasury yield 30y–13w spread, and the treasury yield 30y were removed from the data set. It was confirmed with the "regular" RFI, which is not correlation-adjusted, that the feature importance values after the removal of these three features remained consistent and that the same features tended to be important.

Figure 5 reflects the final selection of features for the three return types. The selection for the intraday (OC) and open-to-open (OO) returns is the same, whereas for the overnight (CO) return only the binary target class is covered, and the selected features are distinctly different from those of the remaining two return types. For the binary target, ten features were selected for the intraday and open-to-open returns, which mainly comprise market indices (Dow, Nasdaq, S&P 500, STOXX 50), including the VIX, as well as one commodity (Oil) and a treasury yield (10y). In contrast to
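The correlation-based pruning described above (dropping one feature of each pair whose correlation exceeds 0.9) can be sketched as follows; the greedy keep-first rule and the names are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

def drop_redundant(names, X, threshold=0.9):
    """Greedily drop one feature of each pair whose absolute Pearson
    correlation exceeds the threshold, keeping the earlier-listed feature."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for i, name in enumerate(names):
        if all(corr[i, names.index(k)] <= threshold for k in keep):
            keep.append(name)
    return keep

# Column "b" duplicates column "a" exactly, so one of the pair is dropped:
X = np.array([[1.0, 1.0, 0.0], [2.0, 2.0, 1.0], [3.0, 3.0, 0.0], [4.0, 4.0, 1.0]])
kept = drop_redundant(["a", "b", "c"], X)
```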


Fig. 5 Final selection of features

that, for the overnight return (CO), only five features were selected, and these include several treasury spreads and yields as well as two market indices (Nasdaq, S&P 500). It is noteworthy that the four highest-ranking features according to the information gain (and the correlation-adjusted random forest feature importance) were selected (excluding the highly correlated treasury yield 30y–13w spread), as well as the treasury bill 13w range. The treasury bill 13w range was included in the top 10 features but was only selected since (1) the equal-distance binning that needed to be used with this variable made it vulnerable to outliers, which may be the reason for the low information gain, and (2) the random forest feature importance (excluding the correlation adjustment) indicated that it may be a relevant variable. The features selected for the two threshold-based targets are quite similar to those of the binary target for the intraday and open-to-open return. However, they also include three technical indicators (RSI, MACD, and Stochastic %K), the STOXX 50 and the VIX range. Apart from that, the ±1% threshold additionally includes an exchange rate (USD/CAD range) and the S&P 500 range. The selected features stood out from the irrelevant variables in at least one feature selection metric and did not have very high correlations with the other chosen features.

4 Model Performance Evaluation

4.1 Artificial Neural Networks

Artificial neural networks (ANN) are inspired by biological neurons and are a commonly used classification algorithm. Within each so-called artificial neuron, the variables x_1, x_2, ..., x_D are multiplied by weights (a linear combination) and a bias is added before the corresponding sum is inputted into a nonlinear activation function


φ. The output of a neuron j, denoted z_j, is

  z_j = φ( Σ_{i=1}^{D} w^{(l)}_{j,i} x_i + w^{(l)}_{j,0} )   (7)

where w^{(l)}_{j,i} is the weight of variable x_i for neuron j in layer l, and w^{(l)}_{j,0} is the corresponding bias in the same layer [57]. Neural networks have multiple layers: the input layer, often just the initial input variables; one or more hidden layers; and an output layer that provides the probabilities for the class predictions [55, 57]. For neurons in subsequent layers, the inputs are not the variables x_1, x_2, ..., x_D but the activations z_1, z_2, ..., z_M from the previous layer of neurons. The process of going through the network, calculating the outputs of neurons and feeding them into the subsequent layer, is referred to as the feedforward phase [58]. Training the neural network means minimizing the loss function L, which is a function of the weights w, by learning and adjusting these weights. This process, referred to as backpropagation, uses the gradient of the loss function with respect to the weights to update the weights in the network:

  w^{t+1} = w^t − η ∇L(w^t)   (8)

where w^t are the weights at the beginning of the step, ∇L(w^t) is the gradient of the loss function with respect to these weights, η is the learning rate that controls the size of the weight update, and w^{t+1} are the weights after the update [57]. It is noteworthy that the weights are updated by going against the gradient, since the negative gradient points in the direction of the (local) minimum. In this paper, the adaptive moment estimation (Adam) algorithm is used to implement the stochastic optimization of the weights in the network. The loss functions minimized during the training process for the binary and threshold-based targets are the binary and categorical cross-entropies, respectively. The binary and categorical cross-entropy loss functions are common choices for neural network classification problems and were therefore also used in this study. The authors acknowledge that the selection of the loss function may also impact the parameters of the trained neural network.
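Eqs. (7) and (8) can be illustrated with a minimal NumPy sketch; a toy squared-error loss and plain gradient descent are used here for brevity, whereas the study itself uses Adam and cross-entropy losses:

```python
import numpy as np

def neuron_output(x, w, b, phi=np.tanh):
    """Eq. (7): activation applied to the weighted input sum plus the bias."""
    return phi(w @ x + b)

# Eq. (8), one step on the toy loss L(w) = (w.x - y)^2:
x = np.array([1.0, 2.0])
y = 0.5
w = np.array([0.1, -0.3])
eta = 0.05                       # learning rate
grad = 2 * (w @ x - y) * x       # gradient of L with respect to w
w_new = w - eta * grad           # w^{t+1} = w^t - eta * grad L(w^t)
```

A single step against the gradient reduces the loss, which is the behaviour Eq. (8) formalizes.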

4.2 Model Evaluation

The comparison of the neural networks in this study is based on binary and categorical cross-entropy. The area under the ROC curve (AUC) is used to compare the models with binary targets. Since multiple different thresholds are used for the trading simulations in this study, AUC is the more suitable measure. It represents the probability that a classifier will assign a randomly chosen positive instance a higher probability of belonging


to the positive class than a randomly chosen negative instance [59]. For the multiclass models, the area under the precision-recall curve is used instead of AUC as an additional measure. It works similarly to the area under the ROC curve, but its x-axis is recall, its y-axis is precision, and the interpolation between points is non-linear.
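The ranking interpretation of AUC described above can be computed directly; this is an O(n·m) sketch for illustration, not the implementation used in the study:

```python
def auc_by_ranking(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive instance
    receives a higher score than a randomly chosen negative one (ties 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

A perfect ranking yields 1.0, and a classifier that scores both classes identically yields 0.5.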

4.3 Model Selection

The model selection is conducted for the intraday (OC) and open-to-open (OO) returns with the three targets (binary, ±0.5, ±1%) and for the overnight (CO) return with the binary target. The parameters included in the model selection for the neural network are the number of hidden layers, the number of neurons in each layer, and the class weights (threshold-based classes only). Class weights larger than one are tested for the threshold-based classes due to the imbalance of the buy and sell classes compared to the larger neutral class. The ranges for the neural network parameters are displayed in Table 2. The parameter setup also includes the activation function (see Table 3). Three activation functions, namely Tanh, the Rectified Linear Unit (ReLU) and the Exponential Linear Unit (ELU), are included in this study. Tanh is selected since it is less vulnerable to vanishing gradients than the commonly used sigmoid activation function and has a mean of zero instead of 0.5 [60]. ReLU creates a sparser network, since negative inputs are set to zero, which can improve the robustness of the neural network [60]. ELU is similar to ReLU but has a mean of zero [61]. It is noteworthy that the selected activation function is used for all neurons in the network (with the exception of the output neuron). Moreover, the choice of the activation function is linked to the weight and bias initialization, batch normalization and the type of regularization. The selected combinations for ReLU are based on findings in Glorot et al. [60] and He et al. [62], those for ELU on findings in Clevert et al. [61], and those for Tanh on findings in Glorot and Bengio [63].

Table 2 Neural network architectures and class weights

Target           Layers   First layer   Second layer   Third layer   Class weights
Binary (OC/OO)   1–3      10–16         6–10           3–6           1
Binary (CO)      1–2      5–10          3–5            0             1
±0.5% (OC/OO)    1–3      15–20         6–10           4–6           1–2
±1% (OC/OO)      1–3      17–20         6–10           4–6           1–5

Table 3 Activation function parameter differences

Activation   Weight initialization      Batch norm   Regularization   Bias init
ReLU         He et al. (2015)           Yes          L1               0.01
Tanh         Glorot & Bengio (2011)     Yes          L1 + L2          0
ELU          He et al. (2015)           No           L1 + L2          0

For each of the seven targets, 100 neural networks with a randomly selected parameter setup are trained and evaluated. The learning rate (logarithmic distribution within [0.001, 0.05]), the dropout percentage (uniform distribution within [20, 50]) and the regularization parameter (uniform distribution within [0.001, 0.05]) are randomly chosen after the activation function is selected. The model training (150 epochs) is conducted using the training data, while the model selection is based on the validation data; the independent test data for the simulation is involved neither in the training nor in the model selection process. The selected models for each of the seven targets that will be used for the trading simulation (test data) are displayed in Table 4. These models were selected based on cross-entropy and AUC for binary targets, and on cross-entropy and the area under the PR curve for multiclass targets. The progress over the epochs was also considered: a model with high scores in only one epoch was not chosen, because this could be a coincidence. Instead, the chosen models had high, but not necessarily the highest, scores and had a decreasing cross-entropy for at least some iterations during training. It is noteworthy that for the threshold-based targets, where the neutral class represents the vast majority of observations, the models tended to predict the neutral class more often than the buy and sell classes. However, the selected models for the ±1% threshold target, for which the class imbalance is most notable, have class weights of four or five for the buying/selling classes, meaning that they are tuned to make more non-neutral predictions.

Table 4 Selected models for the trading simulation

Target      Activation   Dropout   Learning rate   Regularization   #L1 neurons   #L2 neurons   #L3 neurons   Weights
Binary OC   ReLU         25.92     3.19            3.49             16            0             0             1
Binary OO   ReLU         45.03     0.54            3.23             15            9             0             1
Binary CO   ReLU         30.94     0.18            3.52             9             4             0             1
±0.5% OC    ReLU         20.54     0.16            4.5              18            7             0             1
±0.5% OO    Tanh         24.16     4.31            0.51             15            7             6             1
±1% OC      ReLU         20.33     0.12            4.06             20            6             0             5
±1% OO      ReLU         29.84     1.17            0.83             20            9             0             4
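The random search over the parameter ranges above can be sketched as follows; the training and evaluation of each candidate, as well as the architecture parameters of Table 2, are omitted:

```python
import math
import random

def sample_setup(rng):
    """One randomly drawn parameter setup following the distributions in the
    text: log-uniform learning rate, uniform dropout % and regularization."""
    return {
        "activation": rng.choice(["tanh", "relu", "elu"]),
        "learning_rate": 10 ** rng.uniform(math.log10(0.001), math.log10(0.05)),
        "dropout_pct": rng.uniform(20, 50),
        "regularization": rng.uniform(0.001, 0.05),
    }

rng = random.Random(0)                            # fixed seed for repeatability
setups = [sample_setup(rng) for _ in range(100)]  # 100 candidate networks per target
```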


5 Results

5.1 Trading Strategies and Threshold

The starting point for the trading strategy using the trained neural networks is a hypothetical investor invested in the S&P 500 index with 100 units of capital. In particular, being long the S&P 500, i.e., owning the index, is represented by holding the SPDR S&P 500 ETF (SPY), whereas being short the S&P 500 is achieved by owning the ProShares Short S&P 500 (SH) ETF. These ETFs are a simple and practical way to replicate the index returns. In this simulation, a buy decision is reflected by holding the SPY and achieving the SPY return (S&P 500 return), whereas a sell decision means holding the SH and achieving the SH return (inverse S&P 500 return). The trading strategy is presented in Fig. 6 in terms of the neural network prediction and the current status (owning SPY or SH) at the time the prediction is made. In order to integrate the confidence in predictions, a confidence threshold value t for the prediction probability p is introduced. For the binary target (with a single output neuron representing the probability to buy ∈ [0, 1]), this means that only a probability p larger than the confidence threshold t will lead to a "buy" decision, and a "sell" decision is only implemented when p < (1 − t). Thus, even with the binary target class, a "neutral" prediction is possible when the probability satisfies (1 − t) ≤ p ≤ t. For the threshold-based targets, which have a dedicated neutral class, no action is performed when either (1) the "neutral" class has the highest probability (larger than the confidence threshold t) or (2) the probabilities of all classes (three output neurons providing the probability of being in the "buy", "neutral", and "sell" class, respectively) are below the confidence threshold t. The results for all return and target combinations retained from the previous analyses are displayed in Fig. 7.
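The decision rules above can be sketched in Python; the function names are ours, and the multiclass rule assumes the acting class must both rank highest and exceed t:

```python
def decide_binary(p_buy, t):
    """Binary target: act only when the buy probability is confidently
    above t or confidently below 1 - t (assuming t > 0.5)."""
    if p_buy > t:
        return "buy"
    if p_buy < 1 - t:
        return "sell"
    return "neutral"                    # (1 - t) <= p_buy <= t

def decide_threshold(probs, t):
    """Threshold-based targets: act only when a non-neutral class has the
    highest probability and that probability exceeds t."""
    label = max(probs, key=probs.get)
    if label == "neutral" or probs[label] <= t:
        return "neutral"
    return label
```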
The largest figure is for the hybrid model, which uses the binary intraday model for intraday trading and the binary overnight model for overnight trading. The overnight return model was not used alone because it did not learn generalizable weights from the training data. It was instead combined with the intraday model to see whether it would improve the intraday model, which suffers in the trading simulation from the fact that it predicts intraday returns but gains open-to-open returns. All metrics were calculated considering only non-neutral predictions; for example, accuracy is the number of correct buy and sell predictions divided by all buy and sell predictions. It is intuitive that the higher the probability threshold, the fewer non-neutral predictions are made by each of the models. At some point, the probability threshold for a non-neutral prediction is so high that no non-neutral predictions are made anymore and, thus, a buy-and-hold strategy would be implemented. This is the reason for not displaying the values of certain or all metrics beyond a certain magnitude of the probability threshold. Some metrics are not displayed at earlier thresholds than others due to zero division in the calculation of the metric. Most models have a much higher TPR than TNR, which implies that the models predict buying more often than selling. Intraday return models perform better when the probability threshold is increased but, overall, the model performance is rather poor. With high probability thresholds, the models often do not make any selling predictions. Multiclass target models make fewer non-neutral predictions than binary target models, but the model predicting the ±1% target for intraday returns makes more non-neutral predictions than the model predicting the ±0.5% target because the ±1% model had high class weights. It also needs to be noted that the intraday model metrics are calculated from intraday returns, but the return gained in the simulation is actually the open-to-open return, as it would be in real life.

Fig. 6 Trading strategy based on prediction and status of holdings

Fig. 7 Evaluation metrics by confidence threshold for all returns and thresholds: (a) intraday (OC) returns with binary target; (b) open-to-open (OO) returns with binary target; (c) intraday (OC) returns with ±0.5% threshold target; (d) open-to-open (OO) returns with ±0.5% threshold target; (e) intraday (OC) returns with ±1% threshold target; (f) open-to-open (OO) returns with ±1% threshold target; (g) hybrid (intraday (OC) and close-to-open (CO) returns) with binary target

5.2 Performance

For the implementation of the trading strategy in this study, transaction costs of 0.2% of the transaction value per transaction are assumed. The transaction costs are generally incurred twice because, in the simulation, when the SPY is bought the SH needs to be sold first, and vice versa. The results for the trading strategy with all confidence thresholds on the intraday returns are displayed in Fig. 8. It is apparent that the strategy clearly underperforms the buy & hold (B&H) strategy with confidence thresholds of less than 76%. The performance of thresholds exceeding 82% (such as 86 and 91% in the figure) is essentially identical to that of the B&H strategy until the onset of the Covid pandemic (Feb. 2020). One contributing factor is that the high confidence thresholds lead to few non-neutral predictions and, thus, few transactions. However, the performance of these confidence thresholds, especially the 91% threshold, is clearly higher during the Covid interval due to the use of short selling (buying the SH) when the market was falling. In general, for the intraday return with the binary target, higher confidence thresholds resulted in higher returns, and the highest confidence thresholds only outperformed the B&H strategy due to the outperformance during the onset of the Covid
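The double transaction cost incurred on each switch can be sketched as follows (a hypothetical helper, not the authors' simulation code):

```python
def apply_switch(capital, cost_rate=0.002):
    """Switching between SPY and SH: the 0.2% transaction cost is incurred
    twice, once for selling the current ETF and once for buying the other."""
    capital *= 1 - cost_rate   # sell the currently held ETF
    capital *= 1 - cost_rate   # buy the opposite ETF
    return capital
```

Each switch therefore costs roughly 0.4% of the capital, which is why frequent trading at low confidence thresholds erodes the strategy's returns.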


Fig. 8 Intraday return binary model on a timeline, 0.2% transaction costs

pandemic. It is noteworthy that for open-to-close predictions, the close-to-open return is not predicted but cannot be omitted from the trading simulation. Thus, each strategy incurs the open-to-open return for the SPY or SH held overnight on account of the intraday predictions. However, this seems to have a negligible effect on the performance of the strategy at the different thresholds, since the results for the open-to-open return are almost identical (Fig. 12, appendix). The main difference is that already a confidence threshold above 73% results in an essentially identical performance to the B&H strategy until Covid, and the best performance is achieved with a threshold of 85%.

Since the evaluation measures for the overnight returns were comparably poor, and since overnight returns are incurred by any strategy based only on decisions at the market open, the overnight returns are only included in a hybrid strategy, meaning that the strategy uses the predictions of the binary class for both the intraday (OC) and overnight (CO) returns. Thus, for this hybrid strategy, predictions and the decisions based on them can be made at both the opening and the closing of the market. The results for the hybrid strategy are displayed in Fig. 9. There is still some weak link between the confidence threshold and the performance of the strategy, but it is clearly weaker than for the intraday and open-to-open returns, for which the higher confidence thresholds almost consistently outperformed the lower thresholds. The 91 and 96% threshold strategies show a similar pattern as for the intraday and open-to-open returns and lead to the highest returns of all strategies. Using both the open and the close as decision points generally leads, for each confidence threshold, to a larger number of transactions, thus making this strategy's returns more susceptible to higher transaction costs.
It is noteworthy that the 71% threshold outperforms the B&H even before Covid, whereas the 86% confidence threshold clearly underperforms the B&H strategy pre-Covid. This highlights that higher confidence thresholds do not always lead to higher performance. Notwithstanding, the highest confidence thresholds clearly outperform the lowest thresholds for the binary strategy as well.


Fig. 9 Hybrid binary return model on a timeline, 0.2% transaction costs

In the following, the results for the ±0.5% target for the intraday return type are presented, for which the minimum threshold is set to one third, reflecting that there are three target classes (buy, neutral, sell). The results for the intraday returns are presented in Fig. 10 and show that the strategy outperforms the B&H when the confidence threshold is at least 57%. With this threshold, the strategy demonstrates a performance identical to the B&H for the first half of the period but outperforms it at the two market corrections around June 2019 and during the onset of Covid in February 2020, using only 9 transactions to do so. It is apparent that the highest thresholds of 70 and 79% once again perform similarly to the B&H and start to outperform it only during the beginning of the Covid pandemic. Thresholds above 79% are too high and do not result in any transactions, thus being identical to the B&H strategy. Interestingly, the strategies based on lower thresholds, which underperform the B&H over the entire period until Covid, close the

Fig. 10 Intraday return ±0.5% model on a timeline, 0.2% transaction costs


Fig. 11 Intraday return ±1% model on a timeline, 0.2% transaction costs

gap in performance almost entirely during the short period associated with the onset of Covid. The results for the open-to-open return with the ±0.5% target are again very similar to those for the intraday returns and are, thus, not discussed in more detail. Strategies based on higher confidence thresholds outperform the B&H, but do so only during the Covid period and are otherwise identical to the B&H strategy.

Lastly, the performance values for the ±1% target are discussed. The results for the intraday return with this threshold are illustrated in Fig. 11. For this stricter threshold, confidence thresholds of at least 44% outperform the B&H strategy. The highest performance is accomplished using the 45% confidence threshold. Interestingly, the outperformance is not mainly based on the Covid-linked market correction but originates from shorting the S&P 500 (using SH) during smaller market downturns such as in October 2018, August 2019 and October 2019. The performance corresponding to confidence thresholds over 56% perfectly reflects the B&H strategy, since not a single transaction is conducted with such confidence. Another insight is that most thresholds do not perform well during the Covid onset, since they are not short, or only briefly short, the S&P 500, in contrast to the models for the binary and ±0.5% targets. Another apparent behaviour is that low confidence thresholds seem to have a reverse trend to the S&P 500 index for long periods of time. This is likely a drawback of the strict ±1% target, since a short position that was initiated at some point can only be reversed once a "buy" decision is achieved. Thus, such short positions were kept for a long time and deteriorated the performance of the low-confidence strategies.
When turning to the open-to-open return with the ±1% target, the effect of being short for the low-confidence-threshold strategies was even stronger, with selling decisions made in December 2019 not being reverted until the onset of Covid. Once again, the highest-confidence-threshold strategies, with a confidence of 70% and above, performed identically to the B&H but outperformed it during Covid. Compared to the ±0.5% target, one notable aspect is that both strategies using the ±1% target make


more transactions and also tend to have more, or at least a comparable number of, non-neutral predictions. This is due to the high class weights of four and five used for the ±1% target to compensate for the high class imbalance.

Overall, for the binary target, the performance exceeds that of the B&H starting with a high confidence level of around 70% and remains high until around 90% or higher confidence. When the confidence threshold exceeds this level, it becomes too high to conduct any transactions. In addition, the highest outperformance is often accomplished with very high confidence of 70, 80, 90% and above and occurs essentially exclusively during the Covid period. For the ±0.5% target, this seems to be shifted, so that an outperformance of the B&H is accomplished earlier, with a confidence of less than 50% (even with transaction costs of up to 0.4%). However, lower confidence levels of 41% and about 70%, respectively, mark the peak outperformances for strategies using the ±0.5% target. For the ±1% target, the situation for the intraday return is even more pronounced, with comparably low confidence levels between 43 and 56% marking the outperformance of the strategy. The best performance is achieved with a low confidence of 45%, but also for 50 and 56% confidence the outperformance is based not just on the Covid period but also on two additional market corrections (Oct. 2018 and Aug. 2019). In contrast, the outperformance for the open-to-open return occurs at moderate confidence thresholds but peaks lower than for the binary target. Thus, for the intraday return, and to some degree for the open-to-open return, there seems to be an inverse relationship between the thresholds used for the target classes (binary, ±0.5, ±1%) and the profitable as well as optimal confidence thresholds. In particular, the confidence thresholds that outperform the B&H strategy decrease from the binary, over the ±0.5%, to the ±1% target.

6 Conclusion

This study focused on different return types (intraday, open-to-open and overnight) and created binary (buy, sell) and three-class targets (buy, neutral, sell) in order to test different neural-network-based trading strategies. Feature selection in the form of information gain and correlation-adjusted random forest importance showed that the relevant features for the intraday (open-to-close) and open-to-open returns were very similar, whereas for the close-to-open returns the selection of features was quite different. In particular, for the simple binary target (buy, sell), the former two returns emphasized the importance of the most recent returns (close-to-open) of other market indices such as the Nasdaq and STOXX 50, the VIX volatility index as well as treasury yields, whereas the latter, the overnight return, stressed the relevance of changes in the yield spread between long-term and medium/short-term treasury yields, and also included market indices. For the two threshold-based targets, investigated only for the intraday and open-to-open returns, the results showed that technical indicators additionally become relevant, namely the Stochastic %K, MACD and RSI. A confidence threshold for the neural network output neurons was selected so that a


buy or sell decision could only occur if the model showed sufficient confidence in the prediction. As can be expected, extremely high confidence is rarely obtained; thus, such confidence thresholds (usually around 90% and higher) did not result in any non-neutral prediction and earned only the B&H return. However, moderate to high confidence thresholds often led to a better performance than very low thresholds. Interestingly, the most profitable confidence thresholds seemed to depend on the target and on how strict the cut-off for inclusion in the buy and sell classes was. In particular, for the binary target, for which the border between buy and sell is 0%, higher confidence thresholds between 70 and 90% performed best, whereas for the ±0.5% and the stricter ±1% targets, the best confidence thresholds tended to be lower (41–70% and 45–56%, respectively). Thus, higher (absolute) return cut-offs for the buy and sell classes seemed to require less confidence in the model prediction to perform well than lower cut-offs (e.g., only 0% for the binary model). This appears plausible, since errors for the binary class automatically mean an opposite return, whereas strict cut-off points still enable a positive performance even when the class is not predicted correctly, allowing the acceptance of less confidence in the prediction. In comparison with the B&H strategy, many of the strategies with moderate to high confidence (depending on the class) performed identically to the B&H and only outperformed it during the Covid onset due to some or extensive shorting of the market. Since the 2007/08 financial crisis was included in the training data for the neural network, this may indicate that the models have learned to recognize very strong market corrections. Only with the ±1% target and moderate confidence thresholds did multiple models also manage to outperform the B&H strategy in selected smaller market corrections.
However, strategies with such a strict target tend to make fewer transactions, which can be problematic with lower confidence thresholds, as the inability to reverse an unprofitable short of the market demonstrated. Overall, strategies based on the three-class target seemed to perform better than those based on the binary target, but the success of using more classes depends on the choice of cut-offs for the classes and on the confidence threshold selected. For future research, it would be of interest to test different multi-class targets and cut-offs, and to examine the link between the confidence threshold and the performance of the trading strategies.

Acknowledgements The authors acknowledge that this paper is based on Leo Soukko’s master’s thesis titled “Neural network based binary and multi-class trading strategies using probability thresholds for trading actions on S&P 500 index”. The authors would like to thank Artur Vuorimaa for his help in editing the text.

A Neural Network Based Multi-class Trading …

Fig. 12 Open-to-open return binary model on a timeline, 0.2% transaction costs

Fig. 13 Open-to-open return ±0.5% model on a timeline, 0.2% transaction costs

Fig. 14 Open-to-open return ±1% model on a timeline, 0.2% transaction costs

Table 5 Top 10 features using information gain (IG) and correlation-adjusted random forest importance (cRFI) for intraday returns (OC) by target (binary, ±0.5% and ±1%). [Ranked feature lists and IG/cRFI scores not recoverable from the extracted text.]

Table 6 Top 10 features using information gain (IG) and correlation-adjusted random forest importance (cRFI) for open-to-open returns (OO) by target (binary, ±0.5% and ±1%). [Ranked feature lists and IG/cRFI scores not recoverable from the extracted text.]

Table 7 Top 10 features using information gain (IG) and correlation-adjusted random forest importance (cRFI) for close-to-close returns (CC) by target (binary, ±0.5% and ±1%). [Ranked feature lists and IG/cRFI scores not recoverable from the extracted text.]

Table 8 Top 10 features using information gain (IG) and correlation-adjusted random forest importance (cRFI) for overnight returns (CO) by target. [Ranked feature lists and IG/cRFI scores not recoverable from the extracted text.]

Table 9 Total return during the forecast time period for each strategy (B&H and confidence thresholds from 34 to 90%) and target (binary, ±0.5% and ±1% thresholds for the intraday (OC), open-to-open (OO) and hybrid (OC + CO) returns) with 0.2% transaction costs. [Table values not recoverable from the extracted text.]


Appendix

See Figures 12, 13, and 14 for the open-to-open returns of the three target types, Tables 5, 6, 7, and 8 for the top 10 features for the intraday (OC), open-to-open (OO), close-to-close (CC), and overnight (CO) targets based on the information gain and the correlation-adjusted random forest importance, as well as Table 9 for the total returns with 0.2% transaction costs.


Predicting Short-Term Traffic Speed and Speed Drops in the Urban Area of a Medium-Sized European City—A Traffic Control and Decision Support Perspective

Teemu Mankinen, Jan Stoklasa, and Pasi Luukka

Abstract Traffic speed and traffic jam prediction are necessary for a successful regulation of traffic flow and also for the prevention of accidents. This chapter contributes to the body of knowledge on traffic characteristics prediction by focusing on the possibilities of traffic speed prediction in an urban area of a medium-sized European city—the Finnish capital of Helsinki. The predictive ability of simple models such as ARIMA-family models, linear regression, K-nearest neighbor (KNN) and the extreme gradient boosted tree (XGBoost) is investigated with prediction horizons of 5, 10 and 15 min. The main goal is to find out if the results provided by these models can be sufficient for traffic control in medium-sized city areas. Open data is obtained from the Finnish Transport Agency, and the city of Helsinki is chosen for the purpose of the analysis. Particular attention is paid to the possibilities of predicting sudden speed drops and traffic jams in the highly regulated metropolitan area of Helsinki. Traffic and weather data are considered as inputs, and traffic jams are identified from the predicted speed, i.e. using a time-series approach, and using a classification approach. The results indicate that XGBoost outperforms all the other considered models for all prediction horizons, but the speed drops are clearly underestimated by the time-series models. On the other hand, classification-oriented models such as decision trees seem to be better suited for the prediction of traffic jams (speed drops below 40 km/h) from the same data and provide promising results.

T. Mankinen (B) · J. Stoklasa · P. Luukka School of Business and Management, LUT University, Yliopistonkatu 34, 53850 Lappeenranta, Finland e-mail: [email protected] J. Stoklasa e-mail: [email protected] P. Luukka e-mail: [email protected] J. Stoklasa Faculty of Arts, Department of Economic and Managerial Studies, Palacký University Olomouc, Křížkovského 8, 771 47 Olomouc, Czech Republic © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 P. Luukka and J. Stoklasa (eds.), Intelligent Systems and Applications in Business and Finance, Studies in Fuzziness and Soft Computing 415, https://doi.org/10.1007/978-3-030-93699-0_7

163

164

T. Mankinen et al.

Keywords Short-term · Speed · Irregularity · Prediction · Traffic control · Traffic jam · Medium-sized city · Finland · Decision tree · Classification · XGBoost

1 Introduction

The estimation of traffic-related variables has been at the center of interest of many researchers during the last decade. The ability to predict traffic flow and congestion plays a crucial role in sustainable traffic control and in the constant improvement of the reliability of transportation systems [18]. Ongoing expansion and changes in cities increase the complexity of travelling from one place to another, which results in a greater demand for traffic control. This in turn influences the travelling time within city areas. If travelling times and/or traffic density increase, citizens' quality of life deteriorates [27, 33]. The desire to save energy and reduce traffic pollution is among the key incentives to predict travel time and congestion in urban areas. ITS (Intelligent Transportation Systems) consist of various technologies that track the fluency and safety of ongoing traffic. Both short-term [5, 19] and long-term [35, 36] prediction horizons are being considered in the current literature. These technologies include, for example, GPS-related systems [19], video cameras, mounted traffic sensors and license-plate recognizers [21]. It is estimated that the global ITS market will increase from 20.22 billion US dollars (in 2015) up to 57.44 billion US dollars by the end of 2024 [30]. The driver for this rapid increase is the overall increase in demand for traffic control solutions, more precisely, adaptive signaling systems. Research is also progressing quickly in the area of in-vehicle decision support and assistance systems [25]. A lot of the current research is focused on the ability to predict highway traffic variables (see e.g. [16, 36]); when urban areas are considered, large (frequently Asian) cities are usually the focus of the authors. Even though the traffic control in large and complex areas of large cities is logically being assigned research priority, research focusing on small to middle-sized (European) cities is lacking. The aim of this paper is to examine the possibility of predicting short-term traffic speed in an urban area of a medium-sized European city. To be able to obtain the results of the predictions in real time, we focus on the comparison of the performance of simple models (namely models that are not computationally expensive and that are not combinations of different models or modeling frameworks). To this end, AR, ARIMA, linear regression, KNN, and XGBoost models were selected for the analysis. We also address the effectiveness of the models in dealing with abnormal events. We are interested in the ability to predict sudden speed drops, indicating accidents or other causes of the formation of traffic jams. We also seek to identify variables that can be used for such predictions. Successful real-time prediction of speed drops can provide needed inputs for adaptive network traffic models and enhance their performance. Even though we are well aware that complex network and system models, hybrid machine learning models and similar approaches are available in the literature, we pose the question of their necessity for dealing with a European city of medium size.

Predicting Short-Term Traffic Speed and Speed Drops …

165

This is why this study focuses on simpler prediction models and their performance in this setting. The data for our study is gathered from the urban area of the Finnish capital, Helsinki, and provided in the form of open data by the Finnish Transport Agency (see [12]). The paper is structured in the following way: after the introduction, in Sect. 2 we present a short overview of the research on the prediction of traffic flow in different contexts and the methods used for this purpose. Section 3 introduces a publicly available Finnish traffic data set used for the study presented in this paper and describes it briefly. A subset of the data corresponding with the urban area of the city of Helsinki (Finland) is described in more detail here as well. The methods applied for the analysis of the possibilities of speed (and speed drop) prediction in Helsinki are also summarized here. Section 4 focuses on the results of the speed-prediction models and discusses their performance on the chosen data, whereas Sect. 5 summarizes the results of the models built for the prediction of speed drops (traffic jams). The last section summarizes the findings and considers the implications of these findings for practice and for future research on the modelling of traffic flow.

2 A Brief Overview of Previous Studies

Various parameters and variables connected with traffic, its flow and its management have been under investigation by scientists and practitioners for a few decades already. As pointed out by Vlahogianni et al. [31], there are distinguishable stages or trends in traffic forecasting research. The defining characteristics of the research are the availability of computational power, sophisticated mathematical models and a sufficient quantity of data of proper quality. The shift from simple statistical (time-series) models such as AR(I)MA and GARCH towards more sophisticated adaptive network and artificial intelligence models is an apparent trend in the recent literature [13, 34]; combinations and ensembles of standard econometric techniques with deep learning and other machine learning methods are also being proposed [5]. Recently introduced results which cannot be obtained with simpler models (see e.g. [4, 10, 20, 22]) are due to the increase in computing power, the growing number of available complex models and the availability of more extensive data sets. Artificial intelligence, machine learning, genetic algorithms and their combinations are being proposed in the literature and in practice to increase forecasting precision in complex traffic settings (see e.g. [7]). Several researchers have investigated short-term traffic speed prediction using AI-based methods: for example, Wang and Shi [32] used chaos-wavelet support vector machine models and Cheng et al. [4] used an adaptive spatiotemporal K-nearest neighbor (KNN) model. Secondary incidents (crashes resulting from other “primary” crashes) are being predicted using stochastic gradient boosted decision trees [22].
Neural network-based models and deep learning have also been applied to traffic condition and traffic jam forecasting [13], long-term highway speed prediction [36], lane-level traffic state prediction on highways [16], long-term traffic speed forecasting in urban areas [35] and traffic flow forecasting [9, 28]. From the recent literature it is clear that the search for methods capable of providing high prediction accuracy is in full progress. However, the complexity of the forecasting models and their high demand for computational power can create problems of their own—incomplete data, too many sources of data to be maintained, high cost of data acquisition etc. can become an issue in forecasting [6]. The tradeoff between the complexity of the models and the quality of the forecasts on one hand, and the time, data and computational power requirements on the other hand, is apparent. Simpler models, if applicable, could offset slightly lower precision with speed and low cost, particularly now, when ensembles of simpler models are being proposed for predictive purposes [22, 38]. Vlahogianni et al. [31] also confirm that the prevailing focus on easily available data (motorway, freeway data) is shifting towards the creation and utilization of open data sets and towards new methods of traffic data acquisition, for example through the tracking of mobile phones. The question of which variables need to be monitored to successfully predict traffic-management-relevant variables is still an open one. It is also worth noting that the analyses and validations of the methods are frequently done using data on large cities or busy roads. This is to some extent understandable, as the relevance of such areas and the potential impact of the methods, measured by the amount of people/vehicles present there, is undoubtedly large. It, however, seems that small and medium-sized cities are left out of the main scope of the analyses. This study will therefore also attempt to add a medium-sized-city perspective to the current body of knowledge on traffic speed prediction. Traffic-related observations are usually combinations and relations of different attributes (speed, traffic flow, vehicle type, weather characteristics) within a certain time frame.
Predicting traffic occurrences, whether speed, flow or some other event, includes a time component that seems to justify and encourage the use of time-series forecasting, either as a benchmark or as an addition to more sophisticated models. For example, Guo and Williams [14] argued that predicting the short-term traffic state, considering the level of traffic and uncertainty, has the potential to reduce congestion through traffic operating systems. Overall, the selection of input variables, the choice of the model and also the context of modelling (urban area or a less controlled motorway/highway setting) determine, at least partially, the success of the prediction of either traffic parameters or their sudden changes. Prediction of traffic parameter changes in the complex and highly controlled setting of highly populated areas remains an open challenge, particularly when real-time predictions need to be obtained and high data maintenance and acquisition costs are to be avoided.
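As a concrete illustration of the time-series benchmark role mentioned above, the sketch below implements the simplest such baseline, a persistence ("last observed value") forecast for five-minute average speeds, together with the mean absolute error. The speed values are made up for illustration and are not from the study's data.

```python
# Illustrative time-series benchmark: a persistence ("last observed
# value") forecast for 5-minute average speeds, the simplest baseline
# any more sophisticated traffic model should beat. The speed values
# below are invented for illustration.

def persistence_forecast(series, horizon):
    """Forecast each point `horizon` steps ahead by the last known value."""
    return series[:-horizon] if horizon > 0 else list(series)

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

speeds = [62.0, 61.5, 60.0, 44.0, 41.0, 43.5, 58.0, 61.0]  # km/h, 5-min averages
h = 1  # one step = 5 minutes ahead
pred = persistence_forecast(speeds, h)
mae = mean_absolute_error(speeds[h:], pred)
print(round(mae, 2))  # 5.86
```

Note how the baseline misses the sudden drop from 60.0 to 44.0 km/h by a full step, which mirrors the chapter's observation that time-series models underestimate abrupt speed drops.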

3 Data Set and Applied Methods

In this paper we investigate the possibilities of short-term (max 15 min ahead) traffic speed prediction in an urban setting. We also investigate the possibilities of predicting speed drops that may signify traffic jams or accidents, for the purpose of the management of traffic flow and traffic jam avoidance in an urban area. We use Helsinki, the capital of Finland, as an example of a highly controlled traffic setting for several reasons. First of all, traffic and weather data are available for the city—open traffic data is obtained from the Finnish Transport Agency [12] and open weather data from the Finnish Meteorological Institute [11]. The traffic data available in the automatic traffic monitoring system in Finland (in the traffic management system points chosen for the analysis) contains very few missing values and as such provides a good starting point for the analysis. Also, the changes in weather conditions between summer and winter are significant in this city (see Table 3). Besides this, it is also a recently rediscovered tourist destination, which makes it attractive for investigating the effects of holidays and weekends on traffic speed predictions. We are specifically interested in the performance of classical time-series models of the ARMA family compared to KNN and XGBoost in short-term average traffic speed prediction. Our aim is to investigate whether the increased complexity of the models provides stronger forecasting power (in terms of the 5-min average speed during rush hours) in the highly controlled traffic setting of an urban area in the city of Helsinki. We also suggest a decision-tree-based model for the prediction of sudden speed drops and compare its performance in this task with the other models. Based on the results, we identify the traffic-related variables that are considered relevant by the selected predictive models. The traffic-related data is obtained from the Finnish Transport Agency, which has shared information on the traffic state since 1995 [12]. The data is gathered using TMS (automatic Traffic Monitoring Systems) and there are currently around 500 different measurement stations on Finnish roads. The variables gathered by TMS on the traffic are summarized in Table 1.
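The speed-drop prediction task mentioned above can be framed as binary classification over the five-minute intervals. The sketch below constructs such a target, using the 40 km/h speed-drop limit stated in the chapter's abstract; the function name and the example speeds are illustrative, not from the study.

```python
# Sketch of a classification target for traffic jam prediction:
# a 5-minute interval is labelled a jam when its average speed falls
# below 40 km/h (the speed-drop limit given in the chapter's abstract).
# Function name and example values are illustrative.

JAM_LIMIT_KMH = 40.0

def jam_labels(avg_speeds, limit=JAM_LIMIT_KMH):
    """1 = jam (speed drop below the limit), 0 = free-flowing traffic."""
    return [int(speed < limit) for speed in avg_speeds]

print(jam_labels([62.0, 55.5, 38.0, 41.0, 23.5]))  # [0, 0, 1, 0, 1]
```

A classifier (e.g. a decision tree, as suggested above) can then be trained on these labels instead of predicting the speed itself and thresholding afterwards.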
In addition, weather data is added in order to capture uncertainty related to different weather conditions. Several studies tend to focus on the effect of extreme weather conditions on traffic flow. For example, Guo et al. [15] have shown that intense rain can have a significant impact on transportation. Furthermore, temperature and visibility variables can have a notable effect on traffic volumes [26]. Additionally, considering lethal accidents, weather data helps to improve crash risk analysis in a cost-effective way [6]. The data is gathered from the Finnish Meteorological Institute using their open-source platform [11]. The open-source platform provides support for different research projects, cooperation and business for international and local use. The weather data is gathered from the Kumpula weather station located in Helsinki.
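Before modelling, the per-vehicle TMS records (Table 1) have to be aggregated into five-minute average speeds. A minimal sketch of this step is given below; records are simplified to (hour, minute, speed) tuples, so the field names and data are illustrative rather than the study's actual format.

```python
# Minimal sketch of aggregating per-vehicle TMS records (cf. Table 1)
# into five-minute average speeds. Records are simplified to
# (hour, minute, speed_kmh) tuples for illustration.
from collections import defaultdict

def five_minute_averages(records):
    """Group vehicle observations into 5-minute bins and average the speed."""
    bins = defaultdict(list)
    for hour, minute, speed in records:
        bins[(hour, minute // 5)].append(speed)
    # A bin with no observations would be a "missing value" in the
    # study's sense: no vehicle passed the TMS point in that interval.
    return {key: sum(v) / len(v) for key, v in sorted(bins.items())}

records = [(7, 0, 60.0), (7, 2, 64.0), (7, 4, 62.0), (7, 6, 38.0), (7, 9, 42.0)]
print(five_minute_averages(records))  # {(7, 0): 62.0, (7, 1): 40.0}
```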

3.1 Speed Prediction in Helsinki

For the purposes of this paper, we have selected six measurement points located in the Helsinki area. The designations of the points in the dataset [12] are 149, 148, 107, 4, 126 and 145 (see Fig. 1 for their locations). The points 149, 107 and 126 act as the main data points for speed prediction, i.e. the speed is predicted for these

168

T. Mankinen et al.

Table 1 Traffic data variables available in the Finnish Transport Agency's open source platform [12]

Variable            Value
TMS point           e.g. 149
Year                e.g. 18
Running day number  1–366
Hour                1–23
Minute              1–59
Second              1–59
1/100 Second        1–99
Vehicle length      e.g. 3.6 (m)
Lane                1–6
Direction           1–2
Vehicle class       1–7
Speed               e.g. 107 (km/h)
Faulty              0–1
Total time          Technical measure (not used in the study)
Time Interval       Technical measure (not used in the study)
Queue start         Technical measure (not used in the study)

points in our research. The measurement points 148, 4 and 145 act as their respective predecessors that provide values of traffic characteristics measured at a "previous" measurement point. The selected timeframe for the research is from November 1, 2017 until September 27, 2018. Even though the FTA data is available from 1995 onwards, we do not carry out the analysis on the whole dataset: our interest lies in the recent data, and the selected period already contains a sufficient number of observations for the analysis. Out of the three above-mentioned TMS points, we will focus mainly on the point 149 in this paper. The reason for this choice is that this paper investigates the possibilities of predicting speed drops, and TMS point 149 registered the most speed drops out of the selected points in the chosen period. Selected results for the other points will also be presented here when appropriate, particularly to stress the universality of our findings and to assess the performance of the chosen predictive methods on more datasets. In terms of the size of the dataset for the analysis, the measurement point 149 provides over 10 million vehicle observations during the chosen period. The data is then aggregated to five-minute average speed values, resulting in 95313 time-wise equidistant values. The number of missing values after the aggregation in the TMS point 149 is 319, which is 0.33% of the total amount (95313) of five-minute average observations. Missing values in this context refer to five-minute intervals that did not register any vehicles coming through the TMS point. In these cases the missing values were substituted by the average value for the respective five-minute period in the last 15 days to avoid misinterpreting the missing values as zero-speed periods, which could bias our analysis. To get an understanding of the completeness of this Finnish traffic dataset, there were 95 missing five-minute observations for the TMS point 107 and 26 missing five-minute observations for the TMS point 126 after the aggregation. The datasets for all three TMS points 149, 107 and 126 can be considered complete, with only short periods of missing measurements due to the inoperability of the sensors. This also attests to the quality and potential usefulness of the currently available traffic data in Finland.

Predicting Short-Term Traffic Speed and Speed Drops …

Fig. 1 A map of the Helsinki area with the identification of TMS points 107, 126 and 149 used in this research and their predecessors (4, 145 and 148 respectively). The direction of traffic for each of the analysed points is denoted by an arrow. Directions 1 and 2 refer to the direction of the traffic flow (specific lanes of the road). (Underlying map obtained from Google Maps, 2019)

Table 2 summarizes the variables used for the purposes of speed prediction in this research. Speed lag 1 expresses the 5-min average speed one step earlier, while speed lag 2 and speed lag 3 express the five-minute average speeds of the vehicles 10 and 15 min ago. The previous TMS point speed lag expresses the predecessor's (the TMS point preceding the analysed one in the given direction of traffic) speed five minutes ago. Count lag 1 is the last five-minute vehicle count, count lag 2 is the five-minute vehicle count 10 min ago and count lag 3 is the five-minute vehicle count 15 min ago. The vehicle class 1–7 is actually a shortcut for 7 variables (vehicle class 1, ..., vehicle class 7) that are used to obtain five-minute frequencies of seven types of vehicles passing through the given TMS point (car or van, truck, bus, truck with

Table 2 Variables used in the predictive models for the forecast of 5-min-average speed

Variable                        Category
Speed lag 1                     Continuous
Speed lag 2                     Continuous
Speed lag 3                     Continuous
Speed lag previous TMS point    Continuous
Count lag 1                     Continuous
Count lag 2                     Continuous
Count lag 3                     Continuous
Count lag previous TMS point    Continuous
Vehicle class 1–7               Integer
Hour                            Integer
Minute                          Integer
Morning Time                    Binary
Afternoon Time                  Binary
Weekend                         Binary
Holiday                         Binary
Rain                            Continuous
Snow                            Continuous
Temperature                     Continuous
Visibility                      Continuous
Wind speed                      Continuous

semi-trailer, truck with trailer, car or van with trailer, and car or van with a caravan or long trailer). Source values for all these variables are obtained from [12]. The temporal scope is captured using the hour-of-the-day and minute-of-the-hour variables. The hour can vary from 0 to 23 and the minute from 0 to 55 in 5-min steps. The effect of the rush hour is included using the morning time and afternoon time dummy variables. The morning time gets the binary value 1 if the hour of the day is between 7:00 and 9:00 and the value 0 otherwise, whereas the afternoon time is considered to be between 15:00 and 18:00. See Fig. 2 for an overview of the possible temporal effects that can be observed in the speed and count variables on an hourly basis for the TMS point 149. Figure 3 describes the development of the five-minute-average speed and of the five-minute vehicle counts during the whole analysed period. The weekend effect on the traffic flow is taken into account in the weekend dummy variable, which gets the value 0 when the day is a normal weekday and 1 when it is a weekend. Figure 4 shows the effect of the weekday on the traffic flow in the three considered TMS points. The holiday variable is a binary variable that gets the value 1 when the observation date is included in the Finnish national holiday calendar, that is, if the day is not considered to be a normal working day.
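The lag and dummy variables described above can be sketched in pandas as follows. This is a minimal sketch over a synthetic series: the exact boundary interpretation of the rush-hour windows (7:00–9:00 taken as hours 7–8, 15:00–18:00 as hours 15–17), the holiday calendar handling (omitted) and all column names are assumptions for illustration.

```python
import pandas as pd

# Synthetic 5-min average-speed series for two weekdays (1-2 March 2018)
idx = pd.date_range("2018-03-01 00:00", "2018-03-02 23:55", freq="5min")
df = pd.DataFrame({"speed": 75.0}, index=idx)

# Lagged 5-min average speeds: 5, 10 and 15 minutes back (speed lags 1-3)
for k in (1, 2, 3):
    df[f"speed_lag_{k}"] = df["speed"].shift(k)

# Temporal variables and rush-hour / weekend dummies as defined in the text
df["hour"] = df.index.hour
df["minute"] = df.index.minute
df["morning"] = df["hour"].isin([7, 8]).astype(int)          # 7:00-9:00 window
df["afternoon"] = df["hour"].isin([15, 16, 17]).astype(int)  # 15:00-18:00 window
df["weekend"] = (df.index.dayofweek >= 5).astype(int)        # Saturday/Sunday
```

The count lags and the predecessor-point lag would follow the same `shift` pattern on their respective series.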


Fig. 2 An overview of the average vehicle speed and count by the hour of the day for TMS point 149. A drop in the average speed in afternoon rush-hours is apparent, otherwise the average speed is stable and in accordance with the speed limit

Fig. 3 An overview of the five-minute-average speed and five-minute vehicle count for the TMS 149 for the whole period under investigation

Fig. 4 Average vehicle counts (for 5-min intervals) for all three analysed TMS points by weekdays


The descriptive statistics of the traffic and weather variables for TMS 149 are presented in Table 3.

3.2 Models Used in the Analysis

To obtain predictions of the 5-min average speed with prediction horizons of 5, 10 and 15 min ahead, we first employed the standard univariate ARMA-family models [2]. The order of the models was decided based on the Akaike information criterion with a subsequent residual check for autocorrelation. Linear regression models were fitted for the same purpose, utilising the lagged values of speed as well as the other pieces of information summarized in Table 2, including the counts of vehicles in the previous 5-min slots and time and weather data as additional predictors. These models were also used to identify variables that are potentially useful for the prediction of the traffic speed. To capture possible nonlinearities in the data, we have employed the K-Nearest Neighbor (KNN) approach to time-series prediction (see e.g. [4, 37]). In this case the K observations with conditions closest to the current conditions are used to calculate the prediction of the 5-min average speed as an average of the speeds corresponding with these observations. The value of K was optimized with respect to the prediction horizon based on the minimization of RMSE [8, 29]; 70% of the data was used for 5-fold cross-validation for the purpose of determining the optimal value of K. Finally, the Extreme Gradient Boosting (XGBoost) method was used to obtain the predictions and also to validate the selection of variables relevant for predictive purposes (see e.g. [3, 22, 38]). The parameters of XGBoost were set in the following way: loss reduction set to 0, minimum child weight set to 1; learning rate, column subsample ratio, maximum number of iterations and maximum tree depth obtained by grid search. The methods were selected based on an initial investigation carried out in [17]. To assess the ability of the above-mentioned models to identify anomalies in the traffic flow in terms of reduced traffic speed (i.e. to identify possible traffic jams) based on the 5-min average speed predictions, a benchmark representing a different traffic jam identification approach was employed. For this purpose we have fitted a decision-tree classifier [1, 23, 24] for the detection of traffic jams. Traffic jams are operationally defined for this paper as situations where the 5-min average speed drops below 40 km/h. The occurrence of a traffic jam in 5, 10 and 15 min was predicted using this approach.
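The KNN idea described above can be illustrated in a few lines: the K historical 5-min slots whose lagged speeds are closest to the current conditions are averaged to predict the next speed value. The sketch below uses a synthetic series and a fixed K; in the study K is tuned by minimizing the cross-validated RMSE.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
speed = 80 + np.cumsum(rng.normal(0, 0.5, 500))  # synthetic 5-min speed series

# Condition vector: the three previous 5-min average speeds (lags 1-3)
X = np.column_stack([speed[2:-1], speed[1:-2], speed[:-3]])
y = speed[3:]  # the 5-min average speed one step ahead

# Average the targets of the 10 most similar historical condition vectors
knn = KNeighborsRegressor(n_neighbors=10).fit(X[:400], y[:400])
pred = knn.predict(X[400:])
rmse = float(np.sqrt(np.mean((pred - y[400:]) ** 2)))
```

In the chapter's setting the condition vector would also carry the count lags, dummies and weather variables from Table 2.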

4 Results for Traffic Speed Prediction and Their Discussion

As indicated above, we have approached the goal of the identification of speed drops from two directions. First, we have fitted different models for the prediction of the actual speed (5-min average), where the decrease of speed or the occurrence of a

Table 3 Descriptive statistics for the selected variables in the dataset for TMS 149, considered potentially relevant for the prediction of speed drops in this TMS point, including weather-related variables. The characteristics in the table are presented for the 5-min average values of the variables, as they are used in the analyses

Variable                              Median   Mean     Mode     Std. dev.  Skewness  Kurtosis  Max      Sum
Speed (km/h)                          75.8     74.4     77       7.6        −3.9      16.6      105.0    –
Speed lag previous TMS point (km/h)   78.9     78.2     80       6.0        −6.8      56.6      110.0    –
Count (obs.)                          115.0    105.5    9.0      74.4       0.2       −1.1      335.0    10055158
Count lag previous TMS point (obs.)   113.0    106.6    8.0      78.0       0.3       −1.0      352.0    10158543
Vehicle 1 (obs.)                      106.0    97.8     7.0      69.9       0.2       −1.0      330.0    9320459
Vehicle 2 (obs.)                      1.0      2.3      0.0      2.7        1.3       1.4       18.0     223624
Vehicle 3 (obs.)                      1.0      0.9      0.0      1.0        1.2       1.3       9.0      87232
Vehicle 4 (obs.)                      1.0      1.0      0.0      1.3        1.7       3.5       12.0     96525
Vehicle 5 (obs.)                      1.0      1.4      0.0      1.6        1.4       2.4       14.0     132275
Vehicle 6 (obs.)                      1.0      1.6      0.0      1.9        1.4       1.7       15.0     151041
Vehicle 7 (obs.)                      0.0      0.5      0.0      0.8        2.1       5.5       8.0      43926
Rain (mm)                             0.0      0.0      0.0      0.1        24.8      1094.4    6.5      1238.3
Snow (cm)                             0.0      4.3      0.0      9.0        1.5       0.7       30.0     –
Temperature (°C)                      6.6      7.8      0.5      10.1       0.0       −0.8      31.6     –
Visibility (m)                        37890.0  32929.9  50000.0  16570.2    −0.5      −1.2      50000.0  –
Wind speed (m/s)                      4.0      4.3      3.5      2.0        0.7       0.5       14.7     –



traffic jam can be inferred from the raw value of the predicted speed. The other approach investigated the ability of the models to identify speed drops of a specific magnitude. We have employed all the previously considered models for this purpose. We have also added a binary classification model in this part of the analysis, where a Jam/No Jam state was predicted using a decision-tree classifier. In all the approaches we are interested not only in the ability to predict the traffic speed or a sudden decrease thereof, but also in the variables that seem to best indicate the occurrence of a drop in speed in the near future.
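The Jam/No Jam benchmark can be sketched as below. Only the 40 km/h jam threshold is taken from the text; the chosen features, the toy jam-generating rule and the tree depth are illustrative assumptions on synthetic data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 1000
speed_lag1 = rng.uniform(20.0, 90.0, n)   # previous 5-min average speed
count_lag1 = rng.integers(0, 300, n)      # previous 5-min vehicle count
afternoon = rng.integers(0, 2, n)         # afternoon rush-hour dummy

# Toy rule standing in for reality: a jam (future 5-min average speed
# below 40 km/h) follows slow, busy, afternoon slots
jam = ((speed_lag1 < 45) & (count_lag1 > 150) & (afternoon == 1)).astype(int)

X = np.column_stack([speed_lag1, count_lag1, afternoon])
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[:800], jam[:800])
accuracy = float(clf.score(X[800:], jam[800:]))
```

A shallow tree is used deliberately: its split thresholds can be read off directly, which matches the interest in identifying which variables indicate an upcoming speed drop.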

4.1 ARMA Models for Traffic Speed Prediction

First we focus on the actual speed prediction. The ARMA modelling framework was applied first to see whether there is enough information in the traffic speed time series themselves (the 5-min average speed is considered) to identify the future value of speed reasonably well. We have identified the ARIMA(1, 0, 4) model as the most appropriate for TMS point 149, the ARIMA(5, 1, 0) model for TMS point 107 and ARIMA(0, 1, 5) for TMS point 126. The results are summarized in Table 4. Note that although we consider three TMS points in the same period in the same city, the models differ also in the assumed underlying process. For TMS 126, the best-fitting model seems to be MA(5) on first differences, that is, a model with no direct relationship between the current (predicted) value of the speed and its previous values. For the other two TMS points, at least the previous value of the 5-min average speed in the given point seems to contribute to the current value of the speed. From these results we can conclude that knowledge of the value(s) of the 5-min average speed immediately preceding the value in which we are interested might be required. But the differences in the structure of the models suggest that external explanatory variables are needed as well. The datasets are split 70/30, where 70% of the data is used for training and the remaining 30% is held out for validation for the ARMA models, since ARMA models are pure time-series models. For the other methods, i.e. for KNN, linear regression and XGBoost, the best models are selected based on 5-fold cross-validation performed on the 70% training part of the data and their performance is then tested on the remaining 30%.

4.2 Linear Regression Models for Traffic Speed Prediction

Since the historical values of the time series of 5-min average speed did not provide sufficient information for the prediction of the traffic speed, we can consider including additional explanatory variables as well. Table 5 summarizes the results for linear regression models aiming at the prediction of the average traffic speed 5, 10 and 15 min into the future for TMS point 149, using not only previous values of the average speed, but also other traffic characteristics at the given TMS point as well
