Tilo Wendler • Sören Gröttrup

Data Mining with SPSS Modeler
Theory, Exercises and Solutions
Second Edition

Tilo Wendler, University of Applied Sciences HTW Berlin, Berlin, Germany
Sören Gröttrup, Technische Hochschule Ingolstadt, Ingolstadt, Germany
ISBN 978-3-030-54337-2
ISBN 978-3-030-54338-9 (eBook)
https://doi.org/10.1007/978-3-030-54338-9
Mathematics Subject Classification: 62-07, 62Hxx, 62Pxx, 62-01

© Springer Nature Switzerland AG 2016, 2021
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface to the Second Edition
For the second edition, we revised the book and updated it to IBM SPSS Modeler version 18.2, which, compared to the version used in the first edition, has received a completely new design as well as new functionalities and data mining algorithms packed into newly available nodes. Besides corrections of small inaccuracies and text errors, images are now presented in the design of Modeler version 18.2. A selection of the most important newly added nodes is included in this book and explained in detail along with their functionalities. We further revised the theory explanations and the descriptions of node functionalities to be more detailed and comprehensive and complemented them with some new exercises. The major changes are two new chapters: the first focuses on the problem of dealing with imbalanced data and resampling techniques, whereas the second describes an extensive case study on the cross-industry standard process for data mining. We included these chapters to give the reader a much better impression of the challenges that can appear when performing a data mining project and possible ways to face them.

Tilo Wendler, Berlin, Germany
Sören Gröttrup, Ingolstadt, Germany
December 2020
Preface to the First Edition
Data analytics, data mining, and big data are terms often used in everyday business. Companies collect more and more data and store it in databases, with the hope of finding helpful patterns that can improve business. Shortly after deciding to use more of such data, managers often confess that analyzing these datasets is resource-consuming and anything but easy. Involving the firm's IT experts leads to a discussion regarding which tools to use. Very few applications are available in the marketplace that are appropriate for handling large datasets in a professional way. Two commercial products worth mentioning are "Enterprise Miner" by SAS and "SPSS Modeler" by IBM. At first glance, these applications are easy to use. After a while, however, many users realize that more difficult questions require deeper statistical knowledge. Many people are interested in gaining such statistical skills and applying them, using one of the data mining tools offered by the industry. This book will help users to become familiar with a wide range of statistical concepts or algorithms and apply them to concrete datasets. After a short statistical overview of how the procedures work and what assumptions to keep in mind, step-by-step procedures show how to find the solutions with the SPSS Modeler.

Features of This Book
• Easy to read.
• Standardized chapter structure, including exercises and solutions.
• All datasets are provided as downloads and explained in detail.
• Template streams help the reader focus on the interesting parts of the stream and leave out recurring tasks.
• Complete solution streams are ready to use.
• Each example includes step-by-step explanations.
• Short explanations of the most important statistical assumptions used when applying the algorithms are included.
• Hundreds of screenshots are included, to ensure successful application of the algorithms to the datasets.
• Exercises teach how to secure and systematize this knowledge.
• Explanations and solutions are provided for all exercises.
• Skills acquired through solving the exercises allow the user to create his or her own streams.
The authors of the book, Tilo Wendler and Sören Gröttrup, want to thank all the people who supported the writing process. These include IBM support experts who dealt with some of the more difficult tasks, discovering more efficient ways to handle the data. Furthermore, the authors want to express gratitude to Jeni Ringland, Katrin Minor, and Maria Sabottke for their outstanding efforts and their help in professionalizing the text, layout, figures, and tables.

Tilo Wendler, Berlin, Germany
Sören Gröttrup, Ingolstadt, Germany
Contents

1 Introduction
   1.1 The Concept of the SPSS Modeler
   1.2 Structure and Features of This Book
      1.2.1 Prerequisites for Using This Book
      1.2.2 Structure of the Book and the Exercise/Solution Concept
      1.2.3 Using the Data and Streams Provided with the Book
      1.2.4 Datasets Provided with This Book
      1.2.5 Template Concept of This Book
   1.3 Introducing the Modeling Process
      1.3.1 Exercises
      1.3.2 Solutions
   References

2 Basic Functions of the SPSS Modeler
   2.1 Defining Streams and Scrolling Through a Dataset
   2.2 Switching Between Different Streams
   2.3 Defining or Modifying Value Labels
   2.4 Adding Comments to a Stream
   2.5 Exercises
   2.6 Solutions
   2.7 Data Handling Methods
      2.7.1 Theory
      2.7.2 Calculations
      2.7.3 String Functions
      2.7.4 Extracting/Selecting Records
      2.7.5 Filtering Data
      2.7.6 Data Standardization: Z-Transformation
      2.7.7 Partitioning Datasets
      2.7.8 Sampling Methods
      2.7.9 Merge Datasets
      2.7.10 Append Datasets
      2.7.11 Exercises
      2.7.12 Solutions
   References

3 Univariate Statistics
   3.1 Theory
      3.1.1 Discrete Versus Continuous Variables
      3.1.2 Scales of Measurement
      3.1.3 Exercises
      3.1.4 Solutions
   3.2 Simple Data Examination Tasks
      3.2.1 Theory
      3.2.2 Frequency Distribution of Discrete Variables
      3.2.3 Frequency Distribution of Continuous Variables
      3.2.4 Distribution Analysis with the Data Audit Node
      3.2.5 Concept of "SuperNodes" and Transforming a Variable to Normality
      3.2.6 Reclassifying Values
      3.2.7 Binning Continuous Data
      3.2.8 Exercises
      3.2.9 Solutions
   References

4 Multivariate Statistics
   4.1 Theory
   4.2 Scatterplot
   4.3 Scatterplot Matrix
   4.4 Correlation
   4.5 Correlation Matrix
   4.6 Exclusion of Spurious Correlations
   4.7 Contingency Tables
   4.8 Exercises
   4.9 Solutions
   References

5 Regression Models
   5.1 Introduction to Regression Models
      5.1.1 Motivating Examples
      5.1.2 Concept of the Modeling Process and Cross-Validation
   5.2 Simple Linear Regression
      5.2.1 Theory
      5.2.2 Building the Stream in SPSS Modeler
      5.2.3 Identification and Interpretation of the Model Parameters
      5.2.4 Assessment of the Goodness of Fit
      5.2.5 Predicting Unknown Values
      5.2.6 Exercises
      5.2.7 Solutions
   5.3 Multiple Linear Regression
      5.3.1 Theory
      5.3.2 Building the Model in SPSS Modeler
      5.3.3 Final MLR Model and Its Goodness of Fit
      5.3.4 Prediction of Unknown Values
      5.3.5 Cross-Validation of the Model
      5.3.6 Boosting and Bagging (for Regression Models)
      5.3.7 Exercises
      5.3.8 Solutions
   5.4 Generalized Linear (Mixed) Model
      5.4.1 Theory
      5.4.2 Building a Model with the GLMM Node
      5.4.3 The Model Nugget
      5.4.4 Cross-Validation and Fitting a Quadric Regression Model
      5.4.5 Exercises
      5.4.6 Solutions
   5.5 The Auto Numeric Node
      5.5.1 Building a Stream with the Auto Numeric Node
      5.5.2 The Auto Numeric Model Nugget
      5.5.3 Exercises
      5.5.4 Solutions
   References

6 Factor Analysis
   6.1 Motivating Example
   6.2 General Theory of Factor Analysis
   6.3 Principal Component Analysis
      6.3.1 Theory
      6.3.2 Building a Model in SPSS Modeler
      6.3.3 Exercises
      6.3.4 Solutions
   6.4 Principal Factor Analysis
      6.4.1 Theory
      6.4.2 Building a Model
      6.4.3 Feature Selection vs. Feature Reduction
      6.4.4 Exercises
      6.4.5 Solutions
   References

7 Cluster Analysis
   7.1 Motivating Examples
   7.2 General Theory of Cluster Analysis
      7.2.1 Exercises
      7.2.2 Solutions
   7.3 TwoStep Hierarchical Agglomerative Clustering
      7.3.1 Theory of Hierarchical Clustering
      7.3.2 Characteristics of the TwoStep Algorithm
      7.3.3 Building a Model in SPSS Modeler
      7.3.4 Exercises
      7.3.5 Solutions
   7.4 K-Means Partitioning Clustering
      7.4.1 Theory
      7.4.2 Building a Model in SPSS Modeler
      7.4.3 Exercises
      7.4.4 Solutions
   7.5 Auto Clustering
      7.5.1 Motivation and Implementation of the Auto Cluster Node
      7.5.2 Building a Model in SPSS Modeler
      7.5.3 Exercises
      7.5.4 Solutions
   7.6 Summary
   References

8 Classification Models
   8.1 Motivating Examples
   8.2 General Theory of Classification Models
      8.2.1 Process of Training and Using a Classification Model
      8.2.2 Classification Algorithms
      8.2.3 Classification Versus Clustering
      8.2.4 Decision Boundary and the Problem with Over- and Underfitting
      8.2.5 Performance Measures of Classification Models
      8.2.6 The Analysis Node
      8.2.7 The Evaluation Node
      8.2.8 A Detailed Example how to Create a ROC Curve
      8.2.9 Exercises
      8.2.10 Solutions
   8.3 Logistic Regression
      8.3.1 Theory
      8.3.2 Building the Model in SPSS Modeler
      8.3.3 Optional: Model Types and Variable Interactions
      8.3.4 Final Model and Its Goodness of Fit
      8.3.5 Classification of Unknown Values
      8.3.6 Cross-Validation of the Model
      8.3.7 Exercises
      8.3.8 Solutions
   8.4 Linear Discriminate Classification
      8.4.1 Theory
      8.4.2 Building the Model with SPSS Modeler
      8.4.3 The Model Nugget and the Estimated Model Parameters
      8.4.4 Exercises
      8.4.5 Solutions
   8.5 Support Vector Machine
      8.5.1 Theory
      8.5.2 Building the Model with SPSS Modeler
      8.5.3 The Model Nugget
      8.5.4 Exercises
      8.5.5 Solutions
   8.6 Neuronal Networks
      8.6.1 Theory
      8.6.2 Building a Network with SPSS Modeler
      8.6.3 The Model Nugget
      8.6.4 Exercises
      8.6.5 Solutions
   8.7 K-Nearest Neighbor
      8.7.1 Theory
      8.7.2 Building the Model with SPSS Modeler
      8.7.3 The Model Nugget
      8.7.4 Dimensional Reduction with PCA for Data Preprocessing
      8.7.5 Exercises
      8.7.6 Solutions
   8.8 Decision Trees
      8.8.1 Theory
      8.8.2 Building a Decision Tree with the C5.0 Node
      8.8.3 The Model Nugget
      8.8.4 Building a Decision Tree with the CHAID Node
      8.8.5 Exercises
      8.8.6 Solutions
   8.9 The Auto Classifier Node
      8.9.1 Building a Stream with the Auto Classifier Node
      8.9.2 The Auto Classifier Model Nugget
      8.9.3 Exercises
      8.9.4 Solutions
   References

9 Using R with the Modeler
   9.1 Advantages of R with the Modeler
   9.2 Connecting with R
   9.3 Test the SPSS Modeler Connection to R
   9.4 Calculating New Variables in R
   9.5 Model Building in R
   9.6 Modifying the Data Structure in R
   9.7 Solutions
   References

10 Imbalanced Data and Resampling Techniques
   10.1 Characteristics of Imbalanced Datasets and Consequences
   10.2 Resampling Techniques
      10.2.1 Random Oversampling Examples (ROSE)
      10.2.2 Synthetic Minority Oversampling Technique (SMOTE)
      10.2.3 Adaptive Synthetic Sampling Method (ADASYN)
   10.3 Implementation in SPSS Modeler
   10.4 Using R to Implement Balancing Methods
      10.4.1 SMOTE-Approach Using R
      10.4.2 ROSE-Approach Using R
   10.5 Exercises
      10.5.1 Exercise 1: Recap Imbalanced Data
      10.5.2 Exercise 2: Resampling Application to Identify Cancer
      10.5.3 Exercise 3: Comparing Resampling Algorithms
   10.6 Solutions
      10.6.1 Exercise 1: Recap Imbalanced Data
      10.6.2 Exercise 2: Resampling Application to Identify Cancer
      10.6.3 Exercise 3: Comparing Resampling Algorithms
   References

11 Case Study: Fault Detection in Semiconductor Manufacturing Process
   11.1 Case Study Background
   11.2 The Standard Process in Data Mining
      11.2.1 Business Understanding (CRISP-DM Step 1)
      11.2.2 Data Understanding (CRISP-DM Step 2)
      11.2.3 Data Preparation (CRISP-DM Step 3)
      11.2.4 Modeling (CRISP-DM Step 4)
      11.2.5 Evaluation and Deployment of Model (CRISP-DM Steps 5 and 6)
   11.3 Lessons Learned
   11.4 Exercises
   11.5 Solutions
   References

12 Appendix
   12.1 Data Sets Used in This Book
      12.1.1 adult_income_data.txt
      12.1.2 bank_full.csv
      12.1.3 beer.sav
      12.1.4 benchmark.xlsx
      12.1.5 car_simple.sav
      12.1.6 car_sales_modified.sav
      12.1.7 chess_endgame_data.txt
      12.1.8 credit_card_sampling_data.sav
      12.1.9 customer_bank_data.csv
      12.1.10 diabetes_data_reduced.sav
      12.1.11 DRUG1n.csv
      12.1.12 EEG_Sleep_Signals.csv
      12.1.13 employee_dataset_001 and employee_dataset_002
      12.1.14 England Payment Datasets
      12.1.15 Features_eeg_signals.csv
      12.1.16 gene_expression_leukemia_all.csv
      12.1.17 gene_expression_leukemia_short.csv
      12.1.18 gravity_constant_data.csv
      12.1.19 hacide_train.SAV and hacide_test.SAV
      12.1.20 Housing.data.txt
      12.1.21 income_vs_purchase.sav
      12.1.22 Iris.csv
      12.1.23 IT-projects.txt
      12.1.24 IT user satisfaction.sav
      12.1.25 longley.csv
      12.1.26 LPGA2009.csv
      12.1.27 Mtcars.csv
      12.1.28 nutrition_habites.sav
      12.1.29 optdigits_training.txt, optdigits_test.txt
      12.1.30 Orthodont.csv
      12.1.31 Ozone.csv
      12.1.32 pisa2012_math_q45.sav
      12.1.33 sales_list.sav
      12.1.34 secom.sav
      12.1.35 ships.csv
      12.1.36 test_scores.sav
      12.1.37 Titanic.xlsx
      12.1.38 tree_credit.sav
      12.1.39 wine_data.txt
      12.1.40 WisconsinBreastCancerData.csv and wisconsin_breast_cancer_data.sav
      12.1.41 z_pm_customer1.sav
   References
1 Introduction
The amount of collected data has risen exponentially over the last decade, as companies worldwide store more and more data on customer interactions, sales, logistics, and production processes. For example, Walmart handles up to ten million transactions and 5000 items per second (see Walmart 2012) and feeds databases of petabyte scale. To put that in context, that is the same as the total number of letters sent within the United States over six months. In the last decade, companies have discovered that understanding their collected data is a very powerful tool that can reduce overheads and optimize workflows, giving their firms a big advantage in the market. The challenge in this area of data analytics is to consolidate data from different sources and analyze the data, to find new structures, patterns, and rules that can help predict the future of the business more precisely or create new fields of business activity. The job of the data scientist is predicted to become one of the most interesting jobs of the twenty-first century (see Davenport and Patil 2012), and there is currently strong competition for the brightest and most talented analysts in this field.
A wide range of applications, such as R, Python, SAS, MATLAB, and SPSS Statistics, provide a huge toolbox of methods to analyze large data and can be used by experts to find patterns and interesting structures in the data. Many of these tools are mainly programming languages, which assume the analyst has deeper programming skills and an advanced background in IT and mathematics. Since this field is becoming more important, data analysis software with a graphical user interface is starting to enter the market, providing "drag and drop" mechanisms for career changers and people who are not experts in programming or statistics. One of these easy-to-handle data analytics applications is the IBM SPSS Modeler. This book is dedicated to the introduction and explanation of its data analysis power, delivered in the form of a teaching course.
1.1 The Concept of the SPSS Modeler
IBM's SPSS Modeler offers more than a single application. In fact, it is a family of software tools based on the same concept. These are:
– SPSS Modeler (Professional, Premium, Gold)
– SPSS Modeler Server
– SPSS Modeler Gold on Cloud
– SPSS Predictive Analytics Enterprise (combination of several products)
– IBM SPSS Modeler for Linux on System z
The difference between the versions of the Modeler is in the implemented functionalities or models (see Table 1.1). IBM has excellently managed the implementation and deployment of the application concept through different channels. Here we will show how to use the SPSS Modeler Premium software functionalities. The IBM SPSS Modeler is a powerful tool that enables the user access to a wide range of statistical procedures for analyzing data and developing predictive models. Upon finishing the modeling procedure, the models can be deployed and implemented in data analytics production processes. The SPSS Modeler’s concepts help companies to handle data analytics processes in a very professional way. In complex environments, however, the Modeler cannot be installed as a stand-alone solution. In these cases, the Modeler Server provides more power to users. Still, it is based on the same application concept. "
SPSS Modeler is a family of software tools provided for use with a wide range of different operating systems/platforms. Besides the SPSS Modeler Premium used here in this book, other versions of the Modeler exist. Most of the algorithms discussed here are available in all versions.
Table 1.1 Features of the different SPSS Modeler editions (Source: IBM Website 2019b)
The table compares the editions by deployment, techniques, capabilities, and enhancements:
– SPSS Modeler Professional: deployment on desktop/server; classification, segmentation, and association techniques.
– SPSS Modeler Premium: everything in Professional, plus text analytics, entity analytics, and social network analysis.
– SPSS Modeler Gold: everything in Premium, plus analytical decision management and collaboration and deployment services.
Fig. 1.1 SPSS Modeler GUI (with the toolbar, the stream canvas, the Modeler manager, and the nodes palette)
The Modeler's Graphical User Interface (GUI)
To start the SPSS Modeler, we click on Start > Programs > IBM SPSS Modeler 18.2 > IBM SPSS Modeler 18.2. Figure 1.1 shows the Modeler workspace with a stream. We can load this particular stream by using the toolbar item and navigating to the folder with the stream files provided with the book. The stream name here is "cluster_nutrition_habits.str". The stream is the solution to an exercise using cluster analysis methods discussed later in this book.
The user can modify the GUI colors, font sizes, etc. via "Tools > User Options > Display".
At the bottom of the workspace, we can find the nodes palette. Here all available nodes are organized in tabs by topic. First, we have to click the proper tab and then select the nodes we need to build the stream. We will show how to build a stream in Sect. 2.1. If the tab "Streams" is activated in the upper part of the Modeler's manager, on the right in Fig. 1.1, we can switch to one of the streams we opened before. We can also inspect the so-called SuperNodes. We explain the SuperNode concept in Sect. 3.2.5. In the "Outputs" tab of the Modeler manager, we can find the relevant outputs created by a stream. Additionally, an important part of the Modeler manager is the tab "Models". Here, we find all the models created by one of the open streams. If necessary, these models can be added to a stream. For now, these are the most important details of the IBM SPSS Modeler editions and the workspace. We will go into more detail in the following chapters.
To conclude this introduction to the Modeler, we want to present a list of the advantages and challenges of this application, by way of a summary of our findings while working with it.

Advantages of Using IBM SPSS Modeler
– The Modeler supports data analysis and the model building process, with its accessible graphical user interface and its elaborate stream concept. Creating a stream allows the data scientist to very efficiently model and implement the several steps necessary for data transformation, analysis, and model building. The result is very comprehensible; even other statisticians who are not involved in the process can understand the streams and the models very easily.
– Due to its powerful statistics engine, the Modeler can be used to handle and analyze huge datasets, even on computers with restricted performance. The application is very stable.
– The Modeler offers a connection with the statistical programming languages R and Python. We can link to the dataset, do several calculations in R or Python, and send the results back to the Modeler. Users wishing to retain the functionalities of these programming languages have the chance to implement them in the Modeler streams. We will show how to install and how to use R in Chap. 9; Python works similarly. (A short sketch of such an R step follows the list of challenges below.)
– Finally, we have to mention the very good IBM support that comes as part of the professional implementation of the Modeler in a firm's analytics environment. Even the best application sometimes raises queries that must be discussed with experts. IBM support is very helpful and reliable.

Challenges with IBM SPSS Modeler
– IBM's strategy is to deliver a meticulously thought-out concept in data mining tools, to users wishing to learn how to apply statistics to real-world data. At first glance, the SPSS Modeler can be used in a very self-explanatory way. The potential risk is that difficult statistical methods can be applied in cases where the data does not meet all the necessary assumptions, and so inaccuracies can occur.
– This leads us to a fundamental criticism: the Modeler focuses on handling large datasets in a very efficient way, but it does not provide that much detailed information or statistics on the goodness of the data and the models developed. A well-trained statistician may argue that other applications better support assessment of the models built. As an example, we will stress here the cluster analysis method. Statistics to assess the correlation matrix, such as the KMO Bartlett test or the anti-image correlation matrix, are not provided. Also, a scree plot is missing. This is particularly hard to understand as the Modeler is obviously based on the program IBM SPSS Statistics, in which often more details are provided.
– Furthermore, results calculated in a stream, e.g., factor loadings, must often be used in another context. The Modeler shows the calculation results in a well-structured output tab, but the application does not provide an efficient way to
access the results with full precision from other nodes. To deal with the results in other calculations, they must often be copied manually.
– Data transformation and aggregation for further analysis is tedious. The calculation of new features by aggregating other variables is challenging and can greatly increase the complexity of a stream. Outsourcing these transformations into a different statistical program, e.g., using the R node, is often more efficient and flexible.
All in all, the IBM SPSS Modeler, like any other application, has its advantages and its drawbacks. Users can analyze immense datasets in a very efficient way. Statisticians with a deep understanding of "goodness of fit" measures and sophisticated methods for creating models will not always be satisfied, because of the limited amount of output detail available. However, the IBM SPSS Modeler is justifiably one of the leading data mining applications on the market.
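To illustrate the R link mentioned among the advantages above, the following minimal sketch shows the kind of code that could run inside one of the Modeler's R extension nodes, which are covered in detail in Chap. 9. It assumes that the node exposes the incoming records as the data frame modelerData and reads the modified frame back from the same variable; the column name income is purely hypothetical.

# Minimal sketch of an R step inside a Modeler R extension node.
# Assumption: the node passes the current records in as the data frame
# 'modelerData' and picks the (possibly modified) frame up again from the
# same variable. The numeric column 'income' is a hypothetical field name.

df <- modelerData

# z-standardize the income column in place (mean 0, standard deviation 1)
df$income <- (df$income - mean(df$income, na.rm = TRUE)) /
             sd(df$income, na.rm = TRUE)

# hand the records back to the stream
modelerData <- df

Keeping the number and type of columns unchanged avoids having to adjust the field metadata that the node also maintains; Chap. 9 discusses the complete node setup.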
1.2 Structure and Features of This Book

1.2.1 Prerequisites for Using This Book
This book can be read and used with minimal mathematical and statistical knowledge. Besides an interest in dealing with statistical topics, the reader is required to have a general understanding of statistics and its basic terms and measures, like frequency, frequency distribution, mean, and standard deviation. Deeper statistical and mathematical concepts are briefly explained in the book when needed, or references to the relevant literature are given, where the theory is explained in an understandable way. The following books focusing on more basic statistical analyses are recommended to the reader: Herkenhoff and Fogli (2013) and Weiers et al. (2011). Since the main purpose of this book is the introduction of the IBM SPSS Modeler and how it can be used for data mining, the reader needs a valid IBM SPSS Modeler license in order to properly work with this book and solve the provided exercises.
For readers who are not completely familiar with the theoretical background, the authors explain at the beginning of each chapter the most relevant and needed statistical and mathematical fundamentals and give examples where the outlined methods can be applied in practice. Furthermore, many exercises are related to the explanations in the chapters and are intended for the reader to recapitulate the theory and gain more advanced statistical knowledge. The detailed explanations of and comments on these exercises clarify further essential terms and procedures used in the field of statistics.
The reader of this book should have a valid IBM SPSS Modeler license and be interested in dealing with current issues in statistical and data analysis applications. This book facilitates easy familiarization with statistical and data mining concepts because:
– No advanced mathematical or statistical knowledge is required besides some general statistical understanding and knowledge of basic statistical terms.
– Chapters start with explanations of the necessity of the procedures discussed.
– Exercises cover all levels of complexity, from the basics of statistics to advanced predictive modeling.
– Detailed explanations of the exercises help the reader to understand the terms and concepts used.
The following books focusing on more basic statistical analyses are recommended to the reader:
Herkenhoff, L. and Fogli, J. (2013), Applied Statistics for Business and Management Using Microsoft Excel, Springer, New York.
Weiers, R.M., Gray, J.B. and Peters, L.H. (2011), Introduction to Business Statistics, 7th ed., South-Western Cengage Learning, Australia, Mason, OH.
1.2.2 Structure of the Book and the Exercise/Solution Concept
The goal of this book is to help users of the Modeler to become familiar with the wide range of data analysis and modeling methods offered by this application. To this end, the book has the structure of a course or teaching book. Figure 1.2 shows the topics discussed, allowing the user easy access to different focus points and the information needed for answering his own particular questions. Each section has the following structure:
1. An introduction to the basic principles of using the Modeler nodes.
2. A short description of the theoretical or statistical background.
3. Learning how to use the statistical methods by applying them to an example.
4. Figuring out the most important parameters and their meaning, using "What-If?" scenarios.
5. Solving exercises and reviewing the solution.
Introducing the Details of the Streams Discussed
At the beginning of each section, an overview can be found, as shown in Table 1.2, for the reader to identify the necessary dataset and the streams that will be discussed in the section. The names of the dataset and stream are listed before the final stream is depicted, followed by additional important details. At the end, the exercise numbers are shown, where the reader can test and practice what he/she has learned.
Exercises and Solutions
To give the reader a chance to assess what he/she has learned, exercises and their solutions can be found at the end of each section. The text in the solution usually explains any details worth bearing in mind with regard to the topic discussed.
Fig. 1.2 Topics discussed and structure of the book: basic functionalities of the IBM SPSS Modeler (stream building and handling, notes and comments in a stream); univariate statistics (discrete and continuous variables, scale types, distributions, transformations, binning); multivariate statistics (scatterplots and scatterplot matrices, correlation and correlation matrices, contingency tables); regression models (theory, simple and multiple linear regression, generalized linear (mixed) models, auto model building); factor and cluster analysis (PCA/PFA, clustering by example, TwoStep, K-Means, auto clustering); classification (logistic regression, LDA, SVM, neural networks, K-nearest neighbor, decision trees); extensions and special topics (R/Python, imbalanced data, case study)

Table 1.2 Overview of stream and data structure used in this book
Stream name: distribution_analysis_using_data_audit_node
Based on dataset: tree_credit.sav
Stream structure: (stream diagram)
Important additional remarks: It is important to define the scale type of each variable correctly, so the Data Audit node applies the proper chart (bar chart or histogram) to each variable. For discrete variables, the SPSS Modeler uses a bar chart, whereas for continuous/metric variables, the distribution is visualized with a histogram.
Related exercises: 10
To link the more extensive discussion in the theoretical part with the questions at the end of the book, each solution begins with a table, as shown in Table 1.3. Here, a cross-reference to the theoretical background can be found.
Table 1.3 Example of a solution
Name of the solution streams: file name of the solution
Theory discussed in section: Section XXX

1.2.3 Using the Data and Streams Provided with the Book
The SPSS Modeler streams need access to datasets that can then be analyzed. In the so-called "Source" nodes, the path to the dataset folder must be defined. To work more comfortably, the streams provided with this book are based on the following logic:
– All datasets are copied into one folder on the "C:" drive.
– There is just one file "registry_add_key.bat" in the folder "C:\SPSS_MODELER_BOOK\".
– The name of the dataset folder is "C:\SPSS_MODELER_BOOK\001_Datasets".
– The streams are normally copied to "C:\SPSS_MODELER_BOOK\002_Streams", but the folder names can be different.
If other folders are to be used, then the procedure described here, in particular the batch file, must be modified slightly.
All datasets, IBM SPSS Modeler streams, R scripts, and Microsoft Excel files discussed in this book are provided as downloads on the website:
http://www.statistical-analytics.net
Password: "modelerspringer2020"
For ease, the user can add a key to the registry of Microsoft Windows. This is done using the script "registry_add_key.bat" provided with this book.
Alternatively, the commands
REG ADD "HKLM\Software\IBM\IBM SPSS Modeler\18.2\Environment" /v "BOOKDATA" /t REG_SZ /d "C:\SPSS_MODELER_BOOK\001_Datasets" /reg:32
and
REG ADD "HKLM\Software\IBM\IBM SPSS Modeler\18.2\Environment" /v "BOOKDATA" /t REG_SZ /d "C:\SPSS_MODELER_BOOK\001_Datasets" /reg:64
can be used, where "C:\SPSS_MODELER_BOOK\001_Datasets" is the folder of datasets. This modifies the values in the registry folder "Computer\HKEY_LOCAL_MACHINE\SOFTWARE\IBM\IBM SPSS Modeler\18.2\Environment".
To work with this book, we recommend the following steps:
• Install SPSS Modeler Premium (or another version) on your computer.
• Download the ZIP file with all files related to the book from the website http://www.statistical-analytics.net. The password is "spssmodelerspringer".
• Move the folder "C:\SPSS_MODELER_BOOK" from the ZIP file to your disk drive "C:".
• Add a key to the Microsoft Windows Registry that will allow the SPSS Modeler to find the datasets. To do so:
(a) Navigate to the BATCH file named "registry_add_key.bat" in the folder "C:\SPSS_MODELER_BOOK\".
(b) Right-click on the file and choose the option "Run as Administrator". This allows the script to add the key to the registry.
(c) After adding the key, restart the computer.
The key can also be added to the Windows Registry manually by using the command:
REG ADD "HKLM\Software\IBM\IBM SPSS Modeler\18.2\Environment" /v "BOOKDATA" /t REG_SZ /d "C:\SPSS_MODELER_BOOK\001_Datasets"
Instead of using the original Windows folder name, e.g., "C:\SPSS_MODELER_BOOK\001_Datasets", to address a dataset, the shortcut "$BOOKDATA" defined in the Windows registry should now be used (see also IBM Website (2015)). We should pay attention to the fact that the backslash in the path must be substituted with a slash. So, the path "C:\SPSS_MODELER_BOOK\001_Datasets\car_sales_modified.sav" equals "$BOOKDATA/car_sales_modified.sav".
1.2.4 Datasets Provided with This Book
With this book, the reader has access to more than 30 datasets that can be downloaded from the authors' website using the given password (see Sect. 1.3.1). The data are available in different file formats, such as:
– TXT files with spaces between the values
– CSV files with values separated by a comma
Additionally, the reader can download extra files in R or in Microsoft Excel format. This enables the user to do several analysis steps in other programs too, e.g., R or Microsoft Excel. Furthermore, some calculations are presented in Microsoft Excel files.
All datasets discussed in this book can be downloaded from the authors' website using the password provided. The datasets have different file formats, so the user learns to deal with different nodes to load the sets for data analysis purposes. Additionally, R and Microsoft Excel files are provided to demonstrate some calculations. Since version 17, the SPSS Modeler does not support Excel file formats from 1997 to 2003. It is necessary to use file formats from at least 2007.
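Readers who want to inspect the downloaded files outside the Modeler can open the two text formats with base R, for example. The following sketch is only an illustration: it assumes the files were extracted to C:/SPSS_MODELER_BOOK/001_Datasets and that "wine_data.txt" and "Iris.csv" follow the separator conventions mentioned above (whitespace vs. comma); whether a file contains a header row should be checked for each individual file.

# Hedged example: paths, separators, and header settings may need to be
# adapted to the individual file.
data_dir <- "C:/SPSS_MODELER_BOOK/001_Datasets"

# a whitespace-separated TXT file
wine <- read.table(file.path(data_dir, "wine_data.txt"), header = TRUE)

# a comma-separated CSV file
iris_data <- read.csv(file.path(data_dir, "Iris.csv"), header = TRUE)

str(wine)       # inspect the imported columns and their types
str(iris_data)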
1.2.5 Template Concept of This Book
In this book, we show how to build different streams and how to use them for the analysis of datasets. We will explain now how to create a stream from scratch. Despite outlining this process, it is unhelpful and time-consuming to always have to add the data source and some suggested nodes, e.g., the Type node, to the streams. To work more efficiently, we hereby would like to introduce the concept of so-called template streams. Here, the necessary nodes for loading the dataset and defining the scale types for the variables are implemented. So, the users don’t have to repeat these steps in each exercise. Instead, they can focus on the most important steps and learn the new features of the IBM Modeler. The template streams can be extended easily by adding new nodes. Figure 1.3 shows a template stream. We should mention the difference between the template streams and the solution streams provided with this book. In the solution streams, all the necessary features and nodes are implemented, whereas in the template streams only the first steps, e.g., the definition of the data source, are incorporated. This is depicted in Fig. 1.4. So, the solution is not only a modification of the template stream. This can be seen by comparing Fig. 1.5 with the template stream in Fig. 1.3.
Fig. 1.3 Stream “TemplateStream_Car_Simple”
Fig. 1.4 Template-stream concept of the book
Fig. 1.5 Final Stream “car_clustering_simple”
A template-stream concept is used in this book. In the template streams, the datasets will be loaded, and the scale types of the variables are defined.
The template streams can be found in the sub-folder "Template_Streams", provided with this book.
Template streams access the data by using the "$BOOKDATA" shortcut defined in the registry. Otherwise, the folder in the Source nodes would need to be modified manually before running the stream.
The template-stream concept allows the user to concentrate on the main parts of the functionalities presented in the different sections. It also means that when completing the exercises, it is not necessary to deal with dataset handling before starting to answer the questions. Simply by adding specific nodes, the user can focus on the main tasks.
The details of the datasets and the meaning of the variables included are described in Sect. 12.1. Before the streams named above can be used, it is necessary to take into account the following aspects of data directories. The streams are created based on the concept presented in Sect. 1.2. As shown in Fig. 1.6, the Windows registry shortcut "$BOOKDATA" is used in the Source node. Before running a stream, the location of the data files should be verified and, if needed, adjusted. To do this, double-click on the Source node and modify the file path (see Fig. 1.6).
1.3 Introducing the Modeling Process
Before we dive into statistical procedures and models, we want to address some aspects relevant to the world of data from a statistical point of view. Data analytics are used in different ways, such as for (see Ville (2001, pp. 12–13)):
– Customer acquisition and targeting in marketing
– Reducing customer churn by identifying their expectations
– Loyalty management and cross-selling
– Enabling predictive maintenance
– Planning sales and operations
– Managing resources
– Reducing fraud
This broad range of applications lets us assume that there are an infinite number of opportunities for the collection of data and for creating different models. So, we have to focus on the main aspects that all these processes have in common. Here we want to first outline the characteristics of the data collection process and the main steps in data processing.
Fig. 1.6 Example of verifying the data directory in a Statistics node
The results of a statistical analysis should be correct and reliable, but this always depends on the quality of the data the analysis is based upon. In practice, we have to deal with effects that dramatically influence the volume of data, the quality, and the data analysis requirements. The following list, based on text from the IBM Website (2019a), gives an overview:
1. Scale of data: New automated discovery techniques allow the collection of huge datasets. There has been an increase in the number of devices that are able to generate data and send it to a central source.
2. Velocity of data: Due to increased performance in all business processes, data must be analyzed faster and faster. Managers and consumers expect results in minutes or seconds.
3. Variety of data: The time has long passed since data were collected in a structured form, delivered, and used more or less directly for data analysis purposes. Data are produced in different and often unstructured or less structured forms, such as comments in social networks, information on websites, or content on streaming platforms.
Fig. 1.7 Characteristics of datasets
4. Data in doubt: Consolidated data from different sources enable statisticians to draw a more accurate picture of the entities to analyze. The data volume increases dramatically, but improved IT performance allows the combination of many datasets and the use of a broader range of sophisticated models.
The source of data determines the quality of the research or analysis. Figure 1.7 shows a scheme to characterize datasets by source and size. If we are to describe a dataset, we have to use two terms, one from either side of the scheme. For instance, collected data that relate to consumer behavior, based on a survey, is typically a sample. If the data are collected by the researcher him-/herself, it is called primary data, because he/she is responsible for the quality.
Once the data are collected, the process of building a statistical model can start, as shown in Fig. 1.8. As a first step, the different data sources must be consolidated, using a characteristic that is unique for each object. Once the data can be combined in a table, the information must be verified and cleaned. This means removing duplicates and finding spelling mistakes or semantic errors. At the end of the cleaning process, the data are prepared for statistical analysis. Typically, further steps such as normalization or re-scaling are used, and outliers are detected. This is to meet the assumptions of the statistical methods for prediction or pattern identification. Unfortunately, a lot of methods have particular requirements that are hard to achieve, such as normally distributed values. Here the statistician has to know the consequences of deviating from the theory in practice. Otherwise, the "goodness of fit" measures or the "confidence intervals" determined, based on the assumptions, are often biased or questionable.
A lot of further details regarding challenges in the data analysis process could be mentioned here. Instead of stressing theoretical facts, however, we would like to dive into the data handling and model building process with the SPSS Modeler.
Fig. 1.8 Steps to create a statistical model
into the data handling and model building process with the SPSS Modeler. We recommend the following exercises to the reader.
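To make the consolidation and cleaning steps just described a bit more concrete, the following Python/pandas lines sketch the idea. They are only an illustration: the file names and column names (customers.csv, purchases.csv, customer_id, amount) are hypothetical and do not belong to the datasets used in this book.

    import pandas as pd

    # Hypothetical source files; in a real project these could be exports
    # from different operational systems.
    customers = pd.read_csv("customers.csv")   # one row per customer, key: customer_id
    purchases = pd.read_csv("purchases.csv")   # one row per purchase, key: customer_id

    # Step 1: consolidate the sources via a unique key
    data = purchases.merge(customers, on="customer_id", how="left")

    # Step 2: clean the combined table (duplicates, missing keys)
    data = data.drop_duplicates().dropna(subset=["customer_id"])

    # Step 3: prepare for the analysis, e.g., re-scale a numeric column and flag outliers
    z = (data["amount"] - data["amount"].mean()) / data["amount"].std()
    data["amount_scaled"] = z
    data["is_outlier"] = z.abs() > 3

    print(data.head())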
1.3.1 Exercises
Exercise 1: Data Measurement Units
If we wish to deal with data, we have to have a perception of data measurement units. The units for measuring data volume, from a bit up to a terabyte, are well known. As a recap, answer the following questions:
1. Write down all the data measurement units in the correct order and explain how to convert them from one to another.
2. After a terabyte of data come the units petabyte, exabyte, and zettabyte. How do they relate to the units previously mentioned?
3. Using the Internet, find examples that help you to imagine what each unit means in terms of daily life data volumes.
Exercise 2: Details and Examples for Primary and Secondary Data
In Fig. 1.7, we distinguished between primary and secondary data. Primary data can be described in short as data that are generated by the researchers themselves, so they are responsible for the data. Data that come from one’s own department or company are often also referred to as primary data. Secondary data are collected by others. The researcher (or more often the researcher’s firm) is unable to judge their quality in detail.
1. Name and briefly explain at least three advantages and three drawbacks of a personal interview, a mail survey, or an internet survey from a data collection perspective.
2. Name various possible sources of secondary data.
Exercise 3: Distinguishing Different Data Sources
The journal Economist (2014, p. 5), in its special report “Advertising and Technology”, describes how:
The advertising industry obtains its data in two ways. “First-party” data are collected by firms with which the user has a direct relationship. Advertisers and publishers can compile them by requiring users to register online. This enables the companies to recognize consumers across multiple devices and see what they read and buy on their site. “Third-party” data are gathered by thousands of specialist firms across the web [. . .] To gather information about users and help serve appropriate ads, sites often host a slew of third parties that observe who comes to the site and build up digital dossiers about them.
Using the classification scheme of data sources shown in Fig. 1.7, list the correct terms for describing the different data sources named here. Explain your decision, as well as the advantages and disadvantages of the different source categories.
Exercise 4: Data Preparation Process
In the article “How Companies Learn Your Secrets”, The New York Times Magazine (2012) describes the ability of data scientists to identify pregnant female customers. They do this so that stores, such as Target and Walmart, can generate significantly higher margins from selling their baby products. So, if the consumer realizes, through reading personalized advertisements, that these firms offer interesting products, the firm can strengthen its relationship with these consumers. The earlier a company can reach out to this target group, the more profit can be generated later. Here we want to discuss the general procedure for such a data analysis process. The hypothesis is that the consumer habits of women and their male friends change in the event of a pregnancy. By the way, the article also mentioned that, according to studies, consumers also change their habits when they marry: they become more likely to buy a new type of coffee. If they divorce, they start buying different brands of beer, and if they move into a new house, there is an increased probability they will buy a new kind of breakfast cereal. Assume you have access to the following data:
• Primary data
– Unique consumer ID generated by the firm
– Consumers’ credit card details
– Purchased items from the last 12 months, linked to credit card details or a customer loyalty card
– An internal firm registry, including the parents’ personal details collected in a customer loyalty program connected with the card
• Secondary data
– Demographic information (age, number of kids, marital status, etc.)
– Details, e.g., ingredients in lotions, beverages, or types of bread, bought from external data suppliers
Answer the following questions:
1. Describe the steps in a general data preparation procedure, to enable identification of changing consumer habits.
2. Explain how you can learn if and how purchase patterns of consumers change over time.
3. Now you have a consolidated dataset. Additionally, you know how pregnant consumers are more likely to behave. How can you then benefit from this information and generate more profit for the firm that hired you?
4. Identify the risks of trying to contact the identified consumers, e.g., by mail, even if you can generate more profit afterwards.
Exercise 5: Data Warehousing Versus Centralized Data Storage
Figure 1.9 shows the big picture of an IT system implemented within a firm (see also Abts and Mülder (2009, p. 68)). The company sells shoes from a website. The aim of analyzing this system is to find out whether concentrating data in a centralized database is appropriate in terms of security and data management.
Fig. 1.9 IT system with central database in a firm (Figure adapted from Abts and Mülder (2009, p. 68))
Answer the following questions:
1. Describe the main characteristics of the firm’s IT system in your own words.
2. List some advantages of data consolidation realized by the firm.
3. Discussing the current status, describe the drawbacks and risks that the firm faces from centralizing the data.
4. Summarize your findings and suggest how to implement data warehousing within a firm’s IT landscape.
1.3.2 Solutions
Exercise 1: Data Measurement Units
Theory discussed in section: Section 1.1
The details of a possible solution can be found in Table 1.4. This table is based on Economist (2010).
Exercise 2: Details and Examples of Primary and Secondary Data
1. Table 1.5 shows possible answers. Interested readers are referred to Lavrakas (2008).
2. Related to Fig. 1.7, we name here different sources of secondary data, based on Weiers et al. (2011, pp. 112–113).
Table 1.4 Data measurement units and their interpretation (Unit – Size – Example)
Bit – two options, 0 or 1 – abbreviation of “binary digit”; the smallest unit for storing data
Byte (B) – 8 bits – can create a simple alphabet with up to 256 characters or symbols
Kilobyte (KB) – 2^10 = 1024 bytes – one page of text equals about 2 KB
Megabyte (MB) – 2^10 KB = 2^20 bytes – a pop song is about 4–5 MB; Shakespeare’s works need about 5 MB
Gigabyte (GB) – 2^10 MB = 2^20 KB = 2^30 bytes – a 2-h movie is about 1–2 GB on the hard disk of a laptop
Terabyte (TB) – 2^10 GB = 2^20 MB, etc. – all the books in America’s Library of Congress are about 15 TB
Petabyte (PB) – 2^10 TB = 2^20 GB, etc. – all letters sent in the United States per year equal about 5 PB; Google processes about 5 PB per hour
Exabyte (EB) – 2^10 PB = 2^20 TB, etc. – 50,000 years of high-quality video
Zettabyte (ZB) – 2^10 EB = 2^20 PB, etc. – all the information coded in the world in 2010 equals about 1.2 ZB
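The conversion factors in Table 1.4 can easily be checked with a few lines of code. The following Python sketch is only an illustration (it is not part of the original solution); it expresses a byte count in the largest suitable binary unit.

    def human_readable(n_bytes):
        """Express a byte count in the largest suitable binary unit (factor 1024)."""
        units = ["Bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
        value = float(n_bytes)
        for unit in units:
            if value < 1024 or unit == units[-1]:
                return "{:.2f} {}".format(value, unit)
            value /= 1024

    # The Library of Congress example: 15 TB expressed in bytes
    print(human_readable(15 * 2**40))   # -> 15.00 TB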
Table 1.5 Comparison of primary data collection methods
Personal interview
– Advantages: respondents often cooperate because of the influence of the interviewer; open-ended questions can be asked
– Drawbacks: not anonymous; high time pressure; respondents are influenced by the interviewer’s appearance and behavior; depending on the cultural background, several topics should not be discussed; relatively expensive in comparison to other methods
Mail survey
– Advantages: respondents feel more anonymous; nearly all topics can be addressed; the distance between the firm and the respondents is not essential
– Drawbacks: coding of answers is necessary; the number of questions must be limited because of the often lower motivation to answer; postal charges; only a limited number of open-ended questions can be asked
Internet survey
– Advantages: low costs per respondent; answers are coded by the respondents; usage of visualizations is possible
– Drawbacks: internet access is necessary for respondents/firewalls in firms must be considered; respondents are annoyed by the overwhelming number of internet surveys and are therefore less motivated to answer; professional design of the survey requires experience; the internet does not ensure confidentiality
Examples of official statistics sources:
– Census data, e.g., Census of Population or Economic Censuses
http://www.census.gov/
http://ec.europa.eu/eurostat/data/database
– Central banks, e.g., Federal Reserve and European Central Bank
http://www.federalreserve.gov/econresdata/default.htm
http://sdw.ecb.europa.eu/
– Federal Government and Statistical Agencies
https://nces.ed.gov/partners/fedstat.asp
Examples of nonofficial statistics sources:
– Firms and other commercial suppliers, e.g., Bloomberg, Reuters
– NGOs
– University of Michigan Consumer Sentiment Index
Table 1.6 Advantages and disadvantages of data sources
First party (= primary data in Fig. 1.7): data collected by the same researcher, department, or the firm itself. The firm is responsible for the quality.
– Advantages: the researcher can influence the quality of the data and the sample size
– Drawbacks: takes time and needs own resources to collect the data, e.g., through conducting a survey
Third party (= secondary data in Fig. 1.7): data collected by other firms, departments, or researchers. The firm itself cannot influence the quality of the data.
– Advantages: often easy and faster to get
– Drawbacks: probably more expensive than primary data; explanation of the variables is less precise; answering the original research question may be more difficult, because the third party had another focus/reason for collecting the data; quality of the data depends on the know-how of the third party
Exercise 3: Distinguishing Different Data Sources
Theory discussed in section: Sect. 1.3
Table 1.6 shows the solution. More details can be found, e.g., in Ghauri and Grønhaug (2010).
Exercise 4: Data Preparation Process
1. It is necessary to clean up and consolidate the data from the different sources. The consumer ID is the primary key of the resulting database table, and the other information must be merged onto this key. In the end, e.g., demographic information, including the pregnancy status calculated on the basis of the baby registry, is linked to consumer habits in terms of purchased products. The key to consolidating the purchased items of a consumer is the credit card or the customer loyalty card details. As long as the consumer shows one of these cards and doesn’t pay cash, the purchase history becomes more complete with each transaction.
2. Assuming we have clean personal consumer data linked to the products bought, we can now determine the relative frequency of purchases over time. Moreover, the product mix itself may change, and then the percentage of specific ingredients in each of the products may be more relevant. According to The New York Times Magazine (2012) article, pregnant consumers buy more unscented lotion at the beginning of their second trimester, and sometime in the first 20 weeks they buy more scent-free soap, extra-big bags of cotton balls, and supplements such as calcium, magnesium, and zinc.
3. We have to analyze and check the purchased items per consumer. If the pattern changes are similar to the characteristics we found from the data analysis for pregnancy, we can reach out to the consumer, e.g., by sending personalized advertisements to the woman or her family.
4. There is another interesting aspect to the data analytics process: analyzing the business risk is perhaps even more important than the chance to do the correct analysis at the correct point in time. There are at least some risks: How do consumers react when they realize a firm can determine their pregnancy status? Is contacting the consumer by mail a good idea? One interesting issue the firm Target had to deal with was the complaints received from fathers who didn’t even know that their daughters were pregnant; they discovered it only upon receipt of the personalized mail promotion. So, it is necessary to determine the risk to the business before implementing, e.g., a pregnancy-prediction model.
Exercise 5: Data Warehousing Versus Centralized Data Storage
1. All data is stored in one central database. External and internal data are consolidated here. The incoming orders from the website, as well as the management-relevant information, are saved in the database. A database management system allows for restriction of user access. By accessing and transforming the database content, complex management reports can be created.
2. If data is stored in a central database, then redundant information can be reduced. Also, each user can see the same data at a single point in time. Additionally, the database can be managed centrally, and the IT staff can focus on keeping this one system up and running. Furthermore, the backup procedure is simpler and cheaper.
3. There are several risks, such as:
– Within the database, different types of information are stored, such as transactional data and managerial and strategic information. Even if databases allow only restricted access to information and documents, there remains the risk that these restrictions will be circumvented, e.g., by hacking a user account and getting access to data that should be confidential.
– Running transactions to process the incoming orders is part of the daily business of a firm. These transactions are not complex, and the system normally won’t break down, as long as the data processing capacities are in line with the number of incoming web transactions. Generating management-related data is more complex, however. If an application or a researcher starts a performance-consuming transaction, the database service can collapse. If this happens, then the orders from the consumer website won’t be processed either. In fact, the entire IT system will shut down.
4. Overall, however, consolidating data within a firm is a good idea. In particular, providing consistent data improves the quality of managerial decisions. Avoiding the risk of a complete database shutdown is very important, however. That is why data relevant to managerial decision-making, or data that are the basis of data analysis procedures, should be copied to a separate data warehouse. There the information can be accumulated day by day. A time lag of 24 h, for example, between the data that are relevant to the operational process and the data analysis itself is not critical in most firms. An exception would be firms such as Dell that sell products with a higher price elasticity of demand.
References
Abts, D., & Mülder, W. (2009). Grundkurs Wirtschaftsinformatik: Eine kompakte und praxisorientierte Einführung (6., überarb. und erw. Aufl.). Wiesbaden: Vieweg + Teubner.
Davenport, T., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century. Harvard Business Review, 90(10), 70–76.
de Ville, B. (2001). Microsoft data mining: Integrated business intelligence for e-commerce and knowledge management. Boston: Digital Press.
Economist. (2010). All too much. Retrieved March 28, 2019, from http://www.economist.com/node/15557421
Economist. (2014). Data—Getting to know. Economist 2014, pp. 5–6.
Ghauri, P. N., & Grønhaug, K. (2010). Research methods in business studies (4th ed.). Harlow: Financial Times Prentice Hall.
Herkenhoff, L., & Fogli, J. (2013). Applied statistics for business and management using Microsoft Excel. New York: Springer.
IBM Website. (2015). Why does $CLEO_DEMOS/DRUG1n find the file when there is no $CLEO_DEMOS directory? Retrieved June 22, 2015, from http://www-01.ibm.com/support/docview.wss?uid=swg21478922
IBM Website. (2019a). Big data in the cloud. Retrieved June 28, 2019, from http://www.ibm.com/developerworks/library/bd-bigdatacloud/
IBM Website. (2019b). SPSS Modeler edition comparison. Retrieved March 28, 2019, from https://www.ibm.com/uk-en/products/spss-modeler/pricing
Lavrakas, P. (2008). Encyclopedia of survey research methods—Volume 1. London: Sage.
The New York Times Magazine. (2012). How companies learn your secrets. Retrieved June 26, 2015, from http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=0
Walmart. (2012). Company thanks and rewards associates for serving millions of customers. Retrieved August 1, 2015, from http://news.walmart.com/news-archive/2012/11/23/walmart-usreports-best-ever-black-friday-events
Weiers, R. M., Gray, J. B., & Peters, L. H. (2011). Introduction to business statistics (7th ed.). Mason, OH: South-Western Cengage Learning.
2 Basic Functions of the SPSS Modeler
After finishing this chapter, the reader is able to:
1. Define and run streams.
2. Explain the necessity of value labels and know how to implement them in the SPSS Modeler.
3. Use data handling methods, e.g., filtering, to extract specific information needed to answer research questions.
4. Explain in detail the usage of methods to handle datasets, such as sampling and merging.
5. Explain why splitting datasets into different partitions is required to create and assess statistical models.
The successful reader will thus gain proficiency in using a computer and the IBM SPSS Modeler to prepare data and handle even more complex datasets.
2.1 Defining Streams and Scrolling Through a Dataset
Description of the model
Stream name: Read SAV file data.str
Based on dataset: tree_credit.sav
Related exercises: 1, 2
Fig. 2.1 Saving a stream
Theoretical Background
The first step in an analytical process is getting access to the data. We assume that we have access to the data files provided with this book and that we know the folder where the data files are stored. Here we would like to give a short outline of how to import the data into a stream. Later, in Sect. 3.1.1, we will learn how to distinguish between discrete and continuous variables. In Sect. 3.1.2, we will refer to the procedure for determining the correct scale type in more detail.
Getting Access to a Data Source
The dataset “tree_credit.sav” should be used here as an example. The file extension “sav” shows us that it is an SPSS Statistics file. Now we will describe how to add such a data source to the stream and define value labels. Afterward we will add a Table node, which enables us to scroll through the records.
1. We open a new and empty stream by using the shortcut “Ctrl+N” or the toolbar item “File/New”.
2. Now we save the file to an appropriate directory. To do this, we use “File/Save Stream”, as shown in Fig. 2.1.
3. The source file used here is an SPSS data file with the extension “sav”. So, we add a “Statistics File” node from the Modeler tab “Sources”. We then double-click on the node to open the settings dialog window. We define the folder, as shown in Fig. 2.2. Here the prefix “$BOOKDATA/” represents the placeholder for the data directory. We have already explained that option in Sect. 1.2.3. If we choose not to add such a parameter to the Windows registry, this part of the file path should be replaced with the correct Windows path, e.g., “C:\DATA\” or a similar path. The filename is “tree_credit.sav” in every case, and we can confirm the parameters with the “OK” button.
Fig. 2.2 Parameters of the Statistics File node
Fig. 2.3 Unconnected nodes in a stream
4. Until now, the stream consists only of a Statistics File node. In the next step, we add a Type node from the Modeler toolbar tab “Field Ops”, by dragging and dropping it into the stream. Using this method, the nodes are unconnected (see Fig. 2.3). 5. Next, we have to connect the Statistics File node with the Type node. To do this, we left-click on the Statistics file node once and press F2. Then left-click on the Type node. The result is shown in Fig. 2.4. There is also another method for connecting two nodes: In step four, we added a Statistics File node to the stream. Normally, the next node has to be connected to this Statistics File node. For this reason, we would left-click once on the current node in the stream, e.g., the Statistics File node. This would mark the node. If we now left-click twice on the new Type node in “Field Ops”, the Type node will be added and automatically connected to the Statistics File node.
Fig. 2.4 Two connected nodes in a stream
Fig. 2.5 Details of a Type node
6. So far, we have added the data to the stream and connected the Statistics File node to the Type node. If we want to see and possibly modify the settings, we double-click on the Type node. As shown in Fig. 2.5, the variable names appear, along with their scale types and their roles in the stream. We will discuss the details of the different scales in Sects. 3.1.1 and 3.1.2. For now we can summarize that a Type node can be used to define the scale of measurement of each variable, as well as its role, e.g., as an input or target variable. If we use nodes other than a Statistics File node, the scale types are not predefined and must be determined by the user.
7. To have the chance to scroll through the records, we should also add a Table node. We can find it in the Modeler toolbar tab “Output” and add it to the stream. Finally, we connect the Type node to the Table node, as outlined above. Figure 2.6 shows the three nodes in the stream.
Fig. 2.6 Final stream “Read SAV file data”
"
To extend a stream with a new node, we can use two different procedures: 1. We add the new node. We connect it with an existing node afterward as follows: We click the existing node once. We press F2 and then we click the new node. Both nodes are now connected in the correct order. 2. Normally, we know the nodes that have to be connected. So before we add the new node, we click on the existing node, to activate it. Then we double-click the new node. Both nodes are now automatically connected.
"
A Type node must be included right after a Source node. Here the scale of measurement can be determined, as well as the role of a variable.
"
We have to keep in mind that it is not possible to show the results of two sub-streams in one Table node. To do, we would have to use a merge or append operation to consolidate the results. See Sects. 2.7.9 and 2.7.10.
8. If we now double-click on the Table node, the dialog window in Fig. 2.7 appears. To show the table with the records, we click on “Run”.
9. Now we can scroll through the records, as shown in Fig. 2.8. After that, we can close the window with “OK”.
Often it is unnecessary to modify the settings of a node, e.g., in a Table node, when we can right-click on the node in the stream and select “Run” (see Fig. 2.9).
Recommended Nodes in a Stream
As outlined in Table 2.1, we must use a Source node to open the file with the given information. After the Source node, we should add a Type node to define the scale of measurement for each variable. Finally, we recommend adding a Table node to each stream, which gives us the chance to scroll through the data. This helps us inspect the data and become familiar with the different variables. Figure 2.6 shows the final stream we created for loading the dataset, given in the form of the statistics file “tree_credit.sav”. Here we added a Table node right after the Type node. Sometimes we also connect the Table node directly to the Source node. All these details need to be arranged in a new stream. To save time, we prepared template streams, as explained in Sect. 1.2.3.
Fig. 2.7 Dialog window in the Table node
Fig. 2.8 Records shown with a Table node
2.2 Switching Between Different Streams
Description of the model
Stream names: Filter_processor_dataset and Filter_processor_dataset modified
Based on dataset: benchmark.xlsx
Fig. 2.9 Options that appear by right-clicking on a node
Table 2.1 Important nodes and their functionalities
Source node (e.g., Statistics File node): Loads the data into the SPSS Modeler. The type of Source node depends on the type of file in which the dataset is saved:
– Variable File node: text files with values separated by delimiters, e.g., commas (CSV).
– Fixed File node: text files with values saved in columns, but no delimiters used between the values.
– Excel File node: Microsoft Excel files in “xlsx” format (not “xls”).
– Statistics File node: opens SPSS Statistics files.
Type node: Can be used to
– determine the scale of measurement of the variables,
– define the role of each variable, e.g., as input or target variable,
– define value labels.
Table node: Shows the values/information included in the data file opened by a Source node.
The aim of this section is to show how to handle different streams at the same time. We will explain shortly the functions of both streams used in Sect. 2.7.5. 1. We start with a clean SPSS Modeler environment, so we make sure each stream is closed. We also can close the Modeler and open it again. 2. We open the stream “Filter_processor_dataset”. In this stream several records from a dataset will be selected. 3. We also open the stream “Filter_processor_dataset modified”.
Fig. 2.10 Open streams in the Modeler sidebar
Fig. 2.11 Advanced options in the Modelers sidebar
4. At this time, two streams are open in the Modeler. We can see that in the Modeler Managers sidebar on the right top corner. Here we can find all the open streams. By clicking on the stream name, we can switch between the streams (Fig. 2.10). 5. Additionally, it is possible to execute other commands, e.g., open a new stream or close a stream here. To do so we click with the right mouse button inside the “Streams” section of the sidebar (see Fig. 2.11). 6. Switching between the streams sometimes helps us to find out which different nodes are being used. As depicted in Fig. 2.12 in the stream “Filter_processor_dataset”, the Filter node and the Table node in the middle are being added, in comparison with stream “Filter_processor_dataset modified”. We will explain the functions of both streams in Sect. 2.7.5. "
Switching between streams is helpful for finding the different nodes used or for showing some important information in data analysis. If the streams are open, then in the Modeler sidebar on the right we can find them and can switch to another stream by clicking the name once.
Fig. 2.12 Stream “Filter_processor_dataset”
"
Using the right mouse button in the modelers sidebar on the right, a new dialog appears that allows us, for example, to close the active stream.
"
If there are SuperNodes, they would also appear in the Modeler’s sidebar. For details see Sect. 3.2.5.
2.3 Defining or Modifying Value Labels
Description of the model
Stream name: modify_value_labels
Based on dataset: tree_credit.sav
Related exercises: 3
Fig. 2.13 Stream for reading the data and modifying the value labels
Fig. 2.14 Parameters of the Type node
Theoretical Background Value labels are useful descriptions. They allow convenient interpretation of the values. By defining a good label, we can later determine the axis annotations in diagrams too. Working with Value Labels 1. Here we want to use the stream we discussed in the previous section. To extend this stream, let’s open “Read SAV file data.str” (Fig. 2.13). 2. Save the stream using another name. 3. To open the settings dialog window shown in Fig. 2.14, we double-click on the Type node. In the second column of the settings in this node, we can see that the Modeler automatically inspects the source file and tries to determine the correct scale of measurement. In Sects. 3.1.1 and 3.1.2, we will show how to determine
Fig. 2.15 Step 1 of how to specify a value label in a Type node
the correct scale of measurement and how to modify the settings. Here we accept these parameters.
4. In this example we will focus on the variable “Credit_cards”. First click the button “Read Values” (see Fig. 2.15).
5. Let’s discuss the so-called “value labels”. If the number of different values is restricted, it makes sense to give certain values a name or a label. In column 3 of Fig. 2.14, we can see that at least two different values, 1.0 and 2.0, are possible. If we click on the Values field that is marked with an arrow in Fig. 2.14, a dropdown list appears. To show and to modify the value labels, let’s click on “Specify”, as shown in Fig. 2.15. Figure 2.16 shows that in the SPSS data file, two labels are predefined: 1 = “Less than 5” and 2 = “5 or more” credit cards. Normally, these labels are passed through the Type node without any modification (see arrow in Fig. 2.16).
In a Type node, value labels can be defined or modified in the “Values” column. First we click the button “Read Values”. Then we choose the option “Specify. . .” in the column “Values” and we can define our own labels.
6. If we wish to define another label, we set the option to “Specify values and labels”. We can now modify our own new label. In this stream, we use the label
Fig. 2.16 Step 2 of how to specify a value label in a Type node
“1, 2, 3, or 4” instead of “Less than 5”. Figure 2.17 shows the final step for defining a new label for the value 1.0 within the variable “Credit_cards”.
7. We can confirm the new label with “OK”. The entry in the column “Values” shows us that we successfully modified the value labels (Fig. 2.18). Usually, the Modeler will determine the correct scale of measurement of the variables. However, an additional Type node gives us the chance to actually change the scale of measurement for each variable. We therefore suggest adding this node to each stream. The dataset used here includes a definition of the scale of measurement, but this definition is incorrect. So, we adjust the settings in the “Measurement” column, as shown in Fig. 2.19. It is especially important to check variables that should be analyzed with a Distribution node: they have to be defined as discrete! This is the case with the “Credit_cards” variable.
8. We can close the dialog window without any other modification and click on “OK”.
9. If we want to scroll through the records and wish to have an overview of the new value labels, we should right-click on the Table node. Then we can use “Run” to show the records. Figure 2.20 shows the result.
Fig. 2.17 Step 3 of how to specify a value label in a Type node
10. To show the value labels defined in step 6, we use the button in the middle of the toolbar of the dialog window. It is marked with an arrow in Fig. 2.20. 11. Now we can see the value labels in Fig. 2.21. We can close the window with “OK”.
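Value labels are simply a mapping from stored codes to readable text. As a side note, the same idea can be reproduced outside the Modeler with a small dictionary; the following Python/pandas sketch is only an illustration and uses a few toy rows standing in for the “Credit_cards” variable.

    import pandas as pd

    # Toy data standing in for the "Credit_cards" variable of tree_credit.sav
    df = pd.DataFrame({"Credit_cards": [1, 2, 2, 1, 2]})

    # Value labels as defined in the Type node, including the modified label for code 1
    labels = {1: "1, 2, 3, or 4", 2: "5 or more"}
    df["Credit_cards_label"] = df["Credit_cards"].map(labels)

    print(df)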
2.4 Adding Comments to a Stream
Description of the model
Stream name: tree_credit_commented.str
Based on dataset: tree_credit.sav
Related exercises: 3
Good and self-explanatory documentation of a stream can help reduce the time needed to understand the stream or at least to identify the nodes that have to be modified to find the correct information. We want to start, therefore, with a simple example of how to add a comment to a template stream.
Fig. 2.18 Modified value labels in the Type node
Fig. 2.19 Parameters of a Type node
Fig. 2.20 Records without value labels in a Table node
Fig. 2.21 Records with value labels in a Table node
1. Open the template stream “Template-Stream tree_credit”. 2. To avoid any changes to the original template stream, we save the stream using another name, e.g., “tree_credit_commented.str”. 3. Now, we comment on the two nodes in the stream by assigning a box with the comment. To add a comment, without assigning it to a specific node, we have to make sure that no node is active. Therefore, we click on the background of the stream and not on a specific node. This deactivates any nodes that could be active. Now we click on the icon “Insert new comment” on the right-hand side of the Modelers toolbar, shown in Fig. 2.22.
Fig. 2.22 Toolbar icon “Add new comment”
Fig. 2.23 A comment is added to the stream
Fig. 2.24 The comment is moved and resized into the background of two nodes
4. A new empty comment appears and we can define the text or the description, e.g., “template stream nodes”. Figure 2.23 shows the result. 5. Now we move and resize the comment so that it is in the background of both nodes (see Fig. 2.24). 6. If we want to assign a comment to a node, we have to first select the node with the mouse by clicking it once. We will probably have to use the Shift key to mark more than one node. 7. Now, we can add a comment by using the icon in the toolbar (see Fig. 2.22). Alternatively, we can right-click and choose “New Comment . . .” (see Fig. 2.25).
Fig. 2.25 Context dialog following a right-click
Fig. 2.26 A comment is assigned to a specific node
8. We define the comment, e.g., as shown in Fig. 2.26, with “source node to load dataset”. If there is no additional comment in the background, a dotted line appears to connect the comment to the assigned node.
Commenting on a stream helps us to work more efficiently and to describe the functionality of several nodes. There are two types of comments:
1. Comments assigned to one or more nodes. We assign the comments to the nodes by activating the nodes before adding the comment itself.
2. Comments not associated with certain nodes. Before adding such a comment, we simply make sure that no node is active and then click the “Add comment” symbol.
"
Comments can be added by using the “Add new comment . . .” toolbar icon or by right-clicking the mouse.
"
Due to print-related restrictions with this book, we did not use comments in the streams. Nevertheless, we strongly recommend this procedure.
2.5 Exercises
Exercise 1: Fundamental Stream Details 1. Name the nodes that should, at a minimum, be part of a stream. Explain their role in the stream. 2. Explain the different functionalities of a Type node in a stream. 3. In Fig. 2.6 we connected the Table node to the Type node. (a) Open the stream “Read SAV file data.str”. (b) Remove the connection of this node with a right-click on it. (c) Then connect the Source node directly with the Table node. Exercise 2: Using Excel Data in a Stream In this exercise, we would like you to become familiar with the Excel File node. Often we download data in Excel format from websites, or we have to deal with secondary data that we’ve gotten from other departments. Unfortunately, the SPSS Modeler cannot deal with the Excel 2003 file format or previous versions. So we have to make sure to get the data in the 2010 file format at least or convert it. For this, we have to use Excel itself. Figure 2.27 shows some official labor market statistical records from the UK Office for National Statistics and its “Nomis” website. This Excel file has some specific features: – There are two worksheets in the workbook; the “data_modified” worksheet includes the data we would like to import here. We can ignore the second spreadsheet “Data”. – The first row includes a description of the columns below. We should use these names as our variable names. – The end of the table has no additional information. The SPSS Modeler’s import procedure can stop at the last row that contains values. Please create a stream, as described in the following steps: 1. Open a new stream. 2. Add an Excel node. Modify its parameters so that the data described can be analyzed in the stream. 3. Add a Table node to scroll through the records.
Fig. 2.27 Part of the dataset “england_payment_fulltime_2014_reduced.xls”
Exercise 3: Defining Value Labels and Adding Comments to a Stream
In this exercise, we want to show how to define value labels in a stream, so that the numbers are more self-explanatory.
1. Open the stream “Template-Stream_Customer_Bank.str”. Save the stream under another name.
2. Within the dataset exists a “DEFAULTED” variable. Define labels so that 0 equals “no”, 1 equals “yes”, and $null$ equals “N/A” (not available).
3. Assign a new comment to the Type node, showing the text “modifies value labels”.
4. Verify the definition of the value labels using a Table node.
2.6 Solutions
Exercise 1: Fundamental Stream Details
Name of the solution stream: Read SAV file data modified.str
Theory discussed in section: Section 2.1
1. The recommended nodes that should, at a minimum, be included in each stream can be found with a description in Table 2.1. 2. See also Table 2.1 for the different roles of a Type node in a stream. 3. The modified stream can be found in “Read SAV file data modified.str”.
Fig. 2.28 Stream “Read SAV file data modified.str”
To remove the existing connection between a Type node and a Table node, we right-click on the connection and use “Delete Connection”. Then with a left-click, we activate the Statistics File node. We press the F2 key and finally click on the Table node. Now the Source node and the Table node should be connected as shown in Fig. 2.28.
Exercise 2: Using Excel Data in a Stream
Name of the solution stream: using_excel_data.str
Theory discussed in section: Section 2.1
1. The final stream can be found in “using_excel_data.str”. Figure 2.29 shows the stream and its two nodes.
2. To get access to the Excel data, the parameters of the Excel node should be modified, as shown in Fig. 2.30.
3. The path to the file can be different, depending on the folder where the file is stored. Here we used the placeholder “$BOOKDATA” (see Sect. 1.2.3).
4. In particular, we would like to draw attention to the options “Choose worksheet” and “On blank rows”.
The Modeler does not always import calculations included in an Excel worksheet correctly. Therefore, the new values should be inspected, with a Table node for example. If NULL values occur in the Excel worksheet, all cells should be marked, copied, and pasted using the function “Paste/Values Only” (see Fig. 2.31).
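Readers who want to double-check such an Excel import outside the Modeler can read the same worksheet with pandas. The following sketch is only an illustration; it assumes the workbook has been saved in the newer xlsx format and that the openpyxl package is installed.

    import pandas as pd

    # Read only the relevant worksheet; the first row is used for the column names.
    df = pd.read_excel(
        "england_payment_fulltime_2014_reduced.xlsx",
        sheet_name="data_modified",
    )

    # Drop completely empty rows and check that calculated columns arrived as numbers
    df = df.dropna(how="all")
    print(df.head())
    print(df.dtypes)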
Fig. 2.29 Nodes in the stream “using_excel_data”
Fig. 2.30 Parameters of the Excel node
Fig. 2.31 Paste only values in Excel
Exercise 3: Defining Value Labels and Adding Comments to a Stream
Name of the solution stream: Stream_Customer_Bank_modified.str
Theory discussed in section: Section 2.3
In this exercise, we want to show how to define value labels in a stream, so that the numbers are more self-explanatory. 1. The solution stream can be found in “Stream_Customer_Bank_modified.str”. 2. To define the new value labels, we open the Type node with a double-click. Then we use “specify” in the second column for the “DEFAULTED” variable (see Fig. 2.32). First the option “Specify values and labels” must be activated. Then the recommended value labels can be defined, as shown in Fig. 2.33.
Fig. 2.32 User-defined value labels are specified in a Type node (step 1)
3. To assign a comment to the Type node, we activate the Type node with a single mouse click. Then we use the toolbar icon “Add new comment”, as shown in Fig. 2.22, and add the text. Figure 2.34 shows the result. 4. To verify the definition of the value labels, the existing Table node can’t be used. That’s because it is connected to the Source node. The Type node only changes the labels later anyway. So it is better to add a new Table node and connect it to the Type node (see Fig. 2.35). To verify the new labels, we double-click the second Table node and click “Run”. To show the value labels, we press the button “Display field and value labels” in the middle of the toolbar. The last column in Fig. 2.36 shows the new labels.
Fig. 2.33 User-defined value labels are specified in a Type node (step 2)
Fig. 2.34 Comment is assigned to a Type node
Fig. 2.35 Final stream “Stream_Customer_Bank_modified.str”
Fig. 2.36 Dataset “customer_bank_data” with value labels
2.7 Data Handling Methods
2.7.1 Theory
Data mining procedures can be successfully applied only to well-prepared datasets. In Chap. 3, we will discuss methods to analyze variables one by one. These procedures can be used, e.g., to calculate measures of central tendency and volatility, as well as to combine both types of measures to identify outliers. In the first part of the chapter “Multivariate Statistics”, we then discuss methods that look at the dependency of two variables, e.g., to analyze their (linear) correlation. Before that, however, we want to show how to calculate new variables, as well as how to deal with variables that represent text. Furthermore, in this section we will discuss methods that help us to generate specific subsets of the original dataset. By doing so, more methods, e.g., cluster analysis, can be applied. This is necessary because of the complexity of these multivariate techniques. If we can separate representative
Fig. 2.37 Data handling topics discussed in the following section
subsets, then these methods become applicable despite any limitations of time or hardware restrictions. Figure 2.37 outlines the big picture for methods discussed in the following sections. The relevant section has been listed alongside each method. We will use various datasets to discuss the different procedures. To have the chance to focus on particular sets and use different methods to deal with the values or records included, we have reordered the methods applied.
2.7.2 Calculations
Description of the model
Stream name: simple_calculations
Based on dataset: IT_user_satisfaction.sav
Related exercises: 2, 3, 4, 5 and exercise “Creating a flag variable” in Sect. 3.2
Table 2.2 Questionnaire items related to the training days
training_days_actual – How many working days per year do you actually spend on IT-relevant training? Answer options: 0 = none, 1 = 1 day, 2 = 2 days, 3 = more than 2 days
training_days_to_add – How many working days should be available per year for training and further education, in addition to the above-mentioned existing time budget for training during working hours? Answer options: 0 = none, 1 = 1 day, 2 = 2 days, 3 = more than 2 days
Theory
Normally, we want to analyze all the variables that are included in our datasets. Often, however, it is not enough to calculate measures of central tendency or to determine the frequency of several values. In addition, we also have to calculate other measures, for example, performance indicators. We will explain here how to do that using the Modeler.
Simple Calculations Using a Derive Node
In the dataset “IT_user_satisfaction”, we find two questions regarding the actual and the additional training days the users in a firm get or expect. Table 2.2 displays the actual questions and their coding. If we now want to become familiar with the total number of training days the IT users expect, we can simply add the values or codes of both variables. The only uncertainty is that code “3” represents “more than 2 days”, so a user could actually expect 4 or even 5 days. Hopefully, the probabilities for these options are relatively small, so the workaround of calculating the sum should suffice.
1. We open the “Template-Stream IT_user_satisfaction” to get access to the dataset “IT_user_satisfaction”.
2. Behind the Type node we add a Derive node from the Field Ops tab and connect both nodes (see Fig. 2.38).
3. Now we want to name the new variable, as well as define the formula to calculate it. We double-click on the Derive node. In the dialog window that opens, we can choose the name of the new variable in “Derive field”. We use here “training_expected_total_1”. Additionally, we set the type to “Ordinal”.
With a Derive node, new variables can be calculated. We suggest using self-explanatory names for those variables. Short names may be easier to handle in a stream, but often it is hard to figure out what they stand for.
After a Derive node, a Type node should always be implemented to assign the correct scale to the new variable.
4. To show the results, we add a Table node behind the Derive node. We connect both nodes and run the Table node (see Fig. 2.39).
Fig. 2.38 Derive node is added to the initial stream
Fig. 2.39 Distribution node to show the results of the calculation
The last column of Fig. 2.40 shows the result of the calculation. 5. To interpret the results more easily we can use a frequency distribution, so we add a Distribution node behind the Derive node. We select the new variable “training_expected_total_1” in the Distribution node to show the results. Figure 2.39 depicts the actual status of the stream. As we can see in Fig. 2.41, more than 30% of the users expect to have more than 3 days sponsored by the firm to become familiar with the IT system. 6. Now we want to explain another option for calculating the result, because the formula “training_days_actual+training_days_to_add” used in the first Derive node shown in Fig. 2.42 can be substituted with a more sophisticated version.
Fig. 2.40 Calculated values in the Table node
Fig. 2.41 Distribution of the total training days expected by the user
Using the predefined function “sum_n” is simpler in this case, and we can also learn how to deal with a list of variables. The new variable name is “training_expected_total_2”. Figure 2.43 shows the formula. Also here we define the type as “Ordinal”. 7. To define the formula, we double-click on the Derive node. A dialog window appears, as shown in Fig. 2.43. With the little calculator symbol on the left-hand side (marked in Fig. 2.43), we can start using the “expression builder”. It helps us to select the function and to understand the parameter each function expects. Figure 2.44 shows the expression builder. In the middle of the window, we can select the type of function we want to use. Here, we choose the category
Fig. 2.42 Parameters of a Derive node
Fig. 2.43 Derive node with the expression builder symbol
Fig. 2.44 Expression builder used in a Derive node
“numeric”. In the list below we select the function “sum_n”, so that we can find out the parameters this function expects. An explanatory note below the table tells us that “sum_n(list)” expects a list as its argument. The most important detail is the square brackets [ ] used to create the list of variables. The final formula used here is:
sum_n([training_days_actual,training_days_to_add])
The expression builder in the Modeler can be used to define formulas. It offers a wide range of predefined functions.
"
In functions that expect lists of variables, we must use brackets [].
"
An example is “sum_n([var1,var2])” to calculate the sum of both variables.
8. To show the result, we add a new Table node behind the second Derive node. Figure 2.45 shows the final stream. The results of the calculation are the same, as shown in Fig. 2.40.
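Both Derive node formulas compute a simple row-wise sum. As a cross-check outside the Modeler, the same calculation could look as follows in Python/pandas; the few example rows are made up, but the column names follow Table 2.2.

    import pandas as pd

    # Toy rows standing in for the dataset IT_user_satisfaction.sav
    df = pd.DataFrame({
        "training_days_actual": [0, 1, 2, 3],
        "training_days_to_add": [2, 3, 1, 0],
    })

    # Equivalent of training_days_actual + training_days_to_add
    df["training_expected_total_1"] = df["training_days_actual"] + df["training_days_to_add"]

    # Equivalent of sum_n([training_days_actual, training_days_to_add])
    df["training_expected_total_2"] = df[["training_days_actual", "training_days_to_add"]].sum(axis=1)

    print(df)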
Fig. 2.45 Final stream “simple_calculations”
We want to add another important remark here: we defined two new variables, “training_expected_total_1” and “training_expected_total_2”, in the Derive nodes. Both have unique names. Even so, they cannot be shown in the same Table node, although connecting one Table node with both Derive nodes might seem to make sense here. We should divide a stream if we want to calculate different measures or analyze different aspects, but it is inadvisable to join both parts together again.
When defining sub-streams for different parts of an analysis, we have to keep in mind that it is not possible (or error prone and inadvisable) to show the results of two sub-streams in one Table node.
2.7.3 String Functions
Description of the model
Stream name: string_functions
Based on dataset: england_payment_fulltime_2014_reduced.xls
Related exercises: 6
Theory Until now we used the Derive node to deal with numbers, but this type of node can also deal with strings. If we start the expression builder in a Derive node (Fig. 2.43 shows how to do this), we find the category “string” on the left-hand side (see Fig. 2.46). In this section, we want to explain how to use these string functions generally. Separating Substrings The dataset “england_payment_fulltime_2014_reduced” includes the median of the weekly payments in different UK regions. The source of the data is the UK Office for National Statistics and its website NOMIS UK (2014). The data are based on an annual workplace analysis coming from the Annual Survey of Hours and Earnings (ASHE). For more details see Sect. 12.1.14. Figure 2.47 shows some records, and Table 2.3 shows the different region or area codes. We do not want to examine the payment data, however. Instead, we want to extract the different area types from the first column. As shown in Fig. 2.47, in the first column the type of the region and the names are separated by an “:”. Now we use different string functions of the Modeler to extract the type. We will explain three “calculations” to get the region type. Later in the exercise we want to extract the region names. 1. We open the stream “Template-Stream England payment 2014 reduced” to get access to the data. The aim of the steps that then follow is to extend this stream. At the end there should be a variable with the type of the area each record represents. 2. After the Type node we add a Derive node and connect both nodes (see Fig. 2.48).
Fig. 2.46 Expression builder of a Derive node
Fig. 2.47 Weekly payments in different UK regions
Table 2.3 Area codes
ualad09 – District
pca10 – Parliamentary constituencies
gor – Region
Fig. 2.48 Derive node is added to the template stream
3. Double-clicking on the Derive node, we can define the name of the new variable calculated there. We use the name “area_type_version_1” to distinguish the different results from each other (Fig. 2.49). 4. The formula here is: startstring(locchar(“:”, 1, admin_description)-1,admin_description)
The function “startstring” needs two parameters. As the extraction procedure always starts on the first character of the string, we need to only define the number of characters to extract. This is the first parameter. Here we use “locchar(‘:’,1, admin_description)” to determine the position of the “:” We subtract one to exclude the colon from the result. The second parameter of “startstring” tells this procedure to use the string “admin_description” for extraction. 5. After the Derive node, we add a Table node to show the results (see the upper part of Fig. 2.50). If we run the Table node, we get a result as shown in Fig. 2.51. We can find the area type in the last column. 6. Of course we could now calculate the frequencies of each type here. Instead of this, we want to show two other possible ways to extract the area types. As depicted in the final stream in Fig. 2.50, we therefore add two Derive nodes and two Table nodes. 7. In the second Derive node, we use the formula substring(1,locchar(“:”,1, admin_description)-1,admin_description)
The function “substring” also extracts parts of the values represented by the variable “admin_description”, but the procedure needs three parameters:
Fig. 2.49 Parameters of the first Derive node
Fig. 2.50 Final stream “string_functions”
Fig. 2.51 Table node with the extracted area type in the last column
(1) Where to start the extraction: in this case, position 1 (2) The number of characters to extract: “locchar(‘:’,1, admin_description)-1” (3) Which string to extract from: admin_description 8. In the third Derive node, we used a more straightforward approach. The function is “textsplit(admin_description,1,‘:’)”
It separates the first part of the string and stops at the colon.
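All three Derive node formulas extract the text in front of the first colon. The same logic is easy to mirror in plain Python, which can help when checking the results; the example value below is only an assumed string of the form “type:name” shown in Fig. 2.47.

    # Example value of the form "type:name"
    admin_description = "ualad09:Hartlepool"

    # Equivalent of startstring(locchar(":", 1, admin_description) - 1, admin_description)
    pos = admin_description.find(":")
    area_type_version_1 = admin_description[:pos]

    # Equivalent of textsplit(admin_description, 1, ":")
    area_type_version_3 = admin_description.split(":", 1)[0]

    print(area_type_version_1, area_type_version_3)   # both print "ualad09"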
2.7.4 Extracting/Selecting Records
Description of the model
Stream name: selecting_records
Based on dataset: benchmark.xlsx
Related exercises: 4
Theoretical Background
Datasets with a large number of records are common. Often not all of the records are useful for data mining purposes, so there should be a way to determine the records that meet a specific condition. Therefore, we want to have a look at the Modeler’s Select node.
Filtering Processor Data Depending on the Manufacturer’s Name
The file “benchmark.xlsx” contains a list of AMD and Intel processors. Table 2.4 explains the variables. For more details see also Sect. 12.1.4. Here, we want to extract the processors that are produced by the firm Intel.
1. We use the template stream “Template-Stream_processor_benchmark_test” as a starting point. Figure 2.52 shows the structure of the stream.
2. Running the Table node, we find 22 records. The first column of Fig. 2.53 shows whether the processors are manufactured by Intel or AMD.
3. To extract the records related to Intel processors, we add a Select node from the Modeler’s Record Ops tab and connect it with the Excel File node (see Fig. 2.54).
4. To modify the parameters of the Select node, we double-click on it. In the dialog window we can start the expression builder using the button on the right-hand side. It is marked with an arrow in Fig. 2.55. In the expression builder (Fig. 2.56), we first double-click on the variable “firm” and add it to the command window. Then we can extend the statement manually by defining the complete condition as: firm = “Intel”. Figure 2.56 shows the complete statement in the expression builder. We can confirm the changes with “OK” and close all Select node dialog windows.
5. Finally, we should add a Table node behind the Select node to inspect the selected records. Figure 2.57 shows the final stream.
Table 2.4 Variables in dataset “benchmark.xlsx”
Firm – name of the processor company
Processor type – name of the processor
EUR – price of the processor
CB – score of the Cinebench 10 test
Fig. 2.52 Structure of “Template-Stream_processor_benchmark_test”
Fig. 2.53 Processor data in “benchmark.xlsx”
Fig. 2.54 Select node is added to the initial stream
Fig. 2.55 Dialog window of the Select node
Fig. 2.56 Expression Builder with the Selection statement
Fig. 2.57 Stream “selecting_records”
Fig. 2.58 Selected Intel processor data
Running the Table node at the end of the stream, we find the 12 selected records of Intel processors (see Fig. 2.58).
A Select node can be used to identify records that meet a specific condition. The condition can be defined by using the expression builder.
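The Select node condition firm = “Intel” corresponds to a simple row filter. For comparison, the same selection could be done in Python/pandas as sketched below; this is only an illustration, and the exact spelling of the column name may differ in the Excel file.

    import pandas as pd

    # Read the benchmark data (assuming the workbook is accessible as an xlsx file)
    df = pd.read_excel("benchmark.xlsx")

    # Equivalent of the Select node condition: firm = "Intel"
    intel_only = df[df["firm"] == "Intel"]

    print(len(intel_only))      # should report the 12 Intel processors
    print(intel_only.head())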
2.7.5 Filtering Data
Description of the model
Stream name: Filter_processor_dataset
Based on dataset: benchmark.xlsx
Related exercises: 7
Theoretical Background In data mining we often have to deal with many variables, but usually not all of them should be used in the modeling process. The record ID or the names of objects that are often included in each dataset are good examples for that. With the ID we can specify certain objects or records, but neither the ID nor the name is useful for the statistical modeling process itself. To reduce the number of variables or to assign another name to a variable, the Filter node can be used. This is particularly necessary if we would like to cut down the number of variables in a specific part of a stream. Filtering the IDs of PC Processors Benchmark tests are used to identify the performance of computer processors. The file “benchmark.xlsx” contains a list of AMD and Intel processors. Alongside the price in Euro and the result of a benchmark test performed with the test application Cinebench 10 (“CB”), the name of the firm and the type of processor are also included. For more details see Sect. 12.1.4. Of course we do not need the type of the processor to examine the correlations between price and performance, etc. So we can eliminate this variable from the calculations in the stream. The name of the firm is important though, as it will be used to create a scatterplot for the processors coming from Intel or from AMD. We therefore should not remove this variable completely.
"
A Filter node can be used to (1) reduce the number of variables in a stream and (2) rename variables. The same functionality is also available in the Source nodes of the Modeler, but an extra Filter node should be used for transparency reasons, or if the number of variables should be reduced only in a specific part of the stream (and not in the whole one).
"
The Filter node does not filter the records! It only reduces the number of variables.
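The distinction can also be illustrated outside the Modeler. The following is only a minimal sketch in Python/pandas, assuming the column names of the benchmark dataset described above; a Filter node corresponds to dropping and renaming columns, while the rows stay untouched.

```python
import pandas as pd

# Assumption: column names as in the benchmark dataset described above.
df = pd.read_excel("benchmark.xlsx")

# A Filter node removes and/or renames variables (columns); it never removes records (rows).
filtered = (
    df.drop(columns=["processor type"])        # exclude a variable
      .rename(columns={"EUR": "price_eur",     # rename variables
                       "CB": "cinebench_score"})
)

print(filtered.columns.tolist())  # the remaining, partly renamed variables
print(len(filtered) == len(df))   # True: the number of records is unchanged
```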
We use the stream "Correlation_processor_benchmark". A detailed description of this stream and its functionality can be found in Exercises 2 and 3 of Sect. 4.8. Figure 2.59 shows the stream. Here we calculated the Pearson correlation coefficient for the variables "EUR" (price) and "CB" (Cinebench result). To reduce the number of variables in the analytical process, we now want to integrate a Filter node between the Excel node and the Type node.
Fig. 2.59 Stream “Correlation_processor_benchmark”
1. We open the stream "Correlation_processor_benchmark", which is to be modified.
2. We use "File/Stream Save as . . ." to save the stream with another name, e.g., "Filter_processor_dataset".
3. Now we remove the connection between the Excel node and the Type node by right-clicking on it (on the connection, not on the nodes themselves!). We use the option "Delete Connection".
4. Now we insert a Filter node from the "Field Ops" tab and place it between the Excel node and the Type node (see Fig. 2.60).
5. Finally, we connect the new nodes in the right directions: we connect the Excel node with the Filter node and the Filter node with the Type node. Figure 2.60 shows the result.
6. To understand the functionality of the Filter node, we should add another Table node and connect it with the Filter node (see Fig. 2.60).
7. To exclude some variables, we now double-click on the Filter node. Figure 2.61 shows the dialog window with all four variables.
8. To exclude the variable "processor type" from the part of the stream behind the Filter node and from the analysis process, we click on the second arrow. Figure 2.62 shows the result.
9. We can now click "OK" to close the window.
Fig. 2.60 Stream “Filter_processor_dataset”
Fig. 2.61 Parameters of a Filter node
Fig. 2.62 Excluding variables using a Filter node
10. To check the functionality of the Filter node, we should now inspect the dataset before and after its usage. To do this we double-click on the left as well as on the right Table node, as Fig. 2.63 visualizes. Figures 2.64 and 2.65 show the results.
Fig. 2.63 Table nodes to compare the variables in the original dataset and those behind the Filter node
Fig. 2.64 Variables of the original “benchmark.xlsx” dataset
Fig. 2.65 Filtered variables of the “benchmark.xlsx” dataset
Fig. 2.66 Using a Filter node to modify variable names
"
A Filter node can be used to exclude variables and to rename a variable without modifying the original dataset.
"
If we instead would like to identify records that meet a specific condition, then we have to use the Select node!
The names of the variables are sometimes hard to figure out. If we would like to improve the description and to modify the name, we can use the Filter node too. We open the Filter node and overwrite the old names. Figure 2.66 shows an example.
Fig. 2.67 Excel Input node options with a disabled variable
Unfortunately, the stream now won't work correctly anymore! We also have to adjust this variable name in the following nodes or formulas of the stream. In the end, we can summarize that renaming a variable makes sense and is best done in a Filter node, but the dependent nodes must be adjusted accordingly. Here, we have explained how to use the Filter node in general, but there are also other options for reducing the number of variables. If we would like to reduce them for the whole stream, we can also use options in the Source node itself, e.g., in an Excel Input node or a Variable File node. Figure 2.67 shows how to disable the variable "processor type" directly in the Excel File node. "
The parameters of the Source nodes can be used to reduce the number of variables in a stream. We suggest, however, adding a separate Filter node right behind a Source node, to create easy-to-understand streams.
2.7.6 Data Standardization: Z-Transformation
Description of the model
Stream name: verifying_z_transformation
Based on dataset: england_payment_fulltime_2014_reduced.xls
Stream structure
Related exercises: 9
Theory
In Chap. 4, we will discuss tasks in multivariate statistics. In such an analysis, more than one variable is used to assess or explain a specific effect. Here, and also in the following chapter, we are interested in determining the strength with which a variable contributes to the result/effect. For this reason, and also to interpret the variables themselves more easily, we should rescale the variables to a specific range. In statistics we can distinguish between normalization and standardization. To normalize a variable, all values are transformed with

x_norm = (x_i - x_min) / (x_max - x_min)

to the interval [0, 1]. In statistics, however, we more often use the so-called standardization or z-transformation to equalize the range of each variable. After the transformation, all variables spread around zero. First we have to determine the mean x̄ and the standard deviation s of each variable. Then we use the formula

z_i = (x_i - x̄) / s

to standardize each value x_i. The result is z_i. These values z_i have interesting characteristics. First, we can compare them in terms of standard deviations. Second, the mean of z_i is always zero and the standard deviation of z_i is always one.
In addition, we can identify outliers easily: standardizing the values means that we can interpret the z-values in terms of multiples of the standard deviations they are away from the mean. The sign tells us the direction of the deviation from the mean to the left or to the right. "
Standardized values (z values) are calculated for variables with different dimensions/units and make it possible to compare them. Standardized values represent the original values in terms of the distance from the mean in standard deviations.
"
In a multivariate analysis, z-standardized values should be used. It helps to determine the importance of the variables for the results.
"
A standardized value of, e.g., -2.18 means that the value is 2.18 standard deviations away from the mean to the left. The standardized values themselves always have a mean of zero and a standard deviation of one.
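The two formulas above can be made concrete with a few lines of code. This is only a minimal sketch outside the Modeler, using hypothetical pretest scores and assuming Python with pandas; the workflow of this section itself stays entirely within the Modeler nodes described below.

```python
import pandas as pd

# Hypothetical pretest scores; any numeric variable works the same way.
pretest = pd.Series([45, 62, 84, 51, 70], name="pretest")

# Normalization to the interval [0, 1]
normalized = (pretest - pretest.min()) / (pretest.max() - pretest.min())

# Z-standardization: subtract the mean, divide by the standard deviation
z = (pretest - pretest.mean()) / pretest.std()

print(round(z.mean(), 10), round(z.std(), 10))  # mean 0, standard deviation 1
```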
Standardizing Values
In this section, we want to explain the procedure for "manually" calculating the z values. We also want to explain how the Modeler can be used to do the calculation automatically. This will lead us to some functionalities of the Auto Data Prep(aration) node. We use the dataset "test_scores", which represents test results in a specific school; see also Sect. 12.1.36. These results should be standardized and the calculated values should be interpreted. We use the template stream "Template-Stream test_scores" to build the stream. Figure 2.68 shows the initial status of the stream.
Fig. 2.68 Template stream “test_scores”
1. We open the stream "Template-Stream test_scores" and save it using another name.
2. To inspect the original data, we add a Data Audit node and connect it with the Statistics node (see Fig. 2.69).
3. We can now inspect the data by double-clicking on the Data Audit node. After that, we use the run button to see the details in Fig. 2.70. In the measurement column, we can see that the variable "pretest" is continuous. Furthermore, we can find the mean of 54.956 and the standard deviation of 13.563.
Fig. 2.69 Template stream with added Data Audit node
Fig. 2.70 Details of the results in the Data Audit node
We will use these measures later to standardize the values. For now, we can close the dialog window in Fig. 2.70 with "OK".
4. We first want to standardize the values of the variable "pretest" manually, so that we can understand how the standardization procedure works. As shown in Sect. 2.7.2, we therefore add a Derive node and connect it with the Type node. Figure 2.71 shows the name of the new variable in the Derive node, that is, "pretest_manual_standardized". As explained in the first paragraph of this section, we standardize values by subtracting the mean and then dividing the difference by the standard deviation. The formula is "(pretest-54.956)/13.563". We should keep in mind that this procedure is for demonstration purposes only! It is never appropriate to use fixed values in a Derive node, because the values in the dataset can be different each time; the fixed values would then no longer fit and the results would be wrong. Unfortunately, in this case we cannot substitute the fixed mean of 54.956 with a built-in function: as shown in Sect. 2.7.9, the predefined function "mean_n" calculates the average of values in a row, using a list of variables, whereas here we would need the mean of a column, i.e., of a variable. Figure 2.72 shows the actual status of the stream.
5. To show the results we add a Table node behind the Derive node.
Fig. 2.71 Parameters of the Derive node to standardize the pre-test values
Fig. 2.72 Derive node is added
Fig. 2.73 Standardized Pre-test results
Figure 2.73 shows some results. We find that the pretest result of 84 points equals a standardized value of (84 - 54.956)/13.563 = +2.141. That means that 84 points is 2.141 standard deviations away from the mean to the right. It is outside the 2s-interval (see also Sect. 3.2.7) and is therefore a very good test result!
"
We strongly suggest not using fixed values in a calculation or a Derive node! Instead, such measures should be calculated using the built-in functions of the Modeler.
6. Finally, to check the results, we add a Data Audit node to that sub-stream (see Fig. 2.74).
7. As explained, the standardized values have a mean of zero and a standard deviation of one. Apart from a small rounding deviation of the mean from exactly zero, Fig. 2.75 shows the expected results.
8. To use the automatic procedure for calculating the z-values described below, we have to make sure that the variable "pretest" is defined as continuous and as an input variable. To check this, we double-click on the Type node. Figure 2.76 shows that the correct options are being used in the template stream.
Fig. 2.74 Stream with Table and Data Audit node
Fig. 2.75 Data Audit results for the standardized values
Fig. 2.76 Type node settings
9. As explained above, it is error-prone to use fixed values (e.g., the mean and the standard deviation of the pretest results) in the Derive node. So there should be a possibility to standardize values automatically in the Modeler. Here we want to discuss one of the functionalities of the Auto Data Prep(aration) node for this purpose. We select such an Auto Data Preparation node from the Field Ops tab of the Modeler toolbar, add it to the stream, and connect it with the Type node (see Fig. 2.77).
10. If we double-click on the Auto Data Preparation node, we find an overwhelming number of different options (see Fig. 2.78). Here we want to focus on the preparation of the input variables. These variables have to be continuous. We checked both assumptions by inspecting the Type node parameters in step 8.
11. Now we activate the tab "Settings" in the dialog window of the Auto Data Preparation node (see Fig. 2.78). Additionally, we choose the category "Prepare Inputs and Target" on the left-hand side. At the bottom of the dialog window, we activate the option "Put all continuous fields on a common scale (highly recommended if feature construction will be performed)".
12. We can now close the dialog window of the Auto Data Preparation node. Finally, we add a Table node as well as a Data Audit node to this part of the stream, to show the results of the standardization process. Figure 2.79 shows the final stream.
13. Figure 2.80 shows the results of the standardization procedure using the Auto Data Preparation node. We can see that the new variable name is "pretest_transformed". We can define the name extension by using the Field Names settings on the left-hand side in Fig. 2.78.
Fig. 2.77 Auto Data Preparation node is added to the stream
Fig. 2.78 Auto Data Preparation node parameters to standardize continuous input variables
Fig. 2.79 Final stream “verifying_z_transformation”
Fig. 2.80 Table node with the results of the Auto Data Preparation procedure
The standardized values are, however, the same as those calculated with the Derive node before. We can find the value 2.141 in Fig. 2.80; it is the same as that shown in Fig. 2.73. Scrolling through the table, we can see that the variable "pretest" is no longer presented here. It has been replaced by the standardized variable "pretest_transformed". "
The Auto Data Preparation node offers a lot of options. Among other things, it can be used to standardize input variables. For this, the variables must be defined as input variables, using a Type node in the stream.
"
Furthermore, only continuous values can be standardized here. We should make sure to check the status of a variable as “continuous input variable”, before we use an Auto Data Preparation node.
"
In the results, transformed variables replace the original variables.
2.7.7 Partitioning Datasets
Description of the model
Stream name: partitioning_data_sets
Based on dataset: test_scores.sav
Stream structure
Related exercises: 9
Theoretical Background
In data mining, the concept of cross-validation is often used to test a model for its applicability to unknown and independent data and to determine the parameters for which the model best fits the data. For this purpose, the dataset has to be split into several parts: a training dataset, a test dataset, and a validation dataset. The training dataset is used to fit the model, while the smaller validation and test datasets are used for finding the optimal parameters and for testing the accuracy of the fitted model on unseen data. This is depicted in Fig. 2.81. The process of cross-validation is described in more detail in Sect. 5.1.2.
Partitioning a Dataset into Two Subsets
We would like to start by separating two subsets based on the dataset "test_scores.sav". For this we use the template stream "Template-Stream test_scores" (see Fig. 2.82).
1. We open the stream "Template-Stream test_scores" and save it using another name.
2. To inspect the original data, we add a Data Audit node and connect it to the Statistics node (see Fig. 2.83).
3. Let's have a first look at the data by double-clicking the Data Audit node. After that we use the Run button to see the details in Fig. 2.84. In the column with the valid number of values per variable, we find the sample size of 2133 records. We will compare this value with the sample sizes of the subsets after dividing the original dataset. For now we can close the dialog window in Fig. 2.84 with "OK".
4. After the Type node, we must add a Partition node. To do so, we first activate the Type node by clicking on it once. Then we double-click on a Partition node in the Modeler tab "Field Ops". The new Partition node is now automatically connected with the Type node (Fig. 2.85).
Fig. 2.81 Training, validation, and test of models
Fig. 2.82 Template stream “test_scores”
Fig. 2.83 Template stream with added Data Audit node
5. Now we should adjust the parameters of the Partition node. We double-click on the node and modify the settings as shown in Fig. 2.86. We can define the name of the new field that contains the labels of the partitions. Here we use the name "Partition".
Fig. 2.84 Details of the results in the Data Audit node
Fig. 2.85 Extended template stream with added Partition node
In addition, we would like to create two subsets, training and test, so we use the option with the name "Train and test". After that we should determine the percentage of values in each partition; normally 80/20 is a good choice. With the other options, we can determine the labels and values of the subsets or partitions. Unfortunately, the Modeler only allows strings as values. Here, we choose "1_Training" and "2_Testing". See Fig. 2.86.
6. In the background, the process will assign the records to the specific partitions randomly. That means that in each trial the records are assigned differently, and the measures calculated in each partition also differ slightly. To avoid this,
Fig. 2.86 Parameters of the Partition node
we could determine the initial value of the random generator by activating the option “Repeatable partition assignment”, but we do not use this option here. 7. There are no other modifications necessary. So we can close the dialog with “OK”. "
The Partition node should be used to define: (i) A training and a test subset or (ii) A training, test, and validation subset.
"
All of them must be representative. The Modeler determines which record belongs to which subset and assigns the name of the subset to each record.
"
Normally, the partitioning procedure will assign the records randomly to the subset. If we would like to have the same result in each trial, we should use the option “Repeatable partition assignment”.
8. To understand what happens if we use the partitioning method, we add a Table node as well as a Distribution node and connect them with the Partition node. In the Distribution node, we use the new variable “Partition” to get the frequency of records per subset (see Fig. 2.87).
Fig. 2.87 Distribution node settings
Fig. 2.88 Final stream “partitioning_data_sets”
Figure 2.88 shows the final stream.
9. We can inspect the assignment of the records to the partitions by scrolling through the records themselves. To do this we double-click on the Table node to the right of the stream. With the Run button we get the result shown in Fig. 2.89. In the last column, we find the name of the partition to which each record belongs. As outlined above, here we can only define strings as values. The result is different each time, because the records are selected randomly in the background. We can now close the window.
Fig. 2.89 Table with the new variable “Partition” (RANDOM VALUE!)
Fig. 2.90 Frequency distribution for the new variable “Partition” (RANDOM VALUES)
10. The Table node helps us to understand the usage of the Partition node. Nevertheless, we should also analyze the frequency distribution of the variable "Partition". We run the Distribution node and get a result similar to Fig. 2.90. The frequencies are random here as well and only approximate the 80/20 ratio defined by the parameters of the Partition node in Fig. 2.86. We can see, however, that each record is assigned to exactly one subset, because the sum of the frequencies 1690 + 443 equals the sample size of 2133, as determined by the Data Audit node in Fig. 2.84.
In Sect. 5.3.7, the procedure to divide a set into three subsets (training, validation, and test) will be discussed.
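The logic of the Partition node can also be sketched outside the Modeler. The following Python/pandas snippet is only an illustration, assuming that "test_scores.sav" can be read with pandas (which relies on the pyreadstat package); it is not part of the stream built above.

```python
import numpy as np
import pandas as pd

# Assumption: the SPSS file can be read via pandas/pyreadstat.
df = pd.read_spss("test_scores.sav")

# Randomly label about 80% of the records as training and about 20% as testing.
# Fixing the seed mimics the "Repeatable partition assignment" option.
rng = np.random.default_rng(seed=42)
df["Partition"] = np.where(rng.random(len(df)) < 0.8, "1_Training", "2_Testing")

print(df["Partition"].value_counts())  # approximately an 80/20 split
```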
2.7.8 Sampling Methods
Theory
In the previous section, we explained methods to divide the data into several subsets or partitions. Every record belongs to one of the subsets, and no record (except outliers) is removed from the sample. Here too, an understanding of the term "representative sample" is particularly important: the sample and the objects in the sample must have the same characteristics as the objects in the population, so the frequency distribution of the variables of interest is the same in the population as in the sample.
In data mining, however, we have to deal with complex procedures, e.g., regression models or cluster analysis. Normally, the researcher is interested in using all the information in the data and therefore all the records that are available. Unfortunately, the more complex methods need powerful computer performance to calculate the results in an appropriate time. In addition, the statistical program has to handle all the data and also has to reorder, sort, or transform the values. Sometimes, the complexity of these routines exceeds the capacity of the hardware or the time available. So we have to reduce the number of records using a well-thought-out and intuitive selection procedure. Despite the many variants, we can focus on three general techniques. Table 2.5 gives an overview.
The SPSS Modeler offers a wide range of sampling techniques that help with all these methods. In the following examples, we want to show how to use the Sample node to realize the sampling methods as outlined.
Table 2.5 Sampling techniques
Random sampling: The records in the dataset all have the same, predefined chance of being selected.
Stratified sampling: An intelligent subtype of the random sampling procedures that reduces sampling error. Strata are parts of the population that have at least one characteristic in common, e.g., the gender of respondents. After the most important strata are identified, their frequencies are determined. Random sampling is then used to create a subsample in which the values are represented with approximately the same proportions. In this particular sense, the subset is a representation of the original dataset; this applies only to the selected characteristics that should be reproduced.
Systematic sampling: Every n-th record is selected. Assuming that the values in the original list have no hidden order, there is no unwanted pattern in the result.
"
The key point of sampling is to allow the usage of complex methods, by reducing the number of records in a dataset. The subsample must be unbiased and representative, to come to conclusions that correctly describe the objects and their relationship to the population.
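Before walking through the Modeler stream, the two simple methods used below, "1-in-n" and "Random %", can be illustrated in a few lines of code. This is only a hedged sketch outside the Modeler, assuming the test_scores file can be read with pandas (via pyreadstat).

```python
import numpy as np
import pandas as pd

df = pd.read_spss("test_scores.sav")  # assumption: readable via pyreadstat

# "1-in-n": keep every 2nd record (deterministic; this relies on the data
# having no hidden order).
one_in_n = df.iloc[::2]

# "Random %": include each record independently with probability 0.5, so the
# resulting sample size varies around half of the original size.
rng = np.random.default_rng(seed=1)
random_pct = df[rng.random(len(df)) < 0.5]

print(len(df), len(one_in_n), len(random_pct))
```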
Simple Sampling Methods in Action
Description of the model
Stream name: sampling_data
Based on dataset: test_scores.sav
Stream structure
Related exercises:
Here we want to use the dataset "test_scores.sav". The values represent the test scores of students in several schools with different teaching methods. For more details see Sect. 12.1.36.
1. We start with the template stream "Template-Stream test_scores" shown in Fig. 2.91.
2. We open the stream "Template-Stream test_scores" and save it using another name.
3. To inspect the original data, we run the Table node. Then we can activate the labels by using the button in the middle of the toolbar, as shown in Fig. 2.92. We find some important variables that can be used for stratification. Furthermore, we find the sample size of 2133 records.
4. Later we want to check if the selected records are representative of the original dataset. There are a lot of methods that can be used to do this. We want to have a
Fig. 2.91 Template stream “test_scores”
Fig. 2.92 Records of “test_scores” dataset
look at the variables "pre test" and "post test", especially at the mean and the standard deviation. We add a Data Audit node to the stream and connect it with the Source node. Figure 2.93 shows the extended stream, and in Fig. 2.94 we can find the mean and the standard deviation of "pre test" and "post test".
5. To sample the dataset, we add a Sample node from the Record Ops tab and connect it with the Type node. In addition, we extend the stream by adding a Table node and a Data Audit node behind the Sample node. This gives us the chance to show the result of the sampling procedure. Figure 2.95 shows the actual status of the stream.
Fig. 2.93 Added Data Audit node to the stream
Fig. 2.94 Mean and standard deviation of “pre-test” and “post-test”
So far we have not selected any sampling procedure in the Sample node, but what we can see is that the Sample node reduces the number of variables by one. The reason is that in the Type node the role of the “student_id” is defined as “None” for the modeling process (see Fig. 2.96). It is a unique identifier and so we do not need it in the subsamples for modeling purposes. 6. Now we can analyze and use the options for sampling provided by the Sample node. Therefore, double-click on the Sample node. Figure 2.97 shows the dialog window.
Fig. 2.95 Extended stream to sample data
Fig. 2.96 Role definition for “student_id” in the Type node
7. The most important option in the dialog window in Fig. 2.97 is the "Sample method". Here we have activated the option "Simple", so that the other parameters will indeed be simple to interpret. The "Mode" option determines whether the records selected by the other parameters are included in or excluded from the sample. Normally we should use "Include sample" here.
Fig. 2.97 Parameters of a Sample node
The option "Sample" has three parameters to determine what happens in the selection process. "First" simply cuts off the sample after the specified number of records. This option is only useful if we can be sure there is definitely no pattern in the dataset; only then are the first n records also representative of the whole dataset. The option "1-in-n" selects every n-th record, and the option "Random %" randomly selects n % of the records.
8. Here we want to reduce the dataset by 50%. So we have two choices: either we select every second record or we choose 50% of the records randomly. We start with the "1-in-n" option, as shown in Fig. 2.98. We could also restrict the number of records by using the "Maximum sample size", but this is not necessary here. We confirm the settings with "OK".
9. The Table node at the end of the stream tells us that the number of records selected is 1066 (see Fig. 2.99).
10. The corresponding Data Audit node in Fig. 2.100 shows us that the mean and the standard deviation of "pre test" and "post test" differ slightly from the original values in Fig. 2.94.
11. To check the usage of the option "Random %" in the Sample node, we add another Sample node as well as a new Data Audit node to the stream (see Fig. 2.101).
12. If we run the Data Audit node at the bottom of Fig. 2.101, we can see that the number of records differs each time.
Fig. 2.98 Parameter “1-in-n” of a Sample node
Fig. 2.99 Number of records selected shown in the Table node
Fig. 2.100 Details of the sampled records in the Data Audit node
Fig. 2.101 Final stream
Note that half of the original 2133 records, i.e., 2133/2 = 1066.5 records, should be selected; sometimes the actual sample size, e.g., 1034, differs noticeably from this value. "
The Sample node can be used to reduce the number of records in a sample. To create a representative sub-sample, the simple sampling methods “1-in-n” or “Random %” can be used. The number of records selected can be restricted. Variables whose roles are defined as “None” in a
Source or Type node are excluded from the sampling process. Using the option “Random %”, the sample size differs from the defined percentage.
Complex Sampling Methods
Description of the model
Stream name: sampling_data_complex_methods
Based on dataset: sales_list.sav
Stream structure
Related exercises:
Random sampling can avoid unintentional patterns appearing in the sample, but very often random sampling also destroys patterns that are useful and necessary in data mining. Recalling the definition of the term "representative sample", which we discussed at the beginning of this section, we have to make sure that "the frequency distribution of the variables of interest is the same in the population and in the sample". Obviously, we cannot be sure that this is the case if we select each object randomly. Consider the random sampling of houses on sale from an internet database; the regions in the sample may not be distributed as in the whole database. The idea is therefore to add constraints to the sampling process, to ensure the representativeness of the sample. The concept of stratified sampling is based on the following preconditions:
– Each object/record of the population is assigned to exactly one stratum.
– All strata together represent the population and no element is missing.
Stratification helps to reduce sampling error. In the following example, we want to show more complex sampling methods using shopping data from a tiny shop. This
gives us the chance to easily understand the necessity for the different procedures. The dataset was created by the authors based on an idea by IBM (2014, p. 57). For more details see Sect. 12.1.33.
1. Let's start with the template stream "009_Template-Stream_shopping_data". We open it in the SPSS Modeler (see Fig. 2.102).
2. The predefined Table node gives us the chance to inspect the given dataset. We double-click and run it. We get the result shown in Fig. 2.103. As we can see, these are the transactions of customers in a tiny shop.
Fig. 2.102 Template Stream shopping data
Fig. 2.103 Data of “sales_list.sav”
If we use random sampling here (even though it is clearly not necessary to reduce the number of records), we could create a smaller sample, but, e.g., the distribution of the gender would not be the same or representative.
3. To use stratified sampling, we add a Sample node behind the Type node and connect both nodes (see Fig. 2.104).
4. By double-clicking on it, we get the dialog window shown in Fig. 2.105. We must then activate the option "Complex" using the radio button in the option "Sample method". Please keep in mind the defined sample size of 0.5 = 50% in the middle of the window!
5. We open another dialog window by clicking on the button "Cluster and Stratify . . ." (see Fig. 2.106). We can now add "gender" to the "Stratify by" list. To do so, we use the button on the right-hand side, marked in Fig. 2.106 with an arrow. We can now close the dialog window with "OK".
6. If we want to add an appropriate label/description to the Sample node, we can click on the "Annotations" tab and add a name with the option "Custom" (see Fig. 2.107). We used "Stratified by Gender". Now also close this window with "OK".
7. To scroll through the records in the sampled subset, we must add another Table node at the end of the stream and connect it with the Sample node (see Fig. 2.108).
8. Running this Table node gives us the result shown in Fig. 2.109. Remembering the sample size of 50% as defined in the Sample node in Fig. 2.105, we can accept the new size of six records. In the original dataset, we had records (not transactions!) of eight female and four male customers. In the
Fig. 2.104 Sample node is added
Fig. 2.105 Parameters of a Sample node
Fig. 2.106 Cluster and Stratify options of a Sample node
Fig. 2.107 Defining a name for the Sample node
Fig. 2.108 Table node is added to the stream
new one, we can find records of four female and two male customers. So the gender proportions, based on the number of records, are the same. This is exactly what we want to ensure by using stratified sampling.
9. So far stratified sampling seems to be clear, but we also want to explain the option of defining individual proportions for the strata in the sample. To duplicate the existing sub-stream, we can simply copy the Sample and the Table node. For this we have to first mark both nodes by clicking on them once
Fig. 2.109 Result of stratified sampling
Fig. 2.110 Copied and pasted sub-stream
while pressing the Shift key. Alternatively, we can mark them with the mouse by drawing a virtual rectangle around them. Now we can simply copy and paste them. We get a stream as shown in Fig. 2.110.
10. Now we have to connect the Type node with the new sub-stream: we click on the Type node once and press the F2 key. Then we click on the target node, which in this case is the Sample node. Figure 2.111 shows the result.
Fig. 2.111 Extended stream with two sub-streams
"
Nodes with connections between them can be copied and pasted. To do this the node or a sub-stream must be marked with the mouse, then simply copy and paste. Finally, the new components have to be connected to the rest of the stream.
11. We now want to modify the parameters of the new Sample node. We double-click on it. In the "Cluster and Stratify . . ." option, we defined gender as the variable for the relevant strata (see Fig. 2.105). If we think that records of women are under-represented, then we must modify the strata proportions. In the field "Sample Size" we activate the option "Custom".
12. Now let's click on the "Specify sizes . . ." button. In the new dialog window, we can read the values of the variable "gender" by clicking on "Read Values". This is depicted in Fig. 2.112.
13. To define an individual proportion of women in the new sample, we can modify the sample size as shown in Figs. 2.113 and 2.114. We can then close this dialog window with "OK".
14. Finally, we can modify the description of the node as shown in Fig. 2.115. We can then close this dialog window too.
15. Running the Table node in this sub-stream, we get the result shown in Fig. 2.116. We can see that the number of female customers has increased in comparison with the first sample shown in Fig. 2.109.
Now we want to use another option to ensure we get a representative sample. In the procedure used above, we focused on gender, and in the end we found a representative sample regarding the variable "gender". If we want to analyze products that are often sold together, however, we can consider reducing the
Fig. 2.112 Definition of individual strata proportions—step 1
Fig. 2.113 Definition of individual strata proportions—step 2
sample size, especially in the case of an analysis of huge datasets. Here a stratified sample related to gender is useless. We have to make sure that all the products sold together are also in the result. In this scenario, it is important to understand the characteristics of a flat table database scheme: as shown once more in Fig. 2.117, the first customer bought three products, but the purchase is represented by three records in the table. So it is not appropriate to sample the records randomly based on the customer_ID. If one record is selected then all the records of the same purchase must also be assigned to the new subset or partition.
Fig. 2.114 Definition of individual strata proportions—step 3
Fig. 2.115 Definition of a new node name
Fig. 2.116 New dataset with individual strata proportions
Fig. 2.117 Data of “sales_list.sav”
Here we can use the customer_ID as a unique identifier. Sometimes it can be necessary to define another primary key first for this operation. This is a typical example where the clustering option of the Sample node can be used. We want to add a new sub-stream by using the same original dataset. Figure 2.118 shows the actual status of our stream. 16. Now we add another Sample node and connect it with the Type node (see Fig. 2.119). In the parameters of the Sample node, we activate complex
Fig. 2.118 Actual status of the stream
Fig. 2.119 Final stream
Fig. 2.120 Custom Strata options enabled in the Sample node
sampling (see Fig. 2.120) and click on "Cluster and Stratify . . .". Here, we select the variable "customer_ID" in the drop-down list of the "Clusters" option (see Fig. 2.121). By using this option, we make sure that if a record of purchase X is selected, all other records that belong to purchase X will be added to the sample. We can close this dialog window with "OK".
17. Within the parameters of the Sample node, we can define a name for the node, as shown in Fig. 2.122. After that we can close the dialog box for the Sample node parameters.
18. Now we can add a Table node at the end of the new sub-stream. Figure 2.119 shows the final stream.
19. Double-clicking the Table node marked with an arrow in Fig. 2.119, we get a result similar to the one shown in Fig. 2.123. Each trial gives a new dataset because of the random sampling, but the complete purchase of a specific
Fig. 2.121 Definition of a cluster variable in the sampling
Fig. 2.122 Defining a name for the Sample node
Fig. 2.123 New dataset produced by clustering using “customer_ID”
customer is always included in the new sample, as we can see by comparing the original dataset in Fig. 2.117 and the result in Fig. 2.123. The specific structure of the purchases can be analyzed by using the new dataset. "
The Sample node allows stratified sampling that produces representative samples. In general, variables can be used to define strata. Additionally, cluster variables can be defined to make sure that all the objects belonging to a cluster will be assigned to the new dataset, in case only one of them is randomly chosen. Furthermore, the sample sizes of the strata proportions can be individually defined.
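Both complex options discussed above, stratification by a variable and clustering by an identifier, can also be sketched outside the Modeler. The following Python/pandas snippet is only an illustration; the column names "gender" and "customer_ID" and the readability of "sales_list.sav" via pyreadstat are assumptions based on the shopping data described above.

```python
import numpy as np
import pandas as pd

sales = pd.read_spss("sales_list.sav")  # assumed columns: customer_ID, gender, ...

# Stratified sampling: draw 50% of the records within each gender,
# which preserves the gender proportions of the original data.
stratified = sales.groupby("gender", group_keys=False).sample(frac=0.5, random_state=1)

# Cluster sampling: randomly pick half of the customer IDs and keep *all*
# records of the chosen customers, so complete purchases stay together.
ids = sales["customer_ID"].unique()
rng = np.random.default_rng(seed=1)
chosen = rng.choice(ids, size=len(ids) // 2, replace=False)
clustered = sales[sales["customer_ID"].isin(chosen)]

print(stratified["gender"].value_counts())
print(clustered["customer_ID"].nunique(), "customers in the cluster sample")
```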
2.7.9 Merge Datasets
Description of the model
Stream name: merge_England_payment_data
Based on datasets: england_payment_fulltime_2013.csv, england_payment_fulltime_2014_reduced.xls
Stream structure
Related exercises: 12, 14
In practice, we often have to deal with datasets that come from different sources or that are divided into different partitions, e.g., by years. As we would like to analyze all the information, we have to consolidate the different sources first. To do so, we need a primary key in each source, which we can use for that consolidation process. "
A primary key is a unique identifier for each record. It can be a combination of more than one variable, e.g., of the name, surname, and birthday of a respondent, but we recommend avoiding such "difficult" primary keys whenever possible. They are hard to deal with and most likely error-prone. Instead of combining variables, we should always try to find a variable with unique values. Statistical databases usually offer such primary keys.
Considering the case of two datasets, with a primary key in each subset, we can imagine different scenarios:
1. Merging the datasets to combine the relevant rows and to get a table with more columns, or
2. Adding rows to a subset by appending the other one.
In this section, we would like to show how to merge datasets. Figure 2.124 depicts the procedure. Two datasets should be combined by using one variable as a
Fig. 2.124 Process to merge datasets (inner join)
Table 2.6 Join types
Inner join: Only the rows that both sources have in common are matched.
Full outer join: All rows from both datasets are in the joined table, but a lot of values are not available ($null$) (see also Fig. 2.124).
Partial outer join left: Records from the first-named dataset are in the joined table. From the second dataset, only those with a key that matches a key in the first dataset are copied.
Partial outer join right: Records from the second-named dataset are in the joined table. From the first dataset, only those with a key that matches a key in the second dataset are copied.
Anti-join: Only the records with a key that is not in the second dataset are in the joined table.
primary key. The SPSS Modeler uses this key to determine the rows that should be "combined". Figure 2.124 shows an operation called an inner join: dataset 1 is extended by dataset 2, using a primary key that both have in common, but there are also two keys (3831 and 6887) that each appear in only one of the datasets. The difference between the join types lies in how they handle these cases. Table 2.6 shows the join types that can be found in the Modeler.
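The five join types have direct counterparts in most data handling tools. As a hedged illustration outside the Modeler, the following Python/pandas sketch uses two small, hypothetical tables that share the primary key "area_code"; the column names and values are made up for demonstration only.

```python
import pandas as pd

# Two small, hypothetical tables sharing the primary key "area_code".
d2013 = pd.DataFrame({"area_code": ["A1", "A2", "A3"],
                      "pay_2013": [475.4, 520.0, 430.1]})
d2014 = pd.DataFrame({"area_code": ["A1", "A2", "A4"],
                      "pay_2014": [462.1, 530.5, 440.0]})

inner = d2013.merge(d2014, on="area_code", how="inner")  # keys present in both
outer = d2013.merge(d2014, on="area_code", how="outer")  # all keys, gaps become NaN
left  = d2013.merge(d2014, on="area_code", how="left")   # partial outer join left
right = d2013.merge(d2014, on="area_code", how="right")  # partial outer join right

# Anti-join: records of the first table whose key is missing in the second one
anti = d2013[~d2013["area_code"].isin(d2014["area_code"])]

print(inner, outer, anti, sep="\n\n")
```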
In the following scenario, we want to merge two datasets coming from the UK. For more details see Sect. 12.1.14. Figure 2.125 shows the given information for 2013, and Fig. 2.126 shows some records for 2014. In both sets, a primary key "area_code" can be identified that is provided by the official database. This primary key is unique, because no two areas have the same official code. The relatively complicated variable "admin_description" (administrative description) combines the type of the region with its name. We will separate both parts in Exercise 5 "Extract area names" at the end of this section; here we want to deal with the original values. Looking at Figs. 2.125 and 2.126, it is obvious that, besides the weekly payment, the confidence value (CV) for this variable also exists in both subsets, but there is no variable for the year. We will solve that issue by renaming the variables in each subset, to make clear which variable represents the values for which year. The aim of the stream is to extend the values for 2013 with the additional information from 2014. In the end, we want to create a table with the area code, the administrative description of the area, the weekly gross payment in 2013 and 2014, as well as the confidence values for 2013 and 2014.
Fig. 2.125 England employee data 2013
Fig. 2.126 England employee data 2014
1. We open the template stream "Template-Stream England payment 2013 and 2014 reduced". As shown in Fig. 2.127, there is a Source node in the form of a Variable File node to load the data from 2013. Below it, the data for 2014 are imported by using an Excel File node. We can see here that different types of sources can be used to merge the records afterward (see Fig. 2.127).
2. If we double-click on the Table nodes at the top and at the bottom, we get the values as shown in Figs. 2.125 and 2.126. We can scroll through the records and then close both windows.
3. To be able to exclude several variables in each subset, we add a Filter node behind each Source node. We can find this node type in the Field Ops tab of the SPSS Modeler. Figure 2.128 shows the actual status of the stream.
4. In the end, we want to create a table with the area code, the administrative description of the area, the weekly gross payment in 2013 and 2014, as well as the confidence values for 2013 and 2014. To do so we have to exclude all the other variables; additionally, we must rename the variables so that they get a correct and unique name for 2013 and 2014.
Fig. 2.127 Template-Stream England payment 2013 and 2014 reduced
Fig. 2.128 Filter nodes are added behind each Source node
Figure 2.129 shows the parameters of the Filter node behind the Source node for 2013. In rows three and four, we changed the names of the variables "weekly_payment_gross" and "weekly_payment_gross_CV" by adding the year. Additionally, we excluded all the other variables. To do so we click once on the arrow in the column in the middle.
5. In the Filter node for the data of 2014, we must exclude "admin_description" and "area_name". Figure 2.130 shows the parameters of the second Filter node. The variable names should also be modified here.
6. Now the subsets should be ready for the merge procedure. First we add a Merge node from the "Record Ops" tab to the stream. Then we connect both Filter nodes with this new Merge node (see Fig. 2.131).
7. We double-click on the Merge node. In the dialog window we can find some interesting options. We suggest taking control of the merge process by using the method "Keys". As shown in Fig. 2.132, we use the variable "area_code" as the primary key. With the four options at the bottom of the dialog window, we can determine the type of join operation that will be used. Table 2.6 shows the different join types and a description of each.
Fig. 2.129 Filter node for data in 2013 to rename and exclude variables
8. To check the result we add a Table node behind the Merge node. Figure 2.133 shows the actual stream that will be extended later. In Fig. 2.134 we can find some of the records. The new variable names and the extended records are shown for the first four UK areas.
Fig. 2.130 Filter node for data in 2014 to exclude and rename variables
Fig. 2.131 New Merge node in the stream
Fig. 2.132 Parameters of a Merge node
Fig. 2.133 Stream to merge 2013 and 2014 payment data
Fig. 2.134 Merged data with new variable names
Fig. 2.135 Filter and rename dialog of the Merge node
In general, we could also rename the variables and exclude several of them in the "Filter" dialog of the Merge node. We do not recommend this, however, as we wish to keep the stream's functionality transparent (Fig. 2.135). "
A Merge node should be used to combine different data sources. The Modeler offers five different join types: inner join, full outer join, partial outer join left/right, and anti-join. There is no option to select the leading subset for the anti-join operation.
"
To avoid misleading interpretations, the number of records in the source files and the number of records in the merged dataset must be verified.
"
The names of the variables in both datasets must be unique. Therefore Filter nodes should be implemented before the Merge node is applied. Renaming the variables in the Merge node is also possible, but to ensure a more transparent stream, we do not recommend this option.
"
In the case of a full outer join, we strongly recommend checking the result. If there are records in both datasets that have the same primary key but different values in another variable the result will be an inconsistent dataset. For example, one employee has the ID “1000” but different “street_names” in its address.
"
Sometimes it is challenging to deal with the scale of measurement of the variables in a merged dataset. We therefore suggest incorporating a Type node right after the Merge node, just to check and modify the scales of measurement.
"
If two datasets are to be combined row by row then the Append node should be used (see Sect. 2.7.10).
Stream Extension to Calculate Average Income
To show the necessity of merging two different sources, we want to calculate the average gross income. The manual calculation is easy: for "Hartlepool" in the first row of Fig. 2.134 we get (475.4 + 462.1)/2 = £468.75 per week. Next we will explain how to calculate the average income per week for all regions. In Sect. 2.7.2 we outlined the general usage of a Derive node.
9. From the tab "Field Ops" we add a Derive node to the stream and connect it with the Merge node (see Fig. 2.136).
Fig. 2.136 Derive node is added to the stream
10. With a double-click on the Derive node, we can now define the name of the new variable with the average income. We use "weekly_payment_gross_2013_2014_MEAN", as shown in Fig. 2.137.
11. Finally, we have to define the correct formula using the Modeler's expression builder. To start this tool, we click on the calculator symbol on the right-hand side, as depicted in Fig. 2.137.
Fig. 2.137 Parameters of the Derive node
12. A new dialog window pops up as shown in Fig. 2.138. As explained in Sect. 2.7.2, we can use the formula category list to determine the appropriate function. The category “Numeric” is marked with an arrow in Fig. 2.138. The correct formula to determine the average weekly income for 2013 and 2014 per UK region is: mean_n([weekly_payment_gross_2013,weekly_payment_gross_2014])
We can click "OK" and close the Derive node dialog (Fig. 2.139).
13. Finally, we add another Table node to show the results. The expected result of £468.75 per week for Hartlepool is the last value in the first row of Fig. 2.140.
Fig. 2.138 Using the expression builder to find the correct formula
Fig. 2.139 Final stream
Fig. 2.140 Average income 2013 and 2014 per UK area
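The row-wise average computed by mean_n can also be sketched outside the Modeler. The following Python/pandas snippet is only an illustration with a single hypothetical merged record; the area code value is made up, and only the variable names from the stream above are reused.

```python
import pandas as pd

# Hypothetical merged record; the real stream produces one row per UK area.
merged = pd.DataFrame({
    "area_code": ["X1"],  # placeholder key, not a real ONS code
    "weekly_payment_gross_2013": [475.4],
    "weekly_payment_gross_2014": [462.1],
})

# Row-wise mean over the two payment columns, analogous to
# mean_n([weekly_payment_gross_2013, weekly_payment_gross_2014]).
cols = ["weekly_payment_gross_2013", "weekly_payment_gross_2014"]
merged["weekly_payment_gross_2013_2014_MEAN"] = merged[cols].mean(axis=1)

print(merged)  # (475.4 + 462.1) / 2 = 468.75
```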
2.7.10 Append Datasets
Description of the model
Stream name: append_England_payment_data
Based on datasets: england_payment_fulltime_2013.csv, england_payment_fulltime_2014_reduced.xls
Stream structure
Related exercises: 13
In the previous section, we explained how to combine two datasets column by column. For this we used a primary key. To ensure unique variable names, we had to rename the variables by extending them with the year. Figure 2.141 shows the parameters of a Filter node used for that procedure.
If datasets have the same columns but represent different years, then it should also be possible to append the datasets. To distinguish the subsets in the result, we should extend them with a new variable that represents the year. In the end each row will belong to a specific year, as shown in Fig. 2.142.
In this example we want to use the same datasets as in the previous example: one CSV file with the weekly gross payments for 2013 for different regions in the UK, and one Excel spreadsheet that contains the gross payments for 2014. Now we will explain how we defined a new variable with the year and appended the tables. Figure 2.142 depicts the procedure and the result.
1. We open the template stream "Template-Stream England payment 2013 and 2014 reduced". Figure 2.143 shows a Variable File node for the data from 2013. Below it, the data for 2014 are imported using an Excel File node.
2. If we run the Table nodes at the top and the bottom, we get the values as shown in Figs. 2.144 and 2.145.
Fig. 2.141 Filter node for data in 2014 to exclude and rename variables
Fig. 2.142 Two appended datasets
Fig. 2.143 Template-Stream England payment 2013 and 2014 reduced
Fig. 2.144 England employee data 2013
Fig. 2.145 England employee data 2014
3. Now we need to add a new variable that represents the year in each subset. We add a Derive node and connect it with the Variable File node. We name the new variable "year" and define the formula as the constant value 2013, as shown in Fig. 2.146.
4. We add a second Derive node and connect it with the Excel File node. We use the name "year" here also, but the formula is the constant value 2014 (Fig. 2.147).
Fig. 2.146 Formula in the first Derive node
Fig. 2.147 Formula in the second Derive node
5. To show the results of both operations, we add two Table nodes and connect them with the Derive nodes. Figure 2.148 shows the actual status of the stream. 6. If we use the Table node at the bottom of Fig. 2.148, we get the records shown in Fig. 2.149.
Fig. 2.148 Extended template stream to define the “year”
Fig. 2.149 Values for 2014 and the new variable “year”
Fig. 2.150 Filter node is added behind the Variable File node
7. Here it is not necessary to exclude variables in the dataset for 2014. Nevertheless, as shown in Fig. 2.144, we can remove some of them from the dataset for 2013, because they do not match with the other ones from 2014 (Fig. 2.145). To enable us to exclude these variables from the subset 2013, we add a Filter node behind the Variable File node. We can find this node in the Field Ops tab of the Modeler. Figure 2.150 shows the actual status of the stream.
Fig. 2.151 Filter node parameters
8. Now we can exclude several variables from the result. Figure 2.151 shows the parameters of the Filter node.
9. Now we can append the modified datasets. We use an Append node from the Record Ops tab. Figure 2.152 shows the actual status of the stream.
10. In the Append node, we must state that we want to have all the records from both datasets in the result. Figure 2.153 shows the parameters of the node.
11. Finally, we want to scroll through the records using a Table node. We add a Table node at the end of the stream (see Fig. 2.154). "
The Append node can be used to combine two datasets row by row. It is absolutely vital to ensure that the objects represented in the datasets are
Fig. 2.152 Append node is added
unique. A Merge node with an inner join can help here. For details see Exercise 13 "Append vs. Merge Datasets".
We suggest using the option "Match fields by Name" of an Append node to append the datasets (see Fig. 2.153).
"
The option "Tag records by including source dataset in field" can be used to mark the records with the number of the dataset they come from.
Fig. 2.153 Parameters of the Append node
"
To be able to differentiate between both sets using user-defined values, e.g., years, we suggest using Derive nodes to add a new variable with a constant value instead. The disadvantage of the Append node is that it is more difficult to calculate measures such as the mean. To do this, we should use a Merge node, as shown in Sect. 2.7.9.
12. Running the Table node, we find the records partially shown in Fig. 2.155. We can find the variable with the year and the expected sample size of 2048. As explained in Fig. 2.142, for variables that are not present in both datasets, the rows are filled with $null$ values. We can see this in the last column of the table in Fig. 2.155.
Fig. 2.154 Final stream to Append datasets for UK payment data
Fig. 2.155 Table with the final records
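The append procedure, including the constant "year" variable, can also be sketched outside the Modeler. The following Python/pandas snippet is only an illustration; it assumes the two files can be read directly with pandas (reading the .xls file additionally requires an Excel engine such as xlrd).

```python
import pandas as pd

# Assumption: both source files are readable with pandas.
d2013 = pd.read_csv("england_payment_fulltime_2013.csv")
d2014 = pd.read_excel("england_payment_fulltime_2014_reduced.xls")

# Equivalent of the two Derive nodes: add a constant "year" column to each subset.
d2013["year"] = 2013
d2014["year"] = 2014

# Equivalent of the Append node with "Match fields by Name": stack the datasets
# row by row; columns missing in one subset are filled with NaN
# (the Modeler shows $null$ instead).
appended = pd.concat([d2013, d2014], ignore_index=True, sort=False)

print(len(appended))  # total number of records from both years
```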
2.7.11 Exercises
Exercise 1: Identify and Count Non-available Values
In the Microsoft Excel dataset "england_payment_fulltime_2014_reduced.xlsx", some weekly payments are missing. Please use the template stream "Template-Stream England payment 2014 reduced" to get access to the dataset. Extend the stream by using appropriate nodes to count the number of missing values.
Exercise 2: Comfortable Selection of Multiple Fields
In Sect. 2.7.2, we explained how to use the function "sum_n". The number of training days of each IT user was calculated by adding the number of training days they participated in during the last year and the number of additional training days they would like to have. To do so, the function "sum_n" needs a list of variables. In this case it was simply "([training_days_actual,training_days_to_add])". It can become complex to add more variables, however, because we have to select all the variable names in the expression builder. The predefined function "@FIELDS_MATCHING()" can help us here.
1. You can find a description of this procedure in the Modeler help files. Please explain how this function works.
2. Open the stream "simple_calculations". By adding a new Derive and Table node, calculate the sum of the training days in the last year and the days to add, by using the function "@FIELDS_MATCHING()".
Exercise 3: Counting Values in Multiple Fields
In the dataset "IT_user_satisfaction.sav", we can find many different variables (see also Sect. 12.1.24). A lot of them have the same coding that represents satisfaction levels with a specific aspect of the IT system. For this exercise, count the answers that signal a satisfaction level of "good" or "excellent" with one of the IT system characteristics.
1. Open the stream "Template-Stream IT_user_satisfaction".
2. Using the function "count_equal", count the number of people that answered "good" regarding the (1) start-time, (2) system_availability, and (3) performance. Show the result in a Table node.
3. Using the function "count_greater_than", determine the number of people that answered at least "good" regarding the (1) start-time, (2) system_availability, and (3) performance. Show the result in a Table node.
4. The variables "start-time", "system_availability", . . ., and "slimness" are coded on the same scale. Now calculate the number of people that answered "good" for all the aspects asked about. Use the function "@FIELDS_BETWEEN" to determine which variables to inspect.
5. Referring to the question above, we now want to count the number of "good" answers for all variables except "system_availability". Use the "Field Reorder" node to define a new sub-stream and count these values.
Exercise 4: Determining Non-reliable Values
In the file "england_payment_fulltime_2014_reduced.xls", we can find the median of the sum of the weekly payments in different UK regions. The detailed descriptions can be found in Sect. 12.1.14. Table 2.7 summarizes the variable names and their meaning. The source of the data is the UK Office for National Statistics and its website NOMIS UK (2014). The data represent the Annual Survey of Hours and Earnings (ASHE). We should note that the coefficient of variation for the median of the sum of the payments is described on the website NOMIS UK (2014) in the following statement:
Table 2.7 Variables of the dataset "england_payment_fulltime_2014_reduced"
admin_description          Represents the type of the region and the name, separated by ":". For the type of the region, see Table 2.3.
area_code                  A unique identifier for the area/region.
weekly_payment_gross       Median of the sum of weekly payments.
weekly_payment_gross_CV    Coefficient of variation of the value above.

Table 2.8 CV value and quality of estimate
5% or lower                Precise
over 5%, up to 10%         Reasonably precise
over 10%, up to 20%        Acceptable, but use with caution
over 20%                   Unreliable, figures suppressed
Quality of estimate
The quality of an estimate can be assessed by referring to its coefficient of variation (CV), which is shown next to the earnings estimate. The CV is the ratio of the standard error of an estimate to the estimate. Estimates with larger CVs will be less reliable than those with smaller CVs. In their published spreadsheets, ONS use the following CV values to give an indication of the quality of an estimate (see Table 2.8).
Therefore, we should pay attention to the records with a CV value above 10%. Please do the following:
1. Explain why we should use the median as a measure of central tendency for payments or salaries.
2. Open the stream with the name "Template-Stream England payment 2014 reduced".
3. Save the stream using another name.
4. Extend the stream to show the CV values in descending order in a new Table node.
5. In another Table node, show only the records whose coefficient of variation (CV) of the weekly payments is below (and not equal to) "reasonably precise". Determine the sample size.
6. In addition, add nodes to show in a table those records with a CV that indicates they are at least "reasonably precise". Determine the sample size.

Exercise 5: Extract Area Names
In the dataset "england_payment_fulltime_2014_reduced.xls", the weekly payments for different UK regions are listed. The region is described here by a field called "admin_description", as shown in Fig. 2.156. The aim of this exercise is to separate the names from the description of the region type.
1. Use the stream "Template-Stream England payment 2014 reduced".
Fig. 2.156 Records in the dataset “england_payment_fulltime_2014_reduced.xls”
2. Add appropriate nodes to extract the names of the regions using the variable "admin_description". The names should be represented by a new variable called "area_name_extracted".
3. Show the values of the new variable in a table.

Exercise 6: Distinguishing Between the Select and the Filter Node
We discussed the Select and the Filter node in the previous sections. Both can be used to reduce the data volume a model is based upon.
1. Explain the functionality of each of these nodes.
2. Finally, outline the difference between the two node types.

Exercise 7: Filtering Variables Using the Source Node
In Sect. 2.7.5, we created a stream based on the IT processor benchmark test dataset, to show how to use a Filter node. We found that we can exclude the variable "processor type" from the analytical process. If the variable is never used anywhere in the stream, we also have this option in the Source node, where the data come from in the first place. The stream "Correlation_processor_benchmark" should be modified. Open the stream and save it under another name. Then modify the Excel File node so that the variable "processor type" does not appear in the nodes that follow.

Exercise 8: Standardizing Values
The dataset "benchmark.xlsx" includes the test results of processors for personal computers published by c't Magazine for IT Technology (2008). As well as the names of the manufacturers Intel and AMD, the names of the processors can also be found. The processor speed was determined using the "Cinebench" benchmark test. Before doing multivariate analysis, it is helpful to identify outliers. This can be done by an appropriate standardization procedure, as explained in Sect. 2.7.6.
1. In Sect. 2.7.4, the stream "selecting_records" was used to select records that represent processor data for the manufacturer "Intel". Please now open the stream "selecting_records" and save it under another name.
Fig. 2.157 Generate the Select nodes using a Partition node
2. Make sure that the "Intel" processors are now selected, modifying the stream if necessary.
3. Add appropriate nodes to standardize the price (variable "EUR") and the Cinebench benchmark results (variable "CB").
4. Show the standardized values.
5. Determine the largest standardized benchmark result and the largest standardized price. Interpret these values in detail. Are these outliers?
6. Now analyze the data for the AMD processors. Can you identify outliers here?

Exercise 9: Dataset Partitioning
In the dataset "housing.data.txt", we find values regarding Boston housing neighborhoods. We want to use that file as a basis to learn more about the Partition node. Please do the following:
1. Open the template stream "008_Template-Stream_Boston_data".
2. Add a Partition node at the end of the stream.
3. Use the option "Train and Test" in the Partition node and set the training partition size at 70% and the test partition size at 30%.
4. Now generate two Select nodes automatically with the Partition node, as shown in Fig. 2.157.
Fig. 2.158 Final stream “dataset_partitioning”
5. Move the Select nodes to the end of the stream and connect them with the Partition node.
6. To be able to analyze the partitions, add two Data Audit nodes. For the final stream, see Fig. 2.158.
7. Now please run the Data Audit node behind the Select node for the training records TWICE. Compare the results for the variables and explain why they are different.
8. Open the dialog window of the Partition node once more. Activate the option "Repeatable partition assignment" (see Fig. 2.157).
9. Check the results from two runs of the Data Audit node once more. Are the results different? Try to explain why it could be useful in modeling processes to have a random but repeatable selection.

Exercise 10: England Payment Gender Difference
The payments for female and male employees in 2014 are included in the Excel files "england_payment_fulltime_female_2014" and "england_payment_fulltime_male_2014". The variables are described in Sect. 12.1.14. We focus here only on the weekly gross payments. By modifying the stream "merge_employee_data", calculate the differences in the medians between the payments for female and male employees.

Exercise 11: Merge Datasets
In Sect. 2.7.9, we discussed several ways to merge two datasets. Using small datasets, in this exercise we want to go into the details of the merge operations. Figure 2.159 depicts the "Template-Stream Employee_data". This stream gives us access to the datasets "employee_dataset_001.xls" and "employee_dataset_002.xls". The records of these datasets, with a sample size of three records each, are shown in Figs. 2.160 and 2.161.
Fig. 2.159 Nodes in the “Template-Stream Employee_data”
Fig. 2.160 Records of “employee_dataset_001.xlsx”
1. Now please open the stream "Template-Stream Employee_data".
2. Add a Merge node and connect it with the Source File nodes.
3. Also add a Table node to show the results produced by the Merge node. Figure 2.162 shows the final stream.
4. As shown in Fig. 2.163, Merge nodes in general offer four methods for merging data. By changing the type of merge operation (inner join, full outer join, etc.), produce a screenshot of the results. Explain in your own words the effect of the join type. See, e.g., Fig. 2.164.
Fig. 2.161 Records of “employee_dataset_002.xlsx”
Fig. 2.162 Final stream to check the effects of different merge operations
Fig. 2.163 Parameters of a Merge Node
Fig. 2.164 Result of a merge operation
Exercise 12: Append Datasets
In Sect. 2.7.10, we discussed how to add rows to a dataset by appending another one. Figure 2.165 shows the nodes in the "Template-Stream Employee_data_modified". The stream is based on the datasets "employee_dataset_001.xlsx" and "employee_dataset_003.xlsx" (not "...002"!). The records of these datasets, with a sample size of three records each, are shown in Figs. 2.166 and 2.167. In contrast to the dataset "employee_dataset_002.xls", the customer IDs in these datasets are unique. This is important, because otherwise we would end up with an inconsistent dataset.
Fig. 2.165 Nodes in the “Template-Stream Employee_data”
Fig. 2.166 Records of “employee_dataset_001.xlsx”
Fig. 2.167 Records of “employee_dataset_003.xlsx”
Fig. 2.168 Final stream to check the functionalities of the Append node
The aim of this exercise is to understand the details of the Append operation and the Append node.
1. Open the stream "Template-Stream Employee_data".
2. Add an Append node and connect it with the Excel File nodes.
3. Add a Table node to show the results produced by the Append node. For the final stream, see Fig. 2.168.
4. Modify the parameters of the Append node (see Fig. 2.169), using the options
   – "Match fields by ..." or
   – "Include fields from ..." as well as
   – "Tag records by including source dataset in field".
Find out what happens and describe the functionality of these options.

Exercise 13: Append Versus Merge Datasets
In Sect. 2.7.9, we discussed how to merge datasets, and in Sect. 2.7.10, we explained how to add rows to a dataset by appending another one. Figure 2.165 shows the nodes in the "Template-Stream Employee_data". This stream is based on the datasets "employee_dataset_001.xlsx" and "employee_dataset_002.xlsx". The records of these datasets, with a sample size of three records each, are shown in Figs. 2.170 and 2.171.
Fig. 2.169 Parameters of the Append node
The aim of this exercise is to understand the difference between the Merge and the Append node. Therefore, both nodes should be implemented into the stream. After that, the mismatch of results should be explained. You will also become aware of the challenges faced when using the Append node.
1. Open the stream "Template-Stream Employee_data".
2. Run both Table nodes and compare the data included in the datasets.
3. Add a Merge node and connect it with the Excel File nodes.
4. In the Merge node, use the full outer join operation to get the maximum number of records in the result.
5. Also add an Append node and connect it with the Excel File nodes.
6. In the Append node, enable the option "Include fields from ... All datasets".
7. Behind the Merge and the Append node, add a Table node to show the results produced. Figure 2.172 shows the final stream.
Fig. 2.170 Records of “employee_dataset_001.xlsx”
Fig. 2.171 Records of “employee_dataset_002.xlsx”
8. Now run each of the Table nodes and outline the differences between the Merge and the Append node results.
9. Describe the problems you find with the results of the Append operation. Try to make a suggestion as to how we could become aware of such problems before appending two datasets.
A small illustration of the two operations, outside the Modeler, is sketched below.
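The following lines are not part of the Modeler; they are only a rough pandas sketch, with made-up column names, of what the Merge node (combining columns by a key) and the Append node (stacking rows) do. They may help when interpreting the two Table node outputs.

import pandas as pd

# two small, hypothetical employee tables sharing the key column "ID"
df1 = pd.DataFrame({"ID": [1, 2, 3], "name": ["A", "B", "C"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "salary": [4200, 3900, 4500]})

# Merge node logic: combine columns by matching key values
inner = pd.merge(df1, df2, on="ID", how="inner")  # only IDs present in both datasets
outer = pd.merge(df1, df2, on="ID", how="outer")  # all IDs; missing cells become NaN ($null$)

# Append node logic: stack the rows; columns missing in one source become NaN
appended = pd.concat([df1, df2], ignore_index=True)

print(inner, outer, appended, sep="\n\n")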
Fig. 2.172 Final stream “append_vs_merge_employee_data”
2.7.12 Solutions

Exercise 1: Identify and Count Non-available Values
Name of the solution stream: identifing_non-available_values
Theory discussed in section: Section 2.7.2
Figure 2.173 shows one possible solution.
1. We extended the template stream by first adding a Data Audit node from the Output tab and then connecting that node with the Source node. We do this to be able to analyze the original dataset. Figure 2.174 shows that the number of values in the row "weekly_payment_gross" is 1021. In comparison, the number of values for "admin_description" is 1024. So we can guess that there are three missing values. The aim of this exercise, however, is to count the missing values themselves.
2. Behind the Type node, we add a Derive node from the Field Ops tab of the Modeler. The formula used here is "@NULL(weekly_payment_gross)".
3. In the Table node, we can see the results of the calculation (see the last column in Fig. 2.175).
Fig. 2.173 Stream “identifing_non-available_values”
Fig. 2.174 Data Audit node analysis for the original dataset
Fig. 2.175 Table node with the new variable “null_flag”
Fig. 2.176 Frequency distribution of the variable “null_flag”
4. To calculate the frequencies, we use a Distribution node. Here we find that there are indeed three missing values (Fig. 2.176).
The same count can be reproduced outside the Modeler, as sketched below.
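The following lines are only a hedged sketch in pandas, using the column name from Table 2.7; they mimic the "@NULL" flag and the Distribution node count.

import pandas as pd

df = pd.read_excel("england_payment_fulltime_2014_reduced.xlsx")  # needs an Excel reader such as openpyxl
# flag the rows where the weekly payment is missing (the Modeler's @NULL test)
df["null_flag"] = df["weekly_payment_gross"].isna()
# total number of missing values; according to Fig. 2.176 this should be 3
print(df["null_flag"].sum())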
Exercise 2: Comfortable Selection of Multiple Fields
Name of the solution stream: simple_calculations_extended
Theory discussed in section: Section 2.7.2
1. The function "@FIELDS_MATCHING(pattern)" selects the variables that match the condition given as the pattern. If we use "training_days_*", the function gives us a list containing the variables "training_days_actual" and "training_days_to_add".
2. Figure 2.177 shows the final stream "simple_calculations_extended". Here we added a third sub-stream. The formula in the Derive node is "sum_n(@FIELDS_MATCHING('training_days_*'))". We can find the results in the Table node, as shown in Fig. 2.178.
Fig. 2.177 Final stream “simple_calculations_extended”
Fig. 2.178 Results of the calculation
Exercise 3: Counting Values in Multiple Fields
Name of the solution stream: counting_values_multiple_fields
Theory discussed in section: Section 2.7.2
1. Figure 2.179 shows the initial stream "Template-Stream IT_user_satisfaction".
2. Figure 2.180 shows the final stream. First we want to explain the sub-stream counting the number of people that answered "5 = good" for satisfaction with the start-time, the system_availability, and the performance.
Fig. 2.179 Template-Stream IT_user_satisfaction
Fig. 2.180 Stream “counting_values_multiple_fields”
We use a Filter node to reduce the number of variables available in the first and second sub-stream. Here we disabled all the variables except “start-time”, “system_availability”, and “performance”, as shown in Fig. 2.181. This is not necessary but it helps us to have a better overview, both in the expression builder of the Derive node and when displaying the results in the Table node. In the Derive node, we use the function count_equal(5,[starttime, system_availability, performance])
as shown also in Fig. 2.182. Figure 2.183 shows the results. 3. The second sub-stream is also connected to the Filter node. That’s because we also want to analyze the three variables mentioned above. The formula for the Derive node is (Fig. 2.184): count_greater_than(3,[starttime,system_availability, performance])
To count the number of answers that represent a satisfaction of at least "5 = good", we used the function "count_greater_than" with the first parameter "3". Figure 2.185 shows the result.
Fig. 2.181 Enabled/disabled variables in the Filter node
Fig. 2.182 Parameters of the first Derive node
Fig. 2.183 Results of the first calculation
Fig. 2.184 Parameters of the second Derive node
Fig. 2.185 Results of the second calculation
The variables "start-time", "system_availability", ..., "slimness" should be analyzed here. We are interested in the number of values that are 5 or 7. To get a list of the variable names, we can use the function "@FIELDS_BETWEEN". We have to make sure that all the variables between "start-time" and "slimness" have the same coding. The formula
count_greater_than(3,@FIELDS_BETWEEN(start-time, slimness))
can be found in the Derive node in Fig. 2.186. Figure 2.187 shows the results. 4. The function “@FIELDS_BETWEEN(start-time, slimness)” produces a list of all the variables between “start-time” and “slimness”. If we want to exclude a variable we can filter or reorder the variables. Here we want to show how to use the Field Reorder node.
Fig. 2.186 Parameters of the third Derive node
Fig. 2.187 Results of the third calculation
If we add a Field Reorder node to the stream and double-click on it, the list of the fields is empty (see Fig. 2.188). To add the field names, we click on the button to the right of the dialog window that is marked with an arrow in Fig. 2.188. Now we can add all the variables (see Fig. 2.189). After adding the variable names, we should reorder the variables. To exclude “system_availability” from the analysis, we make it the first variable in the list. To do so, we select the variable name by clicking on it once and then we use the reorder buttons to the right of the dialog window (see Fig. 2.190). In the Derive node, we only have to modify the name of the variable that is calculated. We use “count_5_and_7_for_all_except_system_availabilty”. The formula is the same as in the third sub-stream. As shown in Fig. 2.191, it is: count_greater_than(3,@FIELDS_BETWEEN(start-time, slimness))
Fig. 2.188 Field selection in a Field Reorder node
Fig. 2.189 Adding all the variables in a Field Reorder node
Fig. 2.190 Reorder variables in a Field Reorder node
Running the last Table node, we get slightly different results than in the third sub-stream. Comparing Figs. 2.187 and 2.192, we can see that the number of answers with code 5 or 7 is sometimes smaller. This is because we excluded the variable “system_availability” from the calculation, by moving it to first place and starting with the second variable “start-time”.
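For readers who want to reproduce this kind of counting outside the Modeler, the following pandas lines are a hedged sketch of "count_equal" and "count_greater_than" applied to several fields at once. The column names are taken from the text, and we assume the satisfaction codes are read as plain numbers.

import pandas as pd

df = pd.read_spss("IT_user_satisfaction.sav")  # requires the pyreadstat package
cols = ["start-time", "system_availability", "performance"]

# count_equal(5, [...]): per respondent, how many of the listed answers equal 5 ("good")
df["count_good"] = df[cols].eq(5).sum(axis=1)

# count_greater_than(3, [...]): how many answers are coded above 3, i.e. 5 or 7
df["count_at_least_good"] = df[cols].gt(3).sum(axis=1)

print(df[["count_good", "count_at_least_good"]].head())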
Fig. 2.191 Parameters of the fourth Derive node
Fig. 2.192 Results of the third calculation
Exercise 4: Determining Non-reliable Values
Name of the solution stream: identifing_non-reliable_values
Theory discussed in section: Section 2.7.4
1. The median should be used as a measure of central tendency for payments or salaries because it is less sensitive to outliers. The mean would be a questionable measure because it is very sensitive to outliers.
2. Figure 2.193 shows the initial template stream.
3. We save the stream with the name "identifing_non-reliable_values".
4. To sort the values, we can use a Sort node from the Record Ops tab. Figure 2.194 shows the parameters of the node. Additionally, we add a Table node behind the Sort node to show the results (see Fig. 2.195).
5. To extract only the records whose coefficient of variation (CV) of the weekly payments is below (and not equal to) "reasonably precise", we can use a Select node from the Record Ops tab with the parameters shown in Fig. 2.196, and a Table node to show the results. To determine the sample size, we run the Table node. Figure 2.197 shows a sample size of 983 records.
6. If we want to extract the records with a CV that is at least "reasonably precise", we can use another Select node with the condition "weekly_payment_gross_CV <= 10".
The same sorting and selection logic is sketched below in a few pandas lines.
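As a hedged pandas sketch only (column names as in Table 2.7, and assuming the CV is stored as a percentage), the Sort and Select logic of this solution corresponds to:

import pandas as pd

df = pd.read_excel("england_payment_fulltime_2014_reduced.xlsx")

# Sort node: CV values in descending order
df_sorted = df.sort_values("weekly_payment_gross_CV", ascending=False)

# Select nodes: split by the 10% threshold from Table 2.8
below_reasonably_precise = df[df["weekly_payment_gross_CV"] > 10]      # use with caution / unreliable
at_least_reasonably_precise = df[df["weekly_payment_gross_CV"] <= 10]
print(len(below_reasonably_precise), len(at_least_reasonably_precise))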
Fig. 5.110 Overview of the variable selection process
ANOVA(a)
Model              Sum of Squares   df    Mean Square   F         Sig.
1   Regression     31637,511        13    2433,655      108,077   ,000b
    Residual       11078,785        492   22,518
    Total          42716,295        505
2   Regression     31637,449        12    2636,454      117,320   ,000c
    Residual       11078,846        493   22,472
    Total          42716,295        505
3   Regression     31634,931        11    2875,903      128,206   ,000d
    Residual       11081,364        494   22,432
    Total          42716,295        505
a. Dependent Variable: MEDV
b. Predictors: (Constant), LSTAT, CHAS, B, PTRATIO, ZN, CRIM, RM, INDUS, AGE, RAD, DIS, NOX, TAX
c. Predictors: (Constant), LSTAT, CHAS, B, PTRATIO, ZN, CRIM, RM, INDUS, RAD, DIS, NOX, TAX
d. Predictors: (Constant), LSTAT, CHAS, B, PTRATIO, ZN, CRIM, RM, RAD, DIS, NOX, TAX
Fig. 5.111 Goodness of fit measures for the models considered during the variables selection process
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       ,861a   ,741       ,734                4,745298
2       ,861b   ,741       ,734                4,740496
3       ,861c   ,741       ,735                4,736234
a. Predictors: (Constant), LSTAT, CHAS, B, PTRATIO, ZN, CRIM, RM, INDUS, AGE, RAD, DIS, NOX, TAX
b. Predictors: (Constant), LSTAT, CHAS, B, PTRATIO, ZN, CRIM, RM, INDUS, RAD, DIS, NOX, TAX
c. Predictors: (Constant), LSTAT, CHAS, B, PTRATIO, ZN, CRIM, RM, RAD, DIS, NOX, TAX
Fig. 5.112 Table with goodness of fit values, including R2 and adjusted R2, for each of the model selection steps
Coefficients(a)
                 Unstandardized Coefficients   Standardized Coefficients
Model            B          Std. Error         Beta       t          Sig.
1   (Constant)   36,459     5,103                         7,144      ,000
    CRIM         –,108      ,033               –,101      –3,287     ,001
    ZN           ,046       ,014               ,118       3,382      ,001
    INDUS        ,021       ,061               ,015       ,334       ,738
    CHAS         2,687      ,862               ,074       3,118      ,002
    NOX          –17,767    3,820              –,224      –4,651     ,000
    RM           3,810      ,418               ,291       9,116      ,000
    AGE          ,001       ,013               ,002       ,052       ,958
    DIS          –1,476     ,199               –,338      –7,398     ,000
    RAD          ,306       ,066               ,290       4,613      ,000
    TAX          –,012      ,004               –,226      –3,280     ,001
    PTRATIO      –,953      ,131               –,224      –7,283     ,000
    B            ,009       ,003               ,092       3,467      ,001
    LSTAT        –,525      ,051               –,407      –10,347    ,000
2   (Constant)   36,437     5,080                         7,172      ,000
    CRIM         –,108      ,033               –,101      –3,290     ,001
    ZN           ,046       ,014               ,117       3,404      ,001
Fig. 5.113 First part of the coefficients table
Fig. 5.114 The MLR stream with cross-validation and the Regression node
In the Expert tab of the Regression node, further statistical outputs can be chosen, which are then displayed here in the Advanced tab (see Fig. 5.123 for possible additional output statistics).

Solution of the Optional Subtasks
This stream is also included in the "multiple_linear_regression_REGRESSION_NODE" file and looks like Fig. 5.114.
6. We include the Partition node in the stream before the Type node and divide the dataset into training data (70%) and a test dataset (30%), as described in Sect. 2.7.7.
7. In the Fields tab, we select the partition field, generated in the previous step, as the Partition. This results in the model being built using only the training data and validated on the test data (see Fig. 5.115).
8. After adding the Analysis node to the model nugget, we run the stream. The window in Fig. 5.116 pops up. There, we see the evaluation statistics for the training data and the test data separately. Hence, the two datasets are evaluated independently of each other.
Fig. 5.115 Selection of the partition field in the Regression node
The mean error is near 0, and the standard deviation (RMSE) is, as usual, a bit higher for the test data. The difference between the two standard deviations is not optimal, but still okay. Thus, the model can be used to predict the MEDV of unknown data (see Sect. 5.1.2 for cross-validation in this chapter).
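To make the 70/30 split and the comparison of the error statistics more tangible, the following scikit-learn lines are a hedged sketch of the same idea. The column order of "housing.data.txt" is assumed to be the commonly published one and is not taken from the stream itself.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("housing.data.txt", sep=r"\s+", header=None)
df.columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
              "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]  # assumed order
X, y = df.drop(columns="MEDV"), df["MEDV"]

# 70% training / 30% test with a fixed seed (a repeatable partition assignment)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_tr, y_tr)

for name, Xs, ys in [("training", X_tr, y_tr), ("test", X_te, y_te)]:
    rmse = mean_squared_error(ys, model.predict(Xs)) ** 0.5
    print(name, round(rmse, 3))  # the RMSE should not differ much between the two sets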
Fig. 5.116 Output of the Analysis node. Separate evaluation of the training data and testing data
Exercise 5: Polynomial Regression with Cross-Validation
Name of the solution stream: polynomial_regression_mtcars
Theory discussed in section: Sect. 5.3 and Sect. 5.1.2
Figure 5.117 shows the final stream of this exercise.
1. We import the mtcars data with the Var. File node. To inspect the relationship between the variables "mpg" and "hp", we connect a Plot node to the Source node and plot the two variables in a scatterplot. The graph is displayed in Fig. 5.118. In the graph, a slight curve can be observed, and so we suppose that a nonlinear relationship is more reasonable. Hereafter, we will therefore build two regression models, of degree 1 and degree 2, and compare them with each other. To perform a polynomial regression of degree 2, we need to calculate the square of the "hp" variable. For that purpose, we add a Derive node to the stream and define a new variable "hp2", which contains the squared values of "hp" (see Fig. 5.119 for the settings of the Derive node).
Fig. 5.117 Stream of the polynomial regression of the mtcars data via the Regression node
Fig. 5.118 Scatterplot of the variables “hp” and “mpg”
Fig. 5.119 Definition of the squared “hp” variable in the Derive node
2. To perform cross-validation, we partition the data into training (60%), validation (20%), and test (20%) sets via the Partition node (see Sect. 2.7.7 for details on the Partition node). Afterward, we add a Type node to the stream and perform another scatterplot of the variables "hp" and "mpg", but this time we color and shape the dots depending on their partition (see Fig. 5.120). What we can see once again is that the model strongly depends on the explicit selection of the training, validation, and test data. In order to get a robust model, a bagging procedure should also be used.
Fig. 5.120 Scatterplot of “mpg” and “hp”, shaped depending on how they belong to the dataset
3. Now, we add two Regression nodes to the stream and connect each of them with the Type node. One node is used to build a standard linear regression, while the other fits a second-degree polynomial function to the data. Here, we only present the settings for the second model; the settings of the first model are analogous. For that purpose, we define "mpg" as the target variable and "hp" and "hp2" as the input variables in the Fields tab of the Regression node. The Partition field is once again set as Partition (see Fig. 5.121). In the Model tab, we now enable "Use partitioned data" to include cross-validation in the process and set the variable selection method to "Enter". This guarantees that both input variables will be considered in the model (see Fig. 5.122). To calculate the Akaike information criterion (AIC; see Fahrmeir (2013)), which should be compared between the models in part 4 of this exercise, we open
Fig. 5.121 Definition of the variable included in the polynomial regression
the Expert tab and switch from the Simple to the Expert options. Then, we click on Output and mark the Selection criteria. This ensures the calculation of further variable selection criteria, including the AIC (see Fig. 5.123). After setting up the options for the linear regression too, we run the stream and the model nuggets appear.
4. When comparing the R² and AIC values of the models, which can be viewed in the Advanced tab of the model nuggets, we see that the R² of the degree 2 model is higher than the R² of the degree 1 model. Furthermore, the inverse holds true for the AIC values of the models (see Figs. 5.124 and 5.125). Both
Fig. 5.122 Using partitioning in the Regression node
statistics indicate that the polynomial model, comprising a degree 2 term, fits the data better. To perform a cross-validation and evaluate the models on the validation and test sets, we connect an Analysis node to each of the model nuggets, and then we run the stream to get the error statistics. These are displayed in Figs. 5.126 and 5.127 for the model with only a linear term and the model with a quadratic term, respectively. We see that the RMSE, which is just the standard deviation (see Sect. 5.1.2), is lower in the degree 2 model than in the linear model for all three sets: training, validation, and test. Hence, cross-validation suggests the inclusion of a quadratic term in the regression equation. The large difference between the RMSE of the validation set (4.651) and that of the other two sets (2.755 and 1.403) results from the very low number of observations (occurrences) in the validation set. Moreover, the test set has a mean error close to 0 and a low standard deviation for the degree 2 model. This supports the universality of the model, which is thus capable of predicting the "mpg" of new data.
Fig. 5.123 Enabling of the selection criteria, which include the AIC
Fig. 5.124 Goodness of fit statistics and variable selection criteria for the degree 1 model
Fig. 5.125 Goodness of fit statistics and variable selection criteria for the degree 2 model
Fig. 5.126 Error statistics in the degree 1 model
Fig. 5.127 Error statistics in the degree 2 model
5. After training several models, the validation set is used to compare these models with each other. The model that best predicts the validation data is then the best candidate for describing the data. Thus, the validation dataset is part of the model building and selection procedure. It is possible that the validation set somehow favors a specific model, for example because of an unlucky partition sampling. Hence, due to this potential for bias, another evaluation of the final model should be done on an independent dataset, the test set, to confirm the universality of the model. The final testing step is therefore an important part of cross-validation and of finding the most appropriate model. A compact sketch of the degree comparison outside the Modeler follows below.
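The following statsmodels lines are only a hedged illustration of the comparison made in this exercise: a degree 1 and a degree 2 fit of "mpg" on "hp", judged by R² and AIC. The file name "mtcars.csv" and its column names are assumptions, not taken from the stream.

import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("mtcars.csv")      # assumed to contain the columns "mpg" and "hp"
cars["hp2"] = cars["hp"] ** 2         # the squared term created by the Derive node

deg1 = smf.ols("mpg ~ hp", data=cars).fit()
deg2 = smf.ols("mpg ~ hp + hp2", data=cars).fit()

# the better model has the higher R-squared and the lower AIC
print("degree 1:", round(deg1.rsquared, 3), round(deg1.aic, 1))
print("degree 2:", round(deg2.rsquared, 3), round(deg2.aic, 1))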
Exercise 6: Boosted Regression
Name of the solution stream: Boosting_regression_mtcars
Theory discussed in section: Sect. 5.1.2 and Sect. 5.3.6
Figure 5.128 shows the final stream of this exercise.
1. We import the mtcars data with the Var. File node and add a Partition node to split the dataset into a training (70%) and a testing (30%) set. For a description of the Partition node and the splitting process, we refer to Sects. 2.7.7 and 5.3.5. After adding the typical Type node, we add a Select node to restrict the modeling to the training set. The Select node can be generated in the Partition node; for details see Sect. 5.3.5 and Fig. 5.75.
2. To build the boosting model, we add a Linear node to the stream and connect it with the Select node. Afterward, we open it and select "mpg" as the target and "disp" and "wt" as input variables in the Fields tab (see Fig. 5.129). Next, we choose the boosting option in the Building Options tab to ensure a boosting model building process (see Fig. 5.130). In the Ensemble options, we define 100 as the maximum number of components in the final ensemble model (see Fig. 5.131).
Fig. 5.128 Stream of the boosting regression of the mtcars data with the Linear node
Fig. 5.129 Definition of the target and input variables
Now, we run the stream and the model nugget appears.
3. To inspect the quality of the ensemble model, we open the model nugget and notice that the boosting model is more precise (R² = 0.734) than the reference model (R² = 0.715) (see Fig. 5.132). Thus, boosting improves the fit of the model to the data, and this ensemble model is chosen for the prediction of the "mpg" (see arrow in Fig. 5.132). In the Ensemble accuracy view, we can retrace the accuracy progress of the modeling process. Here, we see that only 26 component models are trained, since further improvement by additional models wasn't possible, and thus the modeling process stopped here (Fig. 5.133). A generic, hedged sketch of such a boosting procedure outside the Modeler is given below.
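The Modeler builds its boosting ensemble internally in the Linear node. As a rough analogue only (not the Modeler's own algorithm), a gradient boosting regressor in scikit-learn also adds component models up to a maximum and can stop early when no further improvement is achieved:

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

cars = pd.read_csv("mtcars.csv")   # assumed to contain "mpg", "disp", and "wt"
X_tr, X_te, y_tr, y_te = train_test_split(cars[["disp", "wt"]], cars["mpg"],
                                           test_size=0.3, random_state=1)

# at most 100 component models; stop early when the internal validation score stalls
boost = GradientBoostingRegressor(n_estimators=100, n_iter_no_change=5,
                                  validation_fraction=0.2, random_state=1)
boost.fit(X_tr, y_tr)
print(boost.n_estimators_, boost.score(X_te, y_te))  # components actually used, test R²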
Fig. 5.130 Definition of the boosting modeling technique
For cross-validation and the inspection of the RMSE, we now add an Analysis node to the stream and connect it with the model nugget. As a trick, we further change the data connection from the Select node to the Type node (see Fig. 5.134). This ensures that both the training and the testing partitions are passed to the model simultaneously, so the predictions for both can be shown in one Analysis node. After running the stream again, the output of the Analysis node appears (see Fig. 5.135). We can see that the RMSE, which is the standard deviation value (see Sect. 5.1.2), does not differ much between the training and the testing data, indicating a robust model that is able to predict the "mpg" of unseen data.
Fig. 5.131 Definition of the maximum components for the boosting model
Fig. 5.132 Quality of the boosting and reference model
[Chart "Ensemble Accuracy": cumulative accuracy (0–100%) plotted against the 26 component models]
Fig. 5.133 Accuracy progress while building the boosting regression
Fig. 5.134 Change of the data connection from the Select node to the Type node
Fig. 5.135 Statistics of the Analysis node
5.4 Generalized Linear (Mixed) Model
Generalized Linear (Mixed) Models (GLM and GLMM) are essentially a generalization of the linear and multiple linear models described in the two previous sections. They are able to process more complex data, such as dependencies between the predictor variables and a nonlinear connection to the target variable. A further advantage is the possibility of modeling data with non-Gaussian distributed target variables. In many applications, a normal distribution is inappropriate and even incorrect to assume. An example would be when the response variable is expected to take only positive values that vary over a wide range. In this case, a constant change of the input leads to a large change in the prediction, which can even become negative if a normal distribution is assumed. In this particular example, assuming an exponentially distributed variable would be more appropriate. Besides these fixed effects, the Mixed Models also add additional random effects that describe the individual behavior of data subsets. In particular, this helps us to model the data on a more individual level. This flexibility and the possibility of handling more complex data come at a price: the loss of automatic procedures and selection algorithms. There are no black boxes that help to find the optimal variable dependencies, the link function to the target variable, or its distribution. YOU HAVE TO KNOW YOUR DATA! This makes utilization of GLM and GLMM very tricky, and they should be used with care and only by experienced data analysts. Incorrect use can lead to false conclusions and thus false predictions. The Modeler provides two nodes for building these kinds of models, the GenLin node and the GLMM node. With the GenLin node, we can only fit a GLM to the data, whereas the GLMM node can model Linear Mixed Models as well. After a short explanation of the theoretical background, the GLMM node is described here in more detail. It comprises many of the same options as the GenLin node, but the latter has a wider range of options for the GLM, e.g., a larger variety of link functions, and is more elegant in some situations. A detailed description of the GenLin node is omitted here, but we recommend completing Exercise 2 in the later Sect. 5.4.5, where we explore the GenLin node while fitting a GLM. Besides these two nodes, the SPSS Modeler 18.2 provides us with a third node to build generalized linear models, the GLE node, which is able to run a GLM on the SPSS Analytics Server (IBM (2019c)). The settings of this node are similar to those of the GenLin and GLMM nodes, which is why we omit an explanation of this node here and refer the interested reader to IBM (2019b) for further information.
5.4.1 Theory
The GLM and GLMM can process data with continuous or discrete target variables. Since GLMs and Linear Mixed Models are two separate classes of model, with the GLMM combining both, we will explain the theoretical background of these two concepts in two short sections.
Table 5.4 List of typically used link functions
Name               Function         Typical use
Identity           x                Linear response data, e.g., multiple linear regression
Negative inverse   -1/x             Used for exponential response data, e.g., exponential or gamma distributed target variable
Log                log(x)           Useful if the target variable follows a Poisson distribution
Logit              log(x/(1-x))     Used for categorical target variables, e.g., in logistic regression (see Sect. 8.3)
Generalized Linear Model
You may recall the setting for the MLR. In it we had data records (x_i1, ..., x_ip), each consisting of p input variables, and for each record an observation y_i, but instead of a normal distribution, y_1, ..., y_n can be assumed to follow a different law, such as a Poisson, Gamma, or Exponential distribution. Furthermore, the input variables fulfill an additive relationship, as in the MLR,

h(x_i1, ..., x_ip) = β_0 + β_1·x_i1 + ... + β_p·x_ip.

The difference from an ordinary MLR is that this linear term is linked to the conditional mean of the target variable via a so-called link function g, i.e.,

E(target variable | x_i1, ..., x_ip) = g⁻¹(β_0 + β_1·x_i1 + ... + β_p·x_ip).

This provides a connection between the target distribution and the linear predictor. There are no restrictions concerning the link function, but its domain should coincide with the support of the target distribution, in order to be practical. There are, however, a couple of standard functions which have proved to be adequate (see Table 5.4). For example, if the target variable is binary and g is the logistic function, the GLM is a logistic regression, which is discussed in Sect. 8.3. We refer to Fahrmeir (2013) for a more detailed theoretical introduction to the GLM.

Linear Mixed Model
Understanding Linear Mixed Models mathematically is quite challenging, and so here we just give a heuristic introduction, so that the idea of adding random effects is understood, and we refer to Fahrmeir (2013) for a detailed description of Mixed Models. We start in the same situation and have the same samples as with MLR: a record of input variables and an observed target variable value. With Linear Mixed Models, though, the samples form clusters, where the data structure is slightly different. Think, for example, of a medical study with patients receiving different doses of a new medication. Patients with different medical treatments probably have a different healing process. Furthermore, patients of different ages, and other factors, react differently to the medication.
The Linear Mixed Models allow for these individual and diverse behaviors within the cluster. The model consists of two different kinds of coefficients. First, as in a typical regression, there are effects that describe the data or population in total, when we think of the medical study example. These are called fixed effects. The difference from ordinary MLR is an additional linear term of random effects that are cluster or group specific. These latter effects now model the individual behaviors within the cluster, such as different responses to the medical treatment in different age groups. They are assumed to be independent and normally distributed (see Fahrmeir (2013) for more information on the Linear Mixed Model). The GLMM is then formed through a combination of the GLM with the Linear Mixed Model.

Estimation of the Coefficients and the Goodness of Fit
The common method for estimating the coefficients that appear in these models is the "maximum likelihood approach" (see Fahrmeir (2013) for a description of this estimation). We should mention that other methods exist for estimating coefficients, such as the Bayesian approach. For an example, see once again Fahrmeir (2013). A popular statistical measure for comparing and evaluating models is the "Akaike Information Criterion (AICC)" (see Table 5.3). As in the other regression models, the mean squared error is also a commonly used value for determining the goodness of fit of a model. This parameter is consulted during cross-validation in particular. Another criterion is the "Bayesian information criterion (BIC)", see Schwarz (1978), which is also calculated in the Modeler.
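To make the role of the link function more concrete, the following lines are a hedged, self-contained sketch (not related to the Modeler or to any dataset used in this book) that fits a GLM with a Poisson-distributed target and the log link from Table 5.4 by maximum likelihood, using the statsmodels package:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=200)
y = rng.poisson(lam=np.exp(0.5 + 1.2 * x))        # true model: Poisson response with log link

X = sm.add_constant(x)                            # linear predictor: beta_0 + beta_1 * x
glm = sm.GLM(y, X, family=sm.families.Poisson())  # the log link is the Poisson default
res = glm.fit()                                   # maximum likelihood estimation
print(res.params, res.aic)                        # estimated coefficients and AIC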
5.4.2 Building a Model with the GLMM Node
Here, we set up a stream for a GLMM with the GLMM node using the "test_scores" dataset. It comprises data from schools that used traditional and new/experimental teaching methods. Furthermore, the dataset includes the pretest and the posttest results of the students, as well as variables that describe the students' characteristics. The learning quality not only depends on the method but also, for example, on the teacher, that is, how he or she delivers the educational content to the pupils, and on the class itself. If the atmosphere is noisy and restless, the class has a learning disadvantage. These considerations justify a mixed model and the use of the GLMM node to predict the posttest score.

Description of the model
Stream name: Generalized linear mixed model
Based on dataset: test_scores.sav (see Sect. 12.1.36)
Important additional remarks: The GLMM requires knowledge of the data, the variable dependencies, and the distribution of the target variable. Therefore, a descriptive analysis is recommended before setting the model parameters.
Related exercises: 1
1. To start the process, we use the template stream "001 Template-Stream test_scores" (see Fig. 5.136). We open it and save it under a different name.
2. To get an intuition of the distribution of the target variable, we add a Histogram node to the Source node and visualize the empirical distribution (see Sect. 3.2.3 for a description of the Histogram node). In Fig. 5.137, we see the histogram of the target values with the curve of a normal distribution. This indicates a Gaussian distribution of the target variable.
3. We add the GLMM node from the SPSS Modeler toolbar to the stream, connect it to the Type node, and open it to define the data structure and model parameters.
Fig. 5.136 Template stream "test_scores"
Fig. 5.137 Histogram of the target variable with a normal distribution
4. In the Data Structure tab, we can specify the dependencies of the data records by dragging the particular variables from the left list and dropping them onto the subject field canvas on the right. First, we drag and drop the "school" variable, then the "classroom" variable, and lastly the "student_id" variable, to indicate that students in the same classroom are correlated in the outcome of the test scores (see Fig. 5.138).
5. The random effects, induced by the defined data structure, can be viewed in the Fields and Effects tab. See Fig. 5.139 for the random effects included in our stream for the test score data, which are "school" and the factor "school*classroom". These indicate that the performance of the students depends on the school as well as on the classroom, and the students are therefore clustered. We can manually add more random effects by clicking on the "Add Block..." button at the bottom (see arrow in Fig. 5.139). See the infobox below and IBM (2019b, pp. 198–199) for details on how to add further effects.
6. In the "Target" options, the target variable, its distribution, and its relationship to the linear predictor term can be specified. Here, we choose the variable "posttest" as our target. Since we assume a Gaussian distribution and a linear relationship to
Fig. 5.138 Definition of the clusters for the random effects in the GLMM node
Fig. 5.139 Random effects in the GLMM node
Fig. 5.140 Definition of the target variable, its distribution, and the link function
the input variables, we choose the "Linear model" setting (see arrow in Fig. 5.140). We can choose other models predefined by the Modeler for continuous, as well as categorical, target variables [see the enumeration in Fig. 5.140 or IBM (2019b, pp. 195–196)]. We can further use the "Custom" option to manually define the distribution and link function if none of the above models is appropriate for our data. A list of the distributions and link functions can be found in IBM (2019b, pp. 196–197).
7. In the "Fixed Effects" options, we can specify the variables with a deterministic influence on the target variable, which should be included in the model. This can be done by dragging and dropping: we select the particular variable in the left list and drag it to the "Effect builder canvas". There we can select multiple variables at once, and we can also define new factor variables to be included in the model. The latter can be used to describe dependencies between single input variables. The options and the differences between the columns in the "Effect builder canvas" are described in the infobox below.
Fig. 5.141 Definition of the fixed effects
In our example, we want to include the "pretest", "teaching_method", and "lunch" variables as single predictors. Therefore, we select these three variables in the left list and drag them into the "Main" column on the right. Then we select the "school_type" and "pretest" variables and drag them into the "2-way" column. This will bring the new variable "school_type*pretest" into the model, since we assume a dependency between the type of school and the pretest results. To include the intercept in the model, we check the particular field (see left arrow in Fig. 5.141).
There are four different possible drop options in the "Effect builder canvas": Main, *, 2-way, and 3-way. If we drop the selected variables, for example A, B, and C, into the Main effects column, each variable is added individually to the model. If, on the other hand, the 2-way or 3-way option is chosen, all possible
variable combinations of 2 or 3 are inserted. In the case of the 2-way interaction, this means, for the example here, that the terms A*B, A*C, and B*C are added to the GLMM. The * column adds a single term to the model, which is a multiplication of all the selected variables. In addition to these given options, we can specify our own nested terms and add them to the model with the "Add custom term" button on the right [see the right arrow in Fig. 5.141; for further information see IBM (2019b)].
8. In the "Build Options" and "Model Options", we can specify further advanced model criteria that comprise convergence settings of the involved algorithms. We refer the interested reader to IBM (2019b).
9. Finally, we run the stream and the model nugget appears.
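The GLMM node itself has no exact open-source counterpart, but the idea of combining fixed effects with a school-specific random intercept can be sketched, as a hedged illustration only, with a linear mixed model in statsmodels. The column names are taken from the description of the test_scores data; the nested classroom effect is omitted for brevity.

import pandas as pd
import statsmodels.formula.api as smf

scores = pd.read_spss("test_scores.sav")  # requires the pyreadstat package

# fixed effects: pretest and teaching method; random intercept for each school
mixed = smf.mixedlm("posttest ~ pretest + C(teaching_method)",
                    data=scores, groups=scores["school"])
res = mixed.fit()
print(res.summary())  # fixed coefficients plus the estimated between-school variance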
5.4.3 The Model Nugget
Some of the tables and graphs in the model nugget of the GLMM are similar to those of the MLR, for example the Predicted by Observed, Fixed Effects, and Fixed Coefficients views. For details, see Sect. 5.3.3, where the model nugget of the MLR is described. It should not be surprising that these views equal the ones of the MLR, since the fixed input variables form a linear term, as in the MLR. The only difference to the MLR model is the consideration of co-variate terms, here "pretest*school_type" (see Fig. 5.142). These terms are treated as normal input variables and are therefore handled no differently.
Fig. 5.142 Coefficients of the test scores GLMM with the co-variate term “pretest*school_type”
Target: Post-test
Probability Distribution: Normal
Link Function: Identity
Information Criterion: Akaike Corrected 10,820.097; Bayesian 10,837.074
Fig. 5.143 Summary of the GLMM
The single effects are significant, whereas the product variable "school_type*pretest" has a relatively high p-value of 0.266. Therefore, a more suitable model might be one without this variable.

Model Summary
In the summary view, we can get a quick overview of the model parameters, the target variable, its probability distribution, and the link function. Furthermore, two model evaluation criteria are displayed, the Akaike (AICC) and a Bayesian criterion, which can be used to compare models with each other. A smaller value thereby means a better fit of the model (Fig. 5.143).

Random Effects and Covariance Parameters
The views labeled "Random Effect Covariances" and "Covariance Parameters" display the estimates and statistical parameters of the random effects variance within the clusters, as well as the covariances between the clusters; in our example, these are the school and the classrooms. The covariances of the separate random effects are shown in the "Random Effect Covariances" view. The color indicates the direction of the correlation: a darker color means positively correlated and a lighter color means negatively correlated. If there are multiple blocks, we can switch between them in the bottom selection menu (see Fig. 5.144). A block is thereby the level of subject dependence which was defined in the GLMM node, in this case the school and the classroom (see Fig. 5.139). The value displayed in Fig. 5.144, for example, is a measure of the variation between the different schools. In the "Covariance Parameters" view, the first table gives an overview of the number of fixed and random effects that are considered in the model. In the bottom
Fig. 5.144 Covariance matrix of the random effects
table, the parameters of the residual and covariance estimates are shown. As well as the estimates, these include, in particular, the standard error and the confidence interval (see Fig. 5.145). At the bottom, we can switch between the residual and the block estimates. In the case of the test score model, we find that the estimated variance of the school subject effect (22.676) is much higher than the residual variance (7.868) and the variance of the "school*classroom" subject (7.484) (see Fig. 5.146 for the school subject variation and Fig. 5.145 for the residual variation estimate). This indicates that most of the variation in the test score set which is unexplained by the fixed effects can be described by the between-school variation. However, since the standard error is high, the actual size of the effect is uncertain and cannot be clearly specified.
Fig. 5.145 Covariance Parameters estimate view in the model nugget
Random Effect Block 1
                  Estimate   Std. Error   Z       Sig.   95% Confidence Interval
                                                         Lower      Upper
Var(Intercept)    22.676     7.930        2.860   .004   11.426     45.001
Covariance Structure: Variance components
Subject Specification: school
Fig. 5.146 Estimate of school subject variation
5.4.4 Cross-Validation and Fitting a Quadratic Regression Model
In all previous examples and models, the structure of the data and the relationships between the input and target variables were clear before estimating the model. In many applications, this is not the case, however, and the basic parameters of an assumed model have to be chosen before fitting the model. One famous and often
referenced example is polynomial regression, see Exercise 5 in Sect. 5.3.7. Polynomial regression means that the target variable y is connected to the input variable x via a polynomial term

y = β_0 + β_1·x + β_2·x² + ... + β_p·x^p.

The degree p of this polynomial term is unknown, however, and has to be determined by cross-validation. This is a typical application of cross-validation: to find the optimal model and afterward test it for universality (see Sect. 5.1.2). Recall Fig. 5.4, where the workflow of the cross-validation process is visualized. Cross-validation has an advantage over the usual selection criteria, such as forward or backward selection, since the decision on the exponent is made on an independent dataset, and thus overfitting of the model is prevented; selection criteria can fail in this regard. In this section, we perform a polynomial regression with cross-validation, with the GLMM node, based on the "Ozone.csv" dataset, containing meteorology data and ozone concentration levels from the Los Angeles Basin in 1976 (see Sect. 12.1.31). We will build a polynomial regression model to estimate the ozone concentration from the temperature. We split the description into two parts, in which we first describe the model building and then the validation of these models. Of course, we can merge the two resulting streams into one single stream: we have saved this under the name "ozone_GLMM_CROSS_VALIDATION".

Building Multiple Models with Different Exponents
Description of the model
Stream name: ozone_GLMM_1_BUILDING_MODELS
Based on dataset: Ozone.csv (see Sect. 12.1.31)
Important additional remarks: Fix the seed in the Partition node so that the validation and testing sets in the second validation stream are independent of the training data used here. A short sketch of such a repeatable split is given below.
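The remark about fixing the seed can be mimicked outside the Modeler as follows; this is only a hedged sketch with an assumed file layout, showing how a fixed seed makes the 60/20/20 partition repeatable.

import numpy as np
import pandas as pd

ozone = pd.read_csv("Ozone.csv")       # assumed to contain the columns "O3" and "temp"
rng = np.random.default_rng(1234567)   # fixed seed -> the same rows land in the same partition every run
u = rng.uniform(size=len(ozone))
ozone["partition"] = np.select([u < 0.6, u < 0.8], ["train", "validate"], default="test")
print(ozone["partition"].value_counts())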
Fig. 5.147 Filter node to exclude all irrelevant variables from the further stream
1. We start by importing the comma-separated "Ozone" data with a Var. File node.
2. Since we only need the ozone (O3) and temperature (temp) variables, we add a Filter node to exclude all other variables from the rest of the stream (see Fig. 5.147). For this, we have to click on all the arrows of these variables, which are then crossed out, so they won't appear in the subsequent analysis.
3. Now, we insert the usual Type node, followed by a Plot node, in order to inspect the relationship between the variables "O3" and "temp" (see Sect. 4.2 for instructions on the Plot node). The scatterplot output of the Plot node is shown in Fig. 5.148. The cloud of points suggests a slight curve, which indicates a quadratic relationship between the O3 and temp variables. Since we are not entirely sure, we use cross-validation to decide whether a linear or a quadratic regression is more suitable.
Fig. 5.148 Scatterplot of the “O3” and “temp” variables from the ozone data
4. To perform cross-validation, we insert a Partition node into the stream, open it, and select the "Train, test and validation" option to divide the data into three parts (see Fig. 5.149). We choose 60% of the data as the training set and split the remaining data equally into validation and testing sets. Furthermore, we should point out that we have to fix the seed of the sampling mechanism, so that the three subsets are the same in the later validation stream. This is important, as otherwise the validation would be performed on already known data and would therefore be inaccurate, since the model typically performs better on data it was fitted to. Here, the seed is set to 1,234,567.
5. We generate a Select node for the training data, for example with the "Generate" option in the Partition node (see Sect. 2.7.7), and add this node to the stream between the Type node and the GLMM node.
6. Now, we add a GLMM node to the stream and connect it to the Select node.
Fig. 5.149 Partitioning of the test score data into training, validation and test sets
7. Open the GLMM node and choose "O3" as the target variable in the Fields and Effects tab (see Fig. 5.150). We also select the Linear model relationship, since we want to fit a linear model.
8. In the Fixed Effects option, we drag the temp variable and drop it into the main effects (see Fig. 5.151). The definition of the linear model is now complete.
9. To build the quadratic regression model, copy the GLMM node, paste it into the stream canvas, and connect it with the Select node. Open it and go to the Fixed Effects option. To add the quadratic term to the regression equation, we click on the "Add a custom term" button on the right (see arrow in Fig. 5.151). The custom term definition window pops up (see Fig. 5.152). There, we drag the temp variable into the custom term field, click on the "By*" button, and drag and drop the temp variable into the custom field once again. Now the window should look like Fig. 5.152. We finish by clicking on the "Add Term" button, which adds the quadratic temp term to the regression as an input variable. The Fixed Effects window should now look like Fig. 5.153.
Fig. 5.150 Fields and Effects tab of the GLMM node. Target variable (O3) and the type of model selection
10. Now we run the stream and the model nuggets appear. The stream for building the models is finished, and we can proceed with the validation of those models.

Cross-Validation and Testing
This is the second part of the stream, to perform a cross-validation on the Ozone data. Here, we present how to validate the models from part one and test the best of these models. Since this is the sequel to the above stream, we continue with the enumeration of the process steps.
Fig. 5.151 Input variable definition of the linear model. The variable temp is included as a linear term
Fig. 5.152 Definition of the quadratic term of the variable temp
Fig. 5.153 Input variables of the quadratic regression

Description of the model
Stream name: ozone_GLMM_2_VALIDATION
Based on dataset: Ozone.csv (see Sect. 12.1.31)
Important additional remarks: Use the same seed in the Partition node as in the model building stream, so that the validation and testing are independent of the training data.
11. We start with the stream "ozone_GLMM_1_BUILDING_MODELS" just built in the previous step and save it under a different name.
12. Delete the Select node and both GLMM nodes, but keep the model nuggets.
13. Open the Partition node and generate a new Select node for the validation data. Connect this Select node to the Partition node and to both model nuggets.
14. To analyze the model performances, we could add an Analysis node to each of the nuggets and run the stream, but we wish to describe an alternative way to proceed with the validation of the models, which gives a better overview of all the statistics. This will make it easier to compare the models later. We add a Merge node to the canvas and connect each nugget to it. We open it and go to the "Filter" tab. To distinguish the two output variables, we rename them from just "$L-O3" to "$L-O3 quadric" and "$L-O3 linear", respectively. Now, we switch off one of ALL remaining duplicate fields by clicking on the particular arrows in the "Filter" column (see Fig. 5.154). After the merge, each input variable then appears only once, together with the two prediction variables.
15. Now, we connect the Merge node with an Analysis node that has already been put on the canvas.
16. We run the stream. The output of the Analysis node is displayed in Fig. 5.155. There, the statistics from both models are displayed in one output window, which makes comparison easy.
Fig. 5.154 Merge of the model outputs. Switching off the duplicate fields
Fig. 5.155 Validation of the models with the output of the Analysis node
As we can see, the mean errors of all three are close to 1, and the standard deviations of the models are very close to each other. This indicates that the quadratic regression does not strongly improve the goodness of fit; but when computing the RMSE (see Sect. 5.1.2), the error of the quadratic regression is smaller than the error of the linear model. Hence, the former outperforms the latter and describes the data a bit better. 17. We still have to test this by cross-validating the chosen model with the remaining test data. The stream is built similarly to the other testing streams (see, e.g., Sect. 5.2.5 and Fig. 5.156).
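For readers who want to reproduce this comparison outside the Modeler, the statistics reported by the Analysis node (mean error, standard deviation of the errors, and the RMSE of Sect. 5.1.2) can be computed directly from the observed and predicted values. The following is only an illustrative sketch; the array names in the commented usage are placeholders for the validation partition and the two model predictions.

```python
import numpy as np

def error_stats(y_true, y_pred):
    errors = np.asarray(y_true) - np.asarray(y_pred)
    return {
        "mean_error": errors.mean(),             # close to 0 for an unbiased model
        "std_error": errors.std(ddof=1),         # standard deviation reported by the Analysis node
        "rmse": np.sqrt(np.mean(errors ** 2)),   # root mean squared error (Sect. 5.1.2)
    }

# Placeholder usage with the validation partition:
# print(error_stats(o3_validation, prediction_linear))
# print(error_stats(o3_validation, prediction_quadratic))
```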
Fig. 5.156 Test stream for the selected model during cross-validation
Fig. 5.157 Validation of the final model with the test set. Analysis output
18. To test the model, first, we generate a Select node, connect it to the Type node, and select the testing data. 19. We copy the model nugget of the model that performed best in the validation and paste it into the canvas. Then, we connect it to the Select node of the test dataset. 20. We add another Analysis node at the end of this stream and run it. The output is shown in Fig. 5.157. We can see that the error mean is almost 0 and the RMSE
(see Sect. 5.1.2) is of the same order as the standard deviation of the training and validation data. Hence, a quadratic model is suitable for predicting the ozone concentration from the temperature.
5.4.5
Exercises
Exercise 1: Comparison of a GLMM and Ordinary Regression The dataset “Orthodont.csv” (see Sect. 12.1.30) contains measures of orthodontic change (distance) over time for 27 teenagers. Each subject has a different tooth positioning and thus an individual therapy prescription, which results in movements of various distances. The goal of this exercise is therefore to build a Generalized Linear Regression Model with the GLMM node that models the situation on an individual level and predicts the distance of tooth movement using the age and gender of a teenager. 1. Import the data file and create a boxplot of the “distance” variable. Is the assumption of adding individual effects suitable? 2. Build two models with the GLMM node, one with random effects and one without. Use the “Age” and “Sex” variables as fixed effects in both models, and add an individual intercept as a random effect in the first model. 3. Does the random effect improve the model fit? Inspect the model nuggets to answer this question. Exercise 2: The GenLin Node In this exercise, we introduce the GenLin node and give readers the opportunity to familiarize themselves with this node and learn its advantages and differences compared with the GLMM node. The task is to set up a stream for the test scores dataset (see Sect. 12.1.36) and perform cross-validation with the GenLin node. 1. Import the test score data and partition it into three parts, training, validation, and test set, with the Partition node. 2. Add a GenLin node to the stream and select the target and input variables. Furthermore, select the partition field as Partition, to indicate to which subset each record belongs. 3. In the Model tab, choose the option “Use partitioned data” and select “Include intercept”. 4. In the Expert tab, set the normal distribution for the target variable and choose Power as the link function with parameter 1.
5. Add two additional GenLin nodes to the stream by repeating steps 2–4. Choose 0.5 and 1.5 as the power parameters for the first and second of these models, respectively. 6. Run the stream and validate the model performances with the Analysis nodes. Which model is most appropriate for describing the data? 7. Familiarize yourself with the model nugget. What are the tables and statistics shown in the different tabs? Exercise 3: Generalized Model with Poisson-Distributed Target Variable The “ships.csv” dataset (see Sect. 12.1.35) comprises data on ship damage caused by waves. Your task is to model the counts of these incidents with a Poisson regression, using the predictors ship type, year of construction, and operation period. Furthermore, the aggregated months of service are stated for each ship. 1. Consider why a Poisson regression is suitable for this kind of data. If you are not familiar with it, inform yourself about the Poisson distribution and the kind of randomness it describes. 2. Use the GenLin node to build a Poisson regression model with the three predictor variables mentioned above. Use 80% of the data for the training set and 20% for the test set. Is the model appropriate for describing the data and predicting the counts of ship damage? Justify your answer. 3. Inform yourself about the “Offset” of a Poisson regression, for example, on Wikipedia. What does this field actually model, in general and in the current dataset? 4. Update your stream with a logarithmic transformation of the “months of service” variable and add these values as the offset. Does this operation increase the model fit?
5.4.6
Solutions
Exercise 1: Comparison of a GLMM and Ordinary Regression
Name of the solution streams: glmm_orthodont
Theory discussed in section: Sect. 5.4
Figure 5.158 shows the final stream for this exercise. 1. We import the “Orthodont.csv” data (see Sect. 12.1.30) with a Var. File node. Then, we add the usual Type node to the stream. Next, we add a Graphboard node to the stream and connect it with the Type node. After opening it, we mark the “distance” and “Subject” variables and click
Fig. 5.158 Complete stream of the exercise to fit a GLMM to the “Orthodont” data with the GLMM node
on the Boxplot graphic (see Fig. 5.159). After running the node, the output window appears, which displays the distribution of each subject’s distance via a boxplot (see Fig. 5.160). As can be seen, the boxes are not homogeneous, which confirms our assumption of individual effects. Therefore, the use of a GLMM is highly recommended. 2. Next, we add a GLMM node to the stream canvas and connect it with the Type node. We open it, select the “distance” variable as the target variable, and choose the linear model option as the type of the model (see Fig. 5.161). Now, we add the “Age” and “Sex” variables as fixed effects to the model, in the Fields and Effects tab of the GLMM node (see Fig. 5.162). This finishes the definition of the model parameters for the model without random effects.
Fig. 5.159 Options in the Graphboard node for performing a boxplot
Now, we copy the GLMM node and paste it into the stream canvas. Afterward, we connect it with the Type node and open it. In the Data Structure tab, we drag the “Subject” field and drop it into the Subject canvas (Fig. 5.163). This includes an additional intercept as a random effect in the model and finishes the parameter definition of the model with random effects. We finally run the stream, and the two model nuggets appear. 3. We open both model nuggets and take a look at the “Predicted by Observed” scatterplots. These can be viewed in Figs. 5.164 and 5.165. As can be seen, the points of the model with random effects lie in a straight line around the diagonal,
Fig. 5.160 Boxplot for each subject in the Orthodont dataset
whereas the points of the model without random effects have a cloudier shape. Thus, the model with random effects explains the data better and is more capable of predicting the distance moved, using the age and gender of a young subject. This is also indicated by the corrected Akaike information criterion (AICC), which can be checked in the Model Summary views. Here, the model with random effects has a smaller AICC value (441.630) than the one without random effects (486.563), which indicates a better fit of the model (see Fahrmeir (2013)).
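The same comparison can be reproduced outside the Modeler, for example with the random-intercept model offered by statsmodels. This is only a sketch under the assumption that the Orthodont data are available as a CSV file with the column names used in the exercise (“distance”, “Age”, “Sex”, “Subject”).

```python
import pandas as pd
import statsmodels.formula.api as smf

ortho = pd.read_csv("Orthodont.csv")  # assumed export of the dataset

# Model without random effects: fixed effects only, fitted by ordinary least squares
fixed_only = smf.ols("distance ~ Age + C(Sex)", data=ortho).fit()

# Model with an individual random intercept per subject
random_intercept = smf.mixedlm("distance ~ Age + C(Sex)", data=ortho,
                               groups=ortho["Subject"]).fit()

print(fixed_only.summary())
print(random_intercept.summary())
```

The log-likelihoods and information criteria of the two fits can then be compared in the same spirit as the AICC comparison above.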
Fig. 5.161 Target selection and model type definition in the GLMM node
Fig. 5.162 Definition of the fixed effects in the GLMM node
Fig. 5.163 Adding an individual random effect to the GLMM model
Fig. 5.164 Predicted by Observed plot of the model without random effects
Fig. 5.165 Predicted by Observed plot of the model with random effects
Exercise 2: The GenLin Node
Name of the solution streams: Test_scores_GenLin
Theory discussed in section: Sect. 5.4
In this exercise, we get to know the GenLin node. Figure 5.166 shows the complete stream of the solution. 1. The “001 Template-Stream test_scores” is the starting point for our solution of this exercise. After opening this stream, we add a Partition node to it and select a three-part splitting of the dataset with the ratio 60% training : 20% testing : 20% validation (see Fig. 5.167). 2. Now, we add a GenLin node to the stream and select “posttest” as the target variable, “Partition” as the partitioning variable, and all other variables as input (see Fig. 5.168). 3. In the Model tab, we choose the option “Use partitioned data” and select “Include intercept” (see Fig. 5.169). 4. In the Expert tab, we set the normal distribution for the target variable and choose “Power” as the link function with parameter 1 (see Fig. 5.170). 5. We add two additional GenLin nodes to the stream and connect the Partition node to them. Afterward, we repeat steps 2–4 for each node and set the same parameters, except for the power parameters, which are set to 0.5 and 1.5 (see Fig. 5.171 for how the stream should look). 6. Now, we run the stream and the three model nuggets should appear in the canvas. We then add an Analysis node to each of these nuggets and run the stream a
Fig. 5.166 Complete stream of the exercise to fit a regression into the test score data with the GenLin node
Fig. 5.167 Partitioning of the test score data into three parts: training, testing and validation
Fig. 5.168 Specification of the target, input and partitioning variables
second time to get the validation statistics of the three models. These can be viewed in Figs. 5.172, 5.173, and 5.174. We see that the standard deviations are very similar across all models and across the dataset partitions. This suggests that there is no real difference in the performance of the models for this data. The model with power 1 has the smallest values of these statistics, however, and is thus the most appropriate for describing the test score data. 7. To get insight into the model nugget and its parameters and statistics, we will only inspect the model nugget of the GenLin power 1 model here. The other nuggets have the same structure and tables. First, we observe that the GenLin model nugget can also visualize the predictor importance in a graph. This is displayed in the Model tab. Not surprisingly, the
Fig. 5.169 The model tab of the GenLin node. Enabling of partitioning use and intercept inclusion
“Pretest” variable is by far the most important variable in predicting the “posttest” score (see Fig. 5.175). Now, we take a closer look at the “Advanced” tab. Here, multiple tables are displayed, but the first four (Model Information, Case Processing Summary, Categorical Variable Information, Continuous Variable Information) summarize the data used in the modeling process. The next table is the “Goodness of Fit” table; it contains a couple of measures for validating the model using the training data. These include, among others, the Pearson chi-square value and Akaike’s Information Criterion with its finite-sample correction (AIC and AICC) (see Fig. 5.176). For these criteria, the smaller the value, the better the model fit. If we compare, for example, the AICC of the three models with each other, we get the same picture as before with the Analysis nodes. The model with a power of 1 explains the training data better than the other two models. The next table shows the significance level and test statistics from comparing the model with the naive model, that is, the model consisting only of the
Fig. 5.170 Expert tab in the GenLin node. Definition of the target distribution and the link function
intercept. As displayed in Fig. 5.177, this model is significantly better than the naive model. The second-to-last table contains the results of the significance tests of the individual input variables. The significance level is located in the rightmost column (see Fig. 5.178). In the case of the test scores, all effects are significant at the 1% level, except for the school_setting, school_type, and gender variables. If we compare this with Exercise 2 in Sect. 5.3.7, where the “posttest” score is estimated by an MLR, we see that there the variables school_type and gender were omitted by the selection method (see Fig. 5.103). Thus, this is consistent with the related models.
Fig. 5.171 Stream after adding three different GenLin models
Fig. 5.172 Output of the Analysis node for the GenLin model with power 1
Fig. 5.173 Output of the Analysis node for the GenLin model with power 0.5
Fig. 5.174 Output of the Analysis node for the GenLin model with power 1.5
Fig. 5.175 Predictor importance graphic in the GenLin model nugget
Goodness of Fit (a)

                                        Value        df      Value/df
Deviance                                12765,277    1240    10,295
Scaled Deviance                         1249,000     1240
Pearson Chi-Square                      12765,277    1240    10,295
Scaled Pearson Chi-Square               1249,000     1240
Log Likelihood (b)                      –3223,833
Akaike’s Information Criterion (AIC)    6467,666
Finite Sample Corrected AIC (AICC)      6467,844
Bayesian Information Criterion (BIC)    6518,967
Consistent AIC (CAIC)                   6528,967

Dependent Variable: posttest
Model: (Intercept), school_setting, school_type, teaching_method, n_student, gender, lunch, pretest
a. Information criteria are in smaller-is-better form.
b. The full log likelihood function is displayed and used in computing information criteria.
Fig. 5.176 Goodness of fit parameter for the GenLin model
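As a brief side note on how these information criteria relate to the reported log-likelihood, the standard definitions are

$$\mathrm{AIC} = -2\log L + 2k, \qquad \mathrm{BIC} = -2\log L + k\,\ln n, \qquad \mathrm{AICC} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}.$$

The values in the table are consistent with $k = 10$ estimated parameters (the nine mean parameters plus the scale) and $n = 1249$ training records: $-2(-3223.833) + 2 \cdot 10 = 6467.666$ (AIC), $6447.666 + 10\,\ln 1249 \approx 6518.967$ (BIC), and $6467.666 + 220/1238 \approx 6467.844$ (AICC). This also matches the deviance degrees of freedom of $1249 - 9 = 1240$ shown in the table. Smaller values of all these criteria indicate a better trade-off between goodness of fit and model complexity.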
Omnibus Test (a)

Likelihood Ratio Chi-Square    df    Sig.
3657,695                       8     ,000

Dependent Variable: posttest
Model: (Intercept), school_setting, school_type, teaching_method, n_student, gender, lunch, pretest
a. Compares the fitted model against the intercept-only model.
Fig. 5.177 Test of the GenLin model against the naive model
Test of Model Effects

                    Type III
Source              Wald Chi-Square    df    Sig.
(Intercept)         347,217            1     ,000
school_setting      6,494              2     ,039
school_type         ,060               1     ,806
teaching_method     957,871            1     ,000
n_student           8,223              1     ,004
gender              1,118              1     ,290
lunch               7,997              1     ,005
pretest             6958,141           1     ,000

Dependent Variable: posttest
Model: (Intercept), school_setting, school_type, teaching_method, n_student, gender, lunch, pretest
Fig. 5.178 Test of the effects of the GenLin model
The last table finally summarizes the estimated coefficients of the variables, with the significance test statistics. The coefficients are in column B and the significance levels are in the last column of the table (see Fig. 5.179). As in the previous table, the school_setting, school_type, and gender coefficients are not significant at the 1% level. This means that a model without these input variables might be a more suitable fit for this data. We encourage the reader to build a new model and test this.
Parameter Estimates

                                                 95% Wald Confidence Interval    Hypothesis Test
Parameter              B           Std. Error   Lower        Upper              Wald Chi-Square    df    Sig.
(Intercept)            23,358      1,0153       21,368       25,347             529,304            1     ,000
[school_setting=1]     ,402        ,3176        –,220        1,025              1,605              1     ,205
[school_setting=2]     ,722        ,2852        ,163         1,281              6,402              1     ,011
[school_setting=3]     0 (a)       .            .            .                  .                  .     .
[school_type=1]        ,073        ,2968        –,509        ,655               ,060               1     ,806
[school_type=2]        0 (a)       .            .            .                  .                  .     .
[teaching_method=0]    –5,971      ,1929        –6,350       –5,593             957,871            1     ,000
[teaching_method=1]    0 (a)       .            .            .                  .                  .     .
n_student              –,110       ,0385        –,186        –,035              8,223              1     ,004
[gender=0]             –,192       ,1813        –,547        ,164               1,118              1     ,290
[gender=1]             0 (a)       .            .            .                  .                  .     .
[lunch=1]              –,711       ,2515        –1,204       –,218              7,997              1     ,005
[lunch=2]              0 (a)       .            .            .                  .                  .     .
pretest                ,911        ,0109        ,890         ,933               6958,141           1     ,000
(Scale)                10,220 (b)  ,4090        9,449        11,054

Dependent Variable: posttest
Model: (Intercept), school_setting, school_type, teaching_method, n_student, gender, lunch, pretest
a. Set to zero because this parameter is redundant.
b. Maximum likelihood estimate.
Fig. 5.179 Parameter estimate and validation for the GenLin model
Exercise 3: Generalized Model with Poisson-Distributed Target Variable
Name of the solution streams: Ships_genlin
Theory discussed in section: Sect. 5.1.2, Sect. 5.4
Figure 5.180 shows the complete solution stream. 1. Poisson regression is typically used to model count data and assumes a Poisson-distributed target variable. The expected value is linked to the linear predictor via a logarithmic function, hence

$$\log E(\text{target variable} \mid x_{i1}, \ldots, x_{ip}) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},$$

with the target variable, here “incidents”, following a Poisson distribution. A Poisson distribution describes rare events and is ideal for modeling random events that occur infrequently. We refer to Fahrmeir (2013) and Cameron and Trivedi (2013) for more information on Poisson regression. Since ship damage caused by a wave is very unusual, the assumption of a Poisson-distributed target variable is reasonable, and so a Poisson regression is suitable. Plotting a histogram of the “incidents” variable also affirms this assumption (see Fig. 5.181).
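For reference, a Poisson-distributed count variable $Y$ with rate $\lambda > 0$ has the probability mass function

$$P(Y = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots,$$

with mean and variance both equal to $\lambda$. In the Poisson regression above, $\lambda$ is modeled through the log link as a function of the predictors.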
Fig. 5.180 Complete stream of the exercise to fit a Poisson regression
Fig. 5.181 Histogram of the “incidents” variable, which suggests a Poisson distribution
Fig. 5.182 Partitioning of the ship data into a training set and a test set
2. To build a Poisson regression with the GenLin node, we first split the data into a training set and a test set using a Partition node in the desired proportions (Fig. 5.182). After adding the usual Type node, we insert the GenLin node into the stream by connecting it to the Type node. Now, we open the GenLin node and define the target, Partition, and input variables in the Fields tab (see Fig. 5.183). In the Model tab, we enable the “Use partitioned data” and “Include intercept” options, as in Fig. 5.169. In the Expert tab, we finally define the settings of a Poisson regression. That is, we choose the “Poisson” distribution as the target distribution and “logarithmic” as the link function (see Fig. 5.184). Afterward, we run the stream and the model nugget appears.
Fig. 5.183 Definition of the target, partition, and input variables for Poisson regression of the ship data
To evaluate the goodness of fit of our model, we add an Analysis node to the nugget. Figure 5.185 shows the final stream of part 2 of this exercise. Now, we run the stream again, and the output of the Analysis node pops up. The output can be viewed in Fig. 5.186. We observe that the standard deviations of the training and test data differ a lot: 5.962 for the training data, but 31.219 for the test data. This indicates that the model describes the training data very well, but is not appropriate for independent data. Hence, we have to modify our model, which is done in the following parts of this exercise.
Fig. 5.184 Model settings for the Poisson regression, that is, the “Poisson” distribution and “logarithmic” as the link function
Fig. 5.185 Final stream of part 2 of the exercise
Fig. 5.186 Output of the Analysis node for Poisson regression on the ship data
3. Occurrences of events are often counted over different exposure times, and so they can appear to happen equally often although this is not the case. For example, in our ship data, the variable “service” describes the aggregated months of service, which differ for each ship. The observed damage is thus based on a different timescale for each ship and is not directly comparable. The “offset” is an additional variable that balances this disparity in the model, so that the damage counts are effectively compared on the same exposure scale. For more information on the offset, we refer to Hilbe (2014). 4. To update our model with an offset variable, we have to calculate the logarithm of the “service” variable. Therefore, we insert a Derive node between the Source and
Fig. 5.187 Calculation of the offset term in the Derive node
the Partition node. In this node, we set the formula “log(service+1)” to calculate the offset term and name this new variable “log_month_service” (see Fig. 5.187). The “+1” in the formula is just a correction term, added since some “service” values are 0 (this can easily be seen by running the Data Audit node), for which the logarithm cannot be calculated. With this extra term (“+1”), the logarithm transformation can be applied to all values of the “service” variable. Then, we open the GenLin node and set the “log_month_service” variable as the Offset field. This is displayed in Fig. 5.188. Now, we run the stream again,
Fig. 5.188 Specification of the offset variable. Here, the log_month_service
and the output of the new model pops up (see Fig. 5.189). We observe that the two standard deviations no longer differ much from each other. Thus, by setting an offset term, the accuracy of the model has increased, and the model is now able to predict ship damage from ship data not involved in the modeling process.
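For comparison, the same offset idea can be sketched outside the Modeler with statsmodels. This is only an illustration; the file name and the predictor column names used below (“type”, “construction”, “operation”, “service”, “incidents”) are assumptions based on the exercise description, not necessarily the exact names in the data file.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

ships = pd.read_csv("ships.csv")                           # assumed export of the ship data
ships["log_month_service"] = np.log(ships["service"] + 1)  # same "+1" correction as in the Derive node

poisson_model = smf.glm(
    "incidents ~ C(type) + C(construction) + C(operation)",  # assumed predictor names
    data=ships,
    family=sm.families.Poisson(),                             # log link is the default for Poisson
    offset=ships["log_month_service"],
).fit()

print(poisson_model.summary())
```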
Fig. 5.189 Output for the Poisson regression with offset term
5.5
The Auto Numeric Node
The SPSS Modeler provides us with an easy way to build multiple models in one step, using the Auto Numeric node. The Auto Numeric node considers various models, which use different methods and techniques, and ranks them according to a chosen quality measure. This is very advantageous to data miners for two main reasons. First, we can easily compare different settings of a mining method within a single stream instead of running multiple streams with diverse settings. This helps to quickly find the best method setup and to easily try different parameter settings, as in the case of a polynomial regression (see Sect. 5.4.4). This can help to quickly get a better understanding of the data. A second reason to use the Auto Numeric node is that it comprises models from different data mining approaches, such as the regression models described in this chapter, but also neural networks, regression trees, and boosting models. The latter algorithms are most famous and
Fig. 5.190 Model types and nodes supported by the Auto Numeric node. The top list contains model types discussed in detail in this chapter. The bottom list shows further model types and nodes that are included in the Auto Numeric node
commonly used for classification problems (see Chap. 8). However, they can also be very effective when it comes to regression problems. Figure 5.190 shows a list of all model types and nodes supported by the Auto Numeric node, and we recommend Kuhn and Johnson (2013), IBM (2019b), and Chap. 8 for detailed descriptions of the methods and nodes not discussed in this chapter. After building, the Auto Numeric node takes the best-fitting models and joins them together into a model ensemble (see Sect. 5.3.6 for the concept of ensemble models). That means that, when predicting the target variable value, each of the models in the ensemble processes the data and predicts the output, and these outputs are then aggregated into one final prediction using the mean. This aggregation of values predicted by different models has the advantage that data points treated as outliers or given high leverage by one model type are smoothed out by the output of the other models. Furthermore, overfitting becomes less likely. We would like to point out that building a huge number of models is very time consuming. That is why a large number of considered models in the Auto Numeric node may take a very long time to calculate, up to a couple of hours.
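The aggregation step itself is simple: each selected model produces its own prediction, and the ensemble output is the mean of these predictions. The following self-contained sketch illustrates this idea with simulated numbers; it is not Modeler output, and the three “models” are just noisy stand-ins for real ensemble members.

```python
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.normal(22.0, 9.0, size=200)          # stand-in for observed target values

# Stand-ins for the predictions of three ensemble members
predictions = {
    "model_a": y_true + rng.normal(0.0, 4.0, size=200),
    "model_b": y_true + rng.normal(0.0, 5.0, size=200),
    "model_c": y_true + rng.normal(0.0, 3.0, size=200),
}

# Ensemble prediction: the mean of the individual model outputs
ensemble = np.mean(np.vstack(list(predictions.values())), axis=0)

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

for name, pred in predictions.items():
    print(name, rmse(y_true, pred))
print("ensemble", rmse(y_true, ensemble))         # typically smaller than any single member here
```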
5.5.1
Building a Stream with the Auto Numeric Node
In the following, we show how to effectively use the Auto Numeric node to build an optimal model for our data mining task. A further advantage of this node is its capability of running a cross-validation within the same stream. This results in more clearly structured streams. In particular, no additional stream for validation of the model has to be constructed. We include this useful property in our example stream.
Description of the model
Stream name: Auto numeric node
Based on dataset: housing.data.txt (see Sect. 12.1.20)
Stream structure: (see stream figure)
Important additional remarks: The target variable must be continuous in order to use the Auto Numeric node
Related exercises: All exercises in Sect. 5.5.3
1. First, we import the data, in this case the Boston housing data, housing.data.txt. 2. To cross-validate the models within the stream, we have to split the dataset into two separate sets: the training data and the test data. Therefore, we add the Partition node to the stream and partition the data appropriately: 70% of the data to train the models and the rest for the validation. The Partition node is described in more detail in Sect. 2.7.7. Afterward, we use the Type node to assign the proper variable types. 3. Now, we add the Auto Numeric node to the canvas and connect it with the Type node, then open it with a double-click. In the “Fields” tab, we define the target and input variables, where MEDV describes the median house price and is set as the target variable, and all other variables, except for the partitioning field, are chosen as inputs. The partition field is selected in the Partition drop-down menu, to indicate to the Modeler that this field defines the training and the test sets (see Fig. 5.191 for details). 4. In the “Model” tab, we enable the “Use partitioned data” option (see the top arrow in Fig. 5.192). This option will lead the model to be built based on the training data alone. In the “Rank models by” selection field, we can choose the score that is used to validate the models and compare them with each other. Possible measures are: – Correlation: This is the correlation coefficient between the observed target values and the target values estimated by the model. A coefficient near 1 or –1 indicates a strong linear relationship and that the model fits the data potentially better than a model with a coefficient close to 0.
Fig. 5.191 Definition of target and input variables and the partition field in the Auto Numeric node
– Number of fields: Here, the number of variables that are included in the model is considered. A model with fewer predictors might be more robust and have a smaller chance of overfitting. – Relative error: This is a measure of how well the model predicts the data, compared to the naive approach that just uses the mean as an estimator. In particular, the relative error equals the variance of the observed values from those predicted, divided by the variance of the observed values from the mean (see the formula directly after this list).
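Written compactly, with the notation assumed here ($y_i$ the observed values, $\hat{y}_i$ the model predictions, and $\bar{y}$ the mean of the observed values), the relative error described in the last list item is

$$\text{relative error} = \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

so a value below 1 means the model predicts better than simply using the mean.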
Fig. 5.192 Model tab with the criterion to decide which models should be included in the ensemble
With the “rank” selection, we can choose whether the models should be ranked by the training or the test partition, and how many models should be included in the final ensemble. Here, we specify that the ensemble should contain four models (see the bottom arrow in Fig. 5.192). At the bottom of the tab, we can define more precise exclusion criteria, to ignore models that are unsuitable for our purposes. If these thresholds are too strict, however, we might end up with an empty ensemble; that is, no model fulfills the criteria. If this happens, we should loosen the criteria. In the Model tab, we can also choose to calculate the predictor importance, and we recommend enabling this option each time. For predictor importance, see Sect. 5.3.3. 5. The next tab is the “Expert” tab. Here, the models that should be calculated and compared can be specified (see Fig. 5.193). Besides the known Regression, Linear and GenLin nodes, which are the classical approaches for numerical
Fig. 5.193 Selection of the considered data mining methods. Model types chosen as candidates for the final ensemble model are Regression, Generalized Linear, LSVM, XGBoost Linear, XGBoost Tree, Linear, CHAID, C&R Tree, Random Forest and Neural Net
estimation, we can also include decision trees, neural networks, support vector machines, random forests, and boosting models as candidates for the ensemble. Although these models are better known for other types of data mining tasks, classification problems in particular, they are also capable of estimating numeric values. We omit a description of the particular models here and refer to the corresponding Chap. 8 on classification, IBM (2019a), and Kuhn and Johnson (2013) for further information on the nodes, methods, and algorithms.
Fig. 5.194 Specification of the Linear model building settings
We can also specify multiple settings for one node, in order to include more model variations and to find the best model of one type. Here, we used two setups for the Regression node and eight for the Linear node (see Fig. 5.193). We demonstrate how to specify these settings using the Linear node. For all other nodes, the steps are the same, but the options are of course different for each node. 6. To include more models of the same type in the calculations, we click on the “Model parameters” field next to the particular model. We choose the option “Specify” in the opening selection bar. Another window pops up in which we can define all model variations. For the Linear node, this looks like Fig. 5.194.
Fig. 5.195 Specification of the variable selection methods to consider in the Linear node model
7. In the opened window, we select the “Expert” tab and, in the “Options” field next to it, we click on each parameter that we want to change (see the arrow in Fig. 5.194). Another, smaller window opens with the selectable parameter values (see Fig. 5.195). We mark all parameter options that we wish to be considered, here, the “Forward stepwise” and “Best subset” model selection methods, and confirm by clicking the “OK” button. We proceed in an analogous fashion with all other parameters. 8. If we have set all the model parameters of our choice, we run the model, and the model nugget should appear in the stream. For each possible combination of selected parameter options, the Modeler now generates a statistical model and compares it to all other built models. If it is ranked high enough, the model is included in the ensemble.
Fig. 5.196 Analysis node to deliver statistics of the predicted values. Select “Separate by partition” to process the training and test sets separately
We would again like to point out that running a huge number of models can lead to time-consuming calculations. 9. We add an Analysis node to the model nugget to calculate the statistics of the predicted values by the ensemble, for the training and test sets separately. We make sure that the “Separate by partition” option is enabled, as displayed in Fig. 5.196. 10. Figure 5.197 shows the output of the Analysis node and the familiar distribution values of the errors for the training data and test data. To evaluate if the ensemble model can be used to process independent data, however, we have to compare the statistics and, in particular, the standard deviation, which is the RMSE (see Sect. 5.1.2). Since this is not much higher for the test data than for
Fig. 5.197 Analysis output with statistics from both the training and the test data
the training, the model passes the cross-validation test and can be used for predictions of further unknown data.
5.5.2
The Auto Numeric Model Nugget
In this short section, we take a closer look at the model nugget generated by the Auto Numeric node and the options it provides.
Model Tab and the Selection of Models Contributing to the Ensemble
In the Model tab, the models suggested for the ensemble by the Auto Numeric node are listed (see Fig. 5.198). In our case, we set the ensemble model to comprise four models: these are a boosted tree, two decision trees (C&R and CHAID trees), and a linear regression model. The models are ordered by their correlation, as this is the ranking criterion chosen in the node options (see previous section). The order can be manually changed in the drop-down menu on the left, labeled “Sort by”, in Fig. 5.198. Here, the model statistics, which are the correlation and the relative error, are calculated for the test set, and the models are thus ranked according to these values (see the right arrow in Fig. 5.198). We see that the XGBoost Tree has a notably higher correlation (0.911) than the other models. Hence, this model is the most suitable for fitting the data. We can change the basis of the calculations to be the training data, on which all ranking and fitting statistics will then be based. The test set, however, has an
Fig. 5.198 Model tab of the Auto Numeric model nugget. Specification of the models in the ensemble, which are used to predict the output
advantage in that the performance of the models is verified on unknown data. To check whether each model fits the data well, however, we recommend looking at the individual model nuggets manually and inspecting the parameter values. Double-clicking on a model will allow the inspection of that model, its included variables, and the model fitting process, with quality measures such as R². This is highlighted by the left arrow in Fig. 5.198. This will open the model nugget of the particular node in a new window, and the details of the model will be displayed. Since the XGBoost Tree model nugget displays only the predictor importance (in the Modeler version used in this book, i.e., SPSS Modeler 18.2), and each of the other model nuggets is introduced and described separately in the associated chapters, we omit a description of the individual nuggets here and refer to those chapters. We further refer to IBM (2019b) for a description of all model nuggets whose discussion is out of the scope of this book. Double-clicking on a graph opens a new window, which displays a scatterplot of the target variable versus its prediction. In a precise and proper model, the points should be oriented as a straight line along the diagonal, which is the case for all of the models included in our ensemble, with the prediction of the XGBoost Tree having the lowest variation and, thus, the best prediction of the target variable MEDV. In the leftmost column, labeled “Use?”, we can choose which of the models should process data for prediction. More precisely, each of the enabled models takes the input data and estimates the target value individually. Then, all outputs are averaged to one single output. This aggregation process can prevent overfitting and minimize the impact of outliers, which leads to more trustworthy predictions.
Fig. 5.199 Graph tab of the Auto Numeric model nugget, which displays the predictor importance and scatterplot of observed and predicted values
Predictor Importance and Visualization of the Accuracy of Prediction
In the “Graph” tab, the predicted values are plotted against the observations on the left-hand side (see Fig. 5.199). There, not every single data point is displayed, but instead bigger colored dots that describe the density of data points in the area. The darker the color, the higher the density, as explained in the legend right next to the scatterplot. If the points form a line around the diagonal, the ensemble describes the data properly. In the graph on the right, the importance of the predictors is visualized in the known way. We refer to Sect. 5.3.3 and Fig. 5.71 for explanations of the predictor importance and the shown plot. The predictor importance of the ensemble model is calculated on the averaged output data.
Setting of Additional Output
The “Settings” tab provides us with additional output options (see Fig. 5.200). If the first option is checked, the predictions of the individual models are removed from the prediction output. If we want to display or further process these non-aggregated values, we uncheck the box for this option. Furthermore, we can add the standard error, estimated for each prediction, to our output (see the second check box in Fig. 5.200). We recommend playing with these options and previewing the output data, to understand the differences between each created output. For more information, we recommend consulting IBM (2019b).
Fig. 5.200 Settings tab of the Auto Numeric model nugget. Specification of the output
5.5.3
Exercises
Exercise 1: Longley Economics Dataset The “longley.csv” data (see Sect. 12.1.25) contains annual economic statistics of the United States of America from 1947 to 1962, including, among other variables, the gross national product, the number of unemployed people, the population size, and the number of employed people. The task in this exercise is to find the best possible regression model with the Auto Numeric node for predicting the number of employed people from the other variables in the dataset. What is the best model node, as suggested by the Auto Numeric procedure? What is the highest correlation achieved? Exercise 2: Cross-Validation with the Auto Numeric Node Recall Exercise 2 from Sect. 5.4.5, where we built a stream with cross-validation in order to find the optimal Generalized Linear Model to predict the outcome of a test score. There, three separate GenLin nodes were used to build the different models. Use the Auto Numeric node to combine the model building and validation processes in a single step.
5.5.4
Solutions
Exercise 1: Longley Economics Dataset
Name of the solution streams: Longley_regression
Theory discussed in section: Sect. 5.5
For this exercise, there is no definite solution, and it is simply not feasible to consider all possible model variations and parameters. This exercise serves as a practice tool for the Auto Numeric node and its options. One possible solution stream can be seen in Fig. 5.201. 1. To build the “Longley_regression” stream of Fig. 5.201, we import the data and add the usual Type node. The dataset consists of 7 variables and 16 data records, which can be observed with the Table node. We recommend using this node to inspect the data and get a first impression of it. 2. We then add the Auto Numeric node to the stream and define the variables, with the “Employed” variable as the target and all other variables as predictor variables (see Fig. 5.202). In this solution to the exercise, we use the default settings of the Auto Numeric node for the model selection; that is, using the partitioned data for ranking the models, with the correlation as the indicator. We also want to calculate the importance of the predictors (see Fig. 5.203). Furthermore, we include the same model types as in the explanation in Sect. 5.5.1, but only in their default settings, as preset by the Auto Numeric node (see Fig. 5.204). We strongly recommend also testing other models and playing with the parameters of these
Fig. 5.201 Stream of the regression analysis of the Longley data with the Auto Numeric node
Fig. 5.202 Definition of the variables for the Longley dataset in the Auto Numeric node
models, in order to find a more suitable model and to optimize the accuracy of the prediction. 3. After running the stream, the model nugget pops up. When opening the nugget, we see in the “model rank” tab that the best three models or nodes chosen by the
Fig. 5.203 Model build options in the Auto Numeric node
Auto Numeric node are the Linear node, a Regression node, and the GLM (see Fig. 5.205). There, the GenLin node tried to estimate the data relationship using a line, its default setting. We observe that for all three models the correlation coefficient is extremely high, i.e., 0.998. Moreover, the model stemming from the Linear node is slimmer, insofar as fewer predictors are included in the final model, i.e., 4 instead of 6. As mentioned above, this is not the definite solution. We recommend experimenting with the parameters of the models, in order to increase the correlation coefficient.
Fig. 5.204 Definition of the models included in the fitting and evaluation process
Fig. 5.205 Overview of selected models by the Auto Numeric node
Fig. 5.206 Scatterplot to visualize the model performance and the predictor importance
4. High correlation between the predicted values and the observed values in the Longley dataset is also affirmed by the Scatterplot in the “Graph” tab, displayed in Fig. 5.206. The plotted points form an almost perfect straight line.
Furthermore, Fig. 5.206 visualizes the importance of the predictors. All predictors are almost equally important. There is no input variable that outranks the others. This gives stability to the model, as the prediction is not dependent on only one input variable.
Exercise 2: Cross-Validation with the Auto Numeric Node
Name of the solution streams: test_scores_Auto_numeric_node
Theory discussed in section: Sect. 5.1, Sect. 5.1.2, Sect. 5.5
Figure 5.207 shows the final stream of this exercise. 1. To build the above stream, we first open the “001 Template-Stream test_scores” stream and save it under a different name. Now, add a Partition node to the stream and split the dataset into training, test, and validation sets, as described in Fig. 5.167. 2. To include both the model building and the validation process in a single node, we now add the Auto Numeric node to the stream and select the “posttest” field as the target variable, the Partition field as the partition and all other variables as inputs (see Fig. 5.208). 3. In the “Expert” tab, we specify the models that should be used in the modeling process. We only select the Generalized Linear Model (see Fig. 5.209). Then, we click on the “Specify” option to specify the model parameters of the different models that should be considered. This can be viewed in Fig. 5.209.
Fig. 5.207 Complete stream of the exercise to fit a regression into the test score data with the Auto Numeric node
Fig. 5.208 Definition of the input, target, and partitioning variables
4. In the opened pop-up window, we go to the “Expert” tab and click on the right “Option” field in the Link function, to define its parameters (see Fig. 5.210). Then, we select the “Power” function as the link function, see Fig. 5.211, and 0.5, 1, and 1.5 as the exponents (see Fig. 5.212). After confirming these selections by clicking on the “ok” button, all the final parameter selections are shown (see
Fig. 5.209 Definition of the GLM that is included in the modeling process
Fig. 5.213). Now, three models are considered in the Auto Numeric node, each having the “Power” function as link function, but with different exponents. 5. After running the stream, the model nugget appears. In the “Model” tab of the nugget, we see that the three models all fit the training data very well. The correlations are all high, around 0.981. As in Exercise 2 in Sect. 5.4.5, the highest ranked is the model that has exponent 1 (see Fig. 5.214). The differences between the models are minimal, however, and thus can be ignored. 6. If we look at the ranking of the test set, the order of the models changes. Now, the best-fitting model is model number 3, with exponent 1.5 (see Fig. 5.215). We
Fig. 5.210 Definition of the Link function
Fig. 5.211 Definition of the Power function as Link function
Fig. 5.212 Definition of the exponents of the link functions
Fig. 5.213 Final definition of the model parameters
Fig. 5.214 Build models ranked by correlation with the training set
Fig. 5.215 Build models ranked by correlation with the testing set
Fig. 5.216 Output of the Analysis node
double-check that all the models are properly fitted, and then we choose model 3 as the final model, based on the ranking of the test data. 7. After adding an Analysis node to the model nugget, we run the stream again (see Fig. 5.216 for the output of the Analysis node). We see once again that the model fits the data very well (low and equal RMSE, which is the standard deviation in the Analysis node, see Sect. 5.1.2), which in particular means that the validation of the final model is as good as for the testing and training sets. This coincides with the results of Exercise 2 in Sect. 5.4.5.
References

Abel, A. B., & Bernanke, B. (2008). Macroeconomics (Addison-Wesley series in economics). Boston: Pearson/Addison Wesley.
Boehm, B. W. (1981). Software engineering economics (Prentice-Hall advances in computing science and technology series). Englewood Cliffs, NJ: Prentice-Hall.
Cameron, A. C., & Trivedi, P. K. (2013). Regression analysis of count data (Econometric society monographs) (Vol. 53, 2nd ed.). Cambridge: Cambridge University Press.
Fahrmeir, L. (2013). Regression: Models, methods and applications. Berlin: Springer.
Gilley, O. W., & Pace, R. (1996). On the Harrison and Rubinfeld data. Journal of Environmental Economics and Management, 31(3), 403–405.
Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1), 81–102.
Hilbe, J. M. (2014). Modeling of count data. Cambridge: Cambridge University Press.
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679–688.
IBM. (2019a). IBM SPSS Modeler 18.2 algorithms guide. Accessed December 16, 2019, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.2/en/AlgorithmsGuide.pdf
IBM. (2019b). IBM SPSS Modeler 18.2 modeling nodes. Accessed December 16, 2019, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.2/en/ModelerModelingNodes.pdf
IBM. (2019c). IBM SPSS Analytics Server 3.2.1 overview. Accessed December 16, 2019, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/analyticserver/3.2.1/English/IBM_SPSS_Analytic_Server_3.2.1_Overview.pdf
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 103). New York: Springer.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer.
Kutner, M. H., Nachtsheim, C., Neter, J., & Li, W. (2005). Applied linear statistical models (The McGraw-Hill/Irwin series operations and decision sciences) (5th ed.). Boston: McGraw-Hill Irwin.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Thode, H. C. (2002). Testing for normality, statistics, textbooks and monographs (Vol. 164). New York: Marcel Dekker.
Tuffery, S. (2011). Data mining and statistics for decision making (Wiley series in computational statistics). Chichester: Wiley.
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design (McGraw-Hill series in psychology) (3rd ed.). New York: McGraw-Hill.
Zhou, Z.-H. (2012). Ensemble methods: Foundations and algorithms (Chapman & Hall/CRC machine learning & pattern recognition series). Boca Raton, FL: Taylor & Francis.
6
Factor Analysis
After finishing this chapter, the reader is able to: 1. Evaluate data using more complex statistical techniques such as factor analysis. 2. Explain the difference between factor and cluster analysis. 3. Describe the characteristics of principal component analysis and principal factor analysis. 4. Apply especially the principal component analysis and explain the results. Ultimately, the reader will be called upon to propose well-thought-out and practical business actions from the statistical results. 5. Distinguish between feature selection and feature reduction.
6.1
Motivating Example
Factor analysis is used to reduce the number of variables in a dataset, identify patterns, and reveal hidden variables. There is a wide range of applications: in social science, factor analysis is being used to identify hidden variables that can explain or that are responsible for behavioral characteristics. The approach can also be used in complex applications, such as face recognition. Various types of factor analysis are similar, in terms of calculating the final results. The steps are generally the same, but the assumptions, and therefore the interpretation of the results, are different. In this chapter, we want to present the key idea of factor analysis. Statistical terms are discussed if they are necessary for understanding the calculation and helping us to interpret the results. We use an example that represents the dietary characteristics of 200 respondents, as determined in a survey. The key idea of the dataset and some of the interpretations that can be found in this book are based on explanations from Bühl (2012). Here though, we use a completely new dataset. Using the categories “vegetarian”, “low meat”, “fast food”, “filling”, and “hearty”, the respondents were asked to rate the characteristics of their diet on an
(Fig. 6.1 shows profile charts of the survey question “Please rate how the following dietary characteristics describe your preferences . . .” for respondent 1, respondent 2, . . ., respondent n: the items vegetarian, low_meat, fast_food, filling, and hearty are each rated as 1 (never), 2 (sometimes), or 3 ((very) often); an annotation highlights the similarities in terms of variation for the variables “filling” and “hearty”.)
Fig. 6.1 The dietary characteristics of three respondents
ordinal scale. The concrete question was “Please rate how the following dietary characteristics describe your preferences . . .”. As depicted in Fig. 6.1, the answer of each respondent can be visualized in a profile chart (columns). Based on these graphs, we can find some similarities between the variables “filling” and “hearty”. If we analyze the profile charts row by row, we find that the variables are somehow similar because the answers often go in the same direction. In statistical terms, this means that the fluctuation of the corresponding answers (the same items) is “approximately” the same. "
Factor analysis is based on the idea of analyzing and explaining common variances in different variables.
"
If the fluctuation of a set of variables is somehow similar, then behind these variables a common “factor” can be assumed. The aim of factor analysis is to determine these factors, to define subsets of variables with a common proportion of variance.
"
The factors are the explanation and/or reason for the original fluctuation of the input variables and can be used to represent or substitute them in further analyses.
6.2
General Theory of Factor Analysis
If we think about fluctuation in terms of the amount of information that is in the data, we can typically divide the volatility into two components: 1. The part or percentage of fluctuation that different variables have in common, because a common but hidden variable exists in the “background” that is responsible for this volatility. 2. The residual fluctuation that is not related to the fluctuation of the other variables and therefore cannot be explained. So in the example of analyzing the characteristics of respondents’ diets, we want to extract and describe the reasons for the fluctuation, in order to explain and understand the habits of different types of consumers. Therefore, let us define what we want to call a factor loading, as well as the communality. "
Factor analysis can be used to reduce the number of variables. The algorithm determines factors that can explain the common variance of several variable subsets. The strength of the relationship between the factor and a variable is represented by the factor loadings.
"
The variance of one variable explained by all factors is called the communality of the variable. The communality equals the sum of the squared factor loadings.
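In a compact notation (assumed here; the Modeler dialogs do not use it): if $a_{jk}$ denotes the loading of variable $j$ on factor $k$ and $m$ factors are extracted, the communality of variable $j$ is

$$h_j^2 = \sum_{k=1}^{m} a_{jk}^2 .$$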
"
The factors determined by a Principal Factor Analysis (PFA) can be interpreted as “the reason for the common variance”, whereas the factors determined by a Principal Component Analysis (PCA) can be described as “the general description of the common variance” (see Backhaus 2011, p. 357). PCA is used much more often than PFA.
"
Both methods, PFA and PCA, can also be differentiated by the general approaches used to find the factors: the idea of the PCA is that the variance of each variable can be completely explained by the factors. If there are enough factors, there is no variance left. If the number of factors equals the number of variables, the explained variance proportion is 100%.
We want to demonstrate the PCA as well as the PFA in this chapter. Figure 6.2 outlines the structure of the chapter.
Fig. 6.2 Structure of the factor analysis chapter
Fig. 6.3 Using correlation or covariance matrix as the basis for factor analysis
Assessing the Quality of a Factor Analysis
In the SPSS Modeler, factor analysis can be done using a PCA/Factor node. In the expert settings of the node, shown in Fig. 6.3, the correlation or the covariance matrix can be defined as the basis of the calculations. The correlation matrix is used in the majority of applications. This is because the covariance depends on the units of the input variables. The quality of the factor analysis heavily depends on the correlation matrix; because of this, different measures have been developed to determine whether a matrix is appropriate and a reliable basis for the algorithm. We want to give here an overview of the different aspects that should be assessed.
Test for the Significance of Correlations
The elements of the matrix are the bivariate correlations. These correlations should be significant. The typical test of significance should be used.
Bartlett-Test/Test of Sphericity
Based on Dziuban and Shirkey (1974), the test assesses whether the sample comes from a population in which the input variables are uncorrelated, or in other words, whether the correlation matrix differs from the unit matrix only by chance. Based on the chi-square statistic, however, the Bartlett test is necessary but not sufficient. Dziuban and Shirkey (1974, p. 359) stated: “That is, if one fails to reject the independence hypothesis, the matrix need be subjected to no further analysis. On the other hand, rejection of the independence hypothesis with the Bartlett test is not a clear indication that the matrix is psychometrically sound”.
Inspecting the Inverse Matrix
Many authors recommend verifying that the non-diagonal elements of the inverse matrix are nearly zero (see Backhaus 2011, p. 340).
Inspecting the Anti-image-Covariance Matrix
Based on Guttman (1953), the variance can more generally be divided into an image and an anti-image. The anti-image represents the proportion of the variance that cannot be explained by the other variables (see Backhaus 2011, p. 342). For a factor analysis, the anti-image matrix should be a diagonal matrix. That means that the non-diagonal elements should be nearly zero (see Dziuban and Shirkey 1974). Some authors prefer values smaller than 0.09.
Measure of Sampling Adequacy (MSA): Also Called Kaiser–Meyer–Olkin Criterion (KMO)
Each element of the correlation matrix represents the bivariate correlation of two variables. Unfortunately, this correlation can also be influenced by other variables. For example, the correlation between price and sales volume typically depends also on marketing expenditures. With partial correlations, one can try to eliminate such effects of third variables and measure the strength of the correlation between only two variables. The MSA/KMO compares both types of correlation, based on the elements of the anti-image correlation matrix. For details see IBM (2011). Based on Kaiser and Rice (1974), values should be larger than 0.5, or better still, larger than 0.7 (see also Backhaus 2011, p. 343). In particular, the MSA/KMO criterion is widely used and strongly recommended in the literature on the subject. Unfortunately, the SPSS Modeler does not offer this or any of the other statistics or measures described above (a small sketch of how the Bartlett test can be computed by hand follows at the end of this overview). The user has to make sure that the variables and the correlation matrix are a reliable basis for a factor analysis. In the following section, we will show possibilities for assessing the quality of the matrix.
Number of Factors to Extract
Thinking about the aim of a factor analysis, the question at hand is how to determine the number of factors that will best represent the input variables. A pragmatic solution is to look at the cumulative variance explained by the number of factors extracted. Here, the user can decide if the proportion is acceptable. For this purpose, the Modeler offers specific tables.
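Returning to the adequacy measures above: since the Modeler does not report them, they have to be computed elsewhere if needed. As an illustration, the following sketch implements Bartlett's test of sphericity from its textbook chi-square approximation; it is a minimal stand-alone helper, not a Modeler feature.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(data):
    """Bartlett's test of sphericity.

    H0: the correlation matrix of the variables is an identity matrix,
    i.e., the variables are uncorrelated and unsuitable for factor analysis.
    `data` is a 2D array (rows = cases, columns = variables).
    """
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(corr))
    df = p * (p - 1) / 2.0
    p_value = chi2.sf(statistic, df)
    return statistic, p_value

# Usage (placeholder): stat, p = bartlett_sphericity(survey_answers)
```

A small p-value means the independence hypothesis is rejected, which is a necessary (but, as noted above, not sufficient) condition for a meaningful factor analysis.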
Fig. 6.4 Dimension reduction and eigenvectors
Eigenvalues can help to determine the number of components to extract too, as illustrated in Fig. 6.4 for a dataset with just three variables. We can consider a plane with the objects as points pinned on it. After determining this plane, we can orientate by using just two vectors. These vectors are called eigenvectors. For each eigenvector, there exists an eigenvalue. The eigenvalue represents the volatility (variance) that is in the data in this direction. So if we order the eigenvalues in descending order, we can stepwise extract the directions with the largest volatility. We call these the principal components. Normally, principal components with an eigenvalue larger than one should be used (see Kaiser 1960, p. 146). This rule is also implemented and activated in the SPSS Modeler, as shown in Fig. 6.3. Based on the eigenvalues and the number of factors extracted, a so-called “scree plot” can be created (see Cattell 1966). Other statistics software packages offer this type of diagram. The SPSS Modeler unfortunately does not. We will show in an example how to create it manually.
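As a preview of how such a scree plot can be created manually outside the Modeler, the following Python sketch (our own illustration; the correlation matrix is hypothetical) computes the eigenvalues of a correlation matrix and plots them:

```python
# Sketch: eigenvalues of a correlation matrix and a manually drawn scree plot.
import numpy as np
import matplotlib.pyplot as plt

R = np.array([[1.0, 0.6, 0.3],          # hypothetical correlation matrix
              [0.6, 1.0, 0.4],
              [0.3, 0.4, 1.0]])

eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]   # descending order
print("Eigenvalues:", eigenvalues)
print("Components with eigenvalue > 1 (Kaiser criterion):", int((eigenvalues > 1).sum()))

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, "o-")
plt.axhline(1.0, linestyle="--")        # Kaiser criterion line
plt.xlabel("Component number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```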
6.3 Principal Component Analysis
6.3.1 Theory
Principal component analysis (PCA) is a method for determining factors that can be used to explain the common variance in several variables. As the name of the method suggests, the PCA tries to reproduce the variance of the original variables with a smaller set of components, the principal components. As outlined in the previous section, the identified factors or principal components can be described as “the general description [not reason!] of the common variance” in the case of a PCA (see Backhaus 2011, p. 357).
In this chapter, we want to extract the principal components of variables and thereby reduce the actual number of variables. We are more interested in finding “collective terms” (see Backhaus 2011, p. 357) or “virtual variables”, and not the cause for the dietary habits of the survey respondents. That is why we use the PCA algorithm here. There are several explanations of the steps in a PCA calculation. The interested reader is referred to Smith (2002). "
PCA identifies factors whose loadings represent the strength of the relationship between the hidden factors and the input variables. The squared factor loadings equal the common variance in the variables. The factors are ordered by their size or by the proportion of variance in the original variables that they can explain.
"
PCA can be used to:
1. Identify hidden factors that can be used as “collective terms” to describe (not explain) the behavior of objects, consumers, etc.
2. Identify the most important variables and reduce the number of variables.
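Both uses can be illustrated with a short sketch in Python using scikit-learn. This is an external implementation, not the Modeler's PCA/Factor node, and the data matrix below is purely hypothetical:

```python
# Sketch: extracting two principal components with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))           # hypothetical standardized data (200 respondents, 5 variables)

pca = PCA(n_components=2)               # extract two principal components
scores = pca.fit_transform(X)           # component scores per respondent

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Loadings (components x variables):")
print(pca.components_)
```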
6.3.2 Building a Model in SPSS Modeler
Description of the model
Stream name: pca_nutrition_habits.str (related to the explanation in this chapter); extended version with standardized values: pca_nutrition_habits_standardized.str
Based on dataset: pca_nutrition_habites.sav
Stream structure
Related exercises: all exercises in Sect. 6.3.3
Fig. 6.5 Template-Stream_nutrition_habits
The data we want to use in this section describe the answers of respondents to questions on their dietary habits. Based on the categories “vegetarian”, “low meat”, “fast food”, “filling”, and “hearty”, the respondents rated the characteristics of their diets on an ordinal scale. The concrete question was “Please rate how the following dietary characteristics describe your preferences . . .”. The scale offered the values “1 = never”, “2 = sometimes”, and “3 = (very) often”. For details see also Sect. 12.1.28. We now want to create a stream to analyze the answers and to find common factors that help us to describe the dietary characteristics. We will also explain how PCA can help to reduce the number of variables (we call that variable clustering), and help cluster similar respondents in terms of their dietary characteristics.
1. We open the Template stream “Template-Stream_nutrition_habits”, which is also shown in Fig. 6.5. This stream is a good basis for adding a PCA calculation and interpreting the results graphically. To become familiar with the data, we run the Table node. In the dialog window with the table, we activate the option “Display field and value labels”, which is marked in Fig. 6.6 with an arrow. As we can see, the respondents answered five questions regarding the characteristics of their diets. The allowed answers were “never”, “sometimes”, and “(very) often”. Checking the details of the Filter node predefined in the stream, we realize that here the respondent's ID is removed from the calculations. This is because the ID is unnecessary for any calculation (Fig. 6.7). "
There are at least three options for excluding variables from a stream:
– In a (Statistics) File node, the variables can be disabled in the tab “Filter”.
– The role of the variables can be defined as “None” in a Type node.
– A Filter node can be added to the stream to exclude a variable from usage in any nodes that follow.
Fig. 6.6 Records from the dataset “nutrition_habites.sav”
Fig. 6.7 Filter node and its parameter
Fig. 6.8 Scale types and the roles of variables in “nutrition_habites.sav”
"
The user should decide which option to use. We recommend creating transparent streams that each user can understand, without having to inspect each node. So we prefer Option 3.
Now we want to reduce the number of variables and give a general description of the behavior of the respondents. Therefore, we perform a PCA. To do so, we should verify the scale types of the variables and their role in the stream. We double-click the Type node. Figure 6.8 shows the result. The ID was excluded with a Filter node. All the other variables, with their three codes “1 = never”, “2 = sometimes”, and “3 = (very) often”, are ordinally scaled. As explained in detail in Sect. 4.5, we want to calculate the correlations between the variables. That means we determine the elements of the correlation matrix. Here, we have ordinal input variables, which ask for Spearman's rho as an appropriate measure (see for instance Scherbaum and Shockley 2015, p. 92). Pearson’s correlation coefficient is an approximation, based on the assumption that the distance between the scale items is equal. The number of correlations that are at most weak (smaller than 0.3) should be relatively low. Otherwise, the dataset or the correlation matrix is not appropriate for a PCA.
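For readers who want to cross-check such correlations outside the Modeler, the following Python sketch (our own illustration with hypothetical ordinal answers) computes both Spearman's rho and Pearson's r for a pair of variables:

```python
# Sketch: comparing Spearman's rho (appropriate for ordinal ratings) with
# Pearson's r (the approximation used by the Modeler).
import numpy as np
from scipy.stats import spearmanr, pearsonr

vegetarian = np.array([1, 2, 3, 2, 1, 3, 2, 1])   # hypothetical ordinal answers (1-3)
fast_food  = np.array([3, 2, 1, 2, 3, 1, 2, 3])

rho, p_rho = spearmanr(vegetarian, fast_food)
r, p_r = pearsonr(vegetarian, fast_food)
print(f"Spearman's rho = {rho:.3f} (p = {p_rho:.3f})")
print(f"Pearson's r    = {r:.3f} (p = {p_r:.3f})")
```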
Fig. 6.9 Sim Fit node is added to calculate the correlation matrix
"
The SPSS Modeler does not provide the typical measures used to assess the quality of the correlation matrix, e.g., the inverse matrix, the anti-image-covariance matrix, and the Measure of Sampling Adequacy (MSA)—also called the Kaiser–Meyer–Olkin criterion (KMO).
"
After reviewing the scale types of the variables, the correlation matrix should be inspected. The number of correlations that are at most weak or very low (below 0.3) should be relatively small.
"
It is important to realize that the Modeler determines Pearson's correlation coefficient, which is normally appropriate for metrically (interval or ratio) scaled variables. Assuming constant distances between the scale items, the measure can also be used for ordinally scaled variables, but this would only be an approximation.
To calculate the correlation matrix, we can use a Sim Fit node as explained in Sect. 4.5. We add this type of node to the stream from the Output tab of the Modeler and connect it with the Type node (Fig. 6.9).
2. Running the Sim Fit node and reviewing the results, we can find the correlation matrix as shown in Fig. 6.10. With 200 records, the sample size is not too large. The number of correlations that are at most weak (smaller than 0.3) is 4 out of 10, or 40%, and so not very small. However, reviewing the correlations shows that they can be explained logically. For example, the small correlation of 0.012 between the variables “fast_food” and “vegetarian” makes sense. We therefore accept the matrix as a basis for a PCA.
3. As also explained in Sect. 4.5, the Sim Fit node calculates the correlations based on an approximation of the frequency distribution. The determined approximation of the distributions can result in misleading correlation coefficients. The Sim Fit node is therefore a good tool, but the user should verify the results by using other functions, e.g., the Statistics node. To be sure that the results reflect the correlations correctly, we want to calculate the Pearson correlation with a Statistics node. We add this node to the stream and connect it with the Type node (see Fig. 6.11).
Fig. 6.10 Correlation matrix is determined with the Sim Fit node
Fig. 6.11 Statistics node is added to the stream
Fig. 6.12 Parameter of the Statistics node
4. The parameters of the Statistics node must be modified as follows:
– We have to add all the variables in the field “Examine” by using the drop-down list button on the right-hand side. This button is marked with an arrow in Fig. 6.12. Now all the statistics for the frequency distributions of the variables can be disabled.
– Finally, all the variables must also be added to the field “Correlate”. Once more the corresponding button on the right-hand side of the dialog window must be used. Now we can run the Statistics node.
5. The results shown in Fig. 6.13 can be found in the last row or the last column of the correlation matrix in Fig. 6.10. The values of the matrix are the same. So we can accept our interpretation of the correlation matrix and close the window.
N.B.: We outlined that there are several statistics to verify and to ensure the correlation matrix is appropriate for a PCA. In the download section of this book, an R-script for this PCA is also available. There, the following statistics can additionally be determined:
KMO statistic: 0.66717 (>0.5 necessary)
Bartlett significance: 4.8758e-152 (<0.05 necessary)
Table 7.2 Characteristics of three products A, B, and C described by binary variables (e.g., durability > 1 year, recyclable packaging), coded as 1 = characteristic present and 0 = not present
Table 7.3 Contingency table for the Tanimoto coefficient formula
              | Object 2 = 1 | Object 2 = 0
Object 1 = 1  | a            | b
Object 1 = 0  | c            | d
For qualitative variables, we should focus on measuring similarities between the objects. Table 7.2 shows an example of three products and four binary variables. If the variables of two objects have the same expression of characteristics, then they are somehow similar. In Table 7.2 these are products A and B. Considering such binary variables, several similarity functions can be used, e.g., the Tanimoto, simple matching, or Russel and Rao coefficients. Nonbinary qualitative variables must be recoded into a set of binary variables. This is discussed in more detail in Exercise 2 of Sect. 7.2.1. The interested reader is referred to Timm (2002, pp. 519–522) and Backhaus (2011, p. 402). To present the principal procedure, we want to calculate the Tanimoto coefficient here. Its formula
s_ij = a / (a + b + c)
is based on the contingency table shown in Table 7.3. As we can see, the Tanimoto coefficient is the proportion of the common characteristics a to the characteristics that are present in at least one object, represented by a + b + c. Other measures, e.g., the Russel and Rao coefficient, use other proportions. See Timm (2002, p. 521). Given the example of products A and B in Table 7.2, we determine the frequencies as shown in Table 7.4. The Tanimoto coefficient is then
s_AB = 2 / (2 + 2 + 1) = 0.4
The solution can also be found in the Microsoft Excel file, “cluster dichotomous variables example.xlsx”.
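The same calculation can be sketched in a few lines of Python. The helper function below is our own illustration; the two binary vectors are hypothetical but chosen so that a = 2, b = 2, and c = 1, as in Table 7.4:

```python
# Sketch: Tanimoto coefficient for two binary characteristic vectors,
# following the contingency-table logic of Table 7.3.
import numpy as np

def tanimoto(x, y):
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    a = np.sum(x & y)            # characteristics present in both objects
    b = np.sum(x & ~y)           # present only in the first object
    c = np.sum(~x & y)           # present only in the second object
    return a / (a + b + c)

product_A = [1, 1, 1, 1, 0]      # hypothetical coding, giving a = 2, b = 2, c = 1
product_B = [1, 1, 0, 0, 1]
print(tanimoto(product_A, product_B))   # 0.4, as calculated above
```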
Table 7.4 Contingency table for products A and B presented in Table 7.2
              | Product B = 1 | Product B = 0
Product A = 1 | 2             | 2
Product A = 0 | 1             | 0
Fig. 7.6 2D-plot of objects represented by two variables
In the case of quantitative (metrical) variables, the geometric distance can be used to measure the dissimilarity of objects. The larger the distance, the less similar, or the more dissimilar, the objects are. Considering the prices of two products, we can use, e.g., the absolute value of their difference as the distance. If two or three metrical variables are available to describe the product characteristics, then we can create a diagram as shown in Fig. 7.6 and measure the distance by using the Euclidean distance, known from school math. This approach can also be used in n-dimensional vector space. If a more outlier-sensitive measure is needed, we can use the squared Euclidean distance instead. Table 7.5 gives an overview of the different measures, depending on the scale type of the variables. "
Proximity measures are used to identify objects that belong to the same subgroup in a cluster analysis. They can be divided into two groups: similarity and dissimilarity measures. Nominal variables are recoded into a set of binary variables, before similarity measures are used. Dissimilarity measures are mostly distance-based. Different approaches/metrics exist to measure the distance between objects described by metrical variables.
"
The SPSS Modeler offers the log-likelihood or the Euclidean distance measures. In the case of the log-likelihood measure, the variables have to be assumed as independent. The Euclidean distance can be calculated for only continuous variables.
Table 7.5 Overview of proximity measures

Nominally scaled variables: Similarity measures are used to determine the similarity of objects. Examples: Tanimoto, simple matching, Russel and Rao coefficients.

Metrical variables (at least interval scaled): Distance measures are used to determine the dissimilarity. See also Exercise 4 in Sect. 7.2.1. Examples: Object x and object y are described by (variable1, variable2, ..., variablen) = (x1, x2, ..., xn) and (y1, y2, ..., yn). Using the vector components xi and yi, the metrics are defined as follows:

Minkowski metric (L-metric):
d = (Σ_{i=1}^{n} |xi − yi|^r)^(1/r)
Considering two components per vector, it is
d = (|x1 − y1|^r + |x2 − y2|^r)^(1/r)
With specific values of r, this becomes:

City-block metric (L1-metric with r = 1):
d = Σ_{i=1}^{n} |xi − yi|

Euclidean distance (L2-metric with r = 2), see also Fig. 7.6:
d = √( Σ_{i=1}^{n} (xi − yi)² )
This measure can be used only if all variables are continuous.

Squared Euclidean distance:
d = Σ_{i=1}^{n} (xi − yi)²
This measure can be used only if all variables are continuous.

Log-likelihood distance measure:
As explained in Sect. 4.7, a chi-square distribution can be used with contingency tables to determine the probability that the observed frequencies come from a specific distribution. The log-likelihood distance measure is a probability-based distance. The decrease in the likelihood, caused by combining two objects/clusters, is a measure of the distance between them. See IBM (2015a, pp. 398–399) for details. The method assumes that all variables are independent, but it can deal with metrical and categorical variables at the same time.
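The distance measures of Table 7.5 can also be computed outside the Modeler, e.g., with SciPy. The following sketch is our own illustration for two hypothetical objects:

```python
# Sketch: the distance measures from Table 7.5 for two hypothetical objects x and y.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 4.0, 2.0])
y = np.array([3.0, 1.0, 2.0])

print("City-block (L1):   ", distance.cityblock(x, y))      # sum of |x_i - y_i|
print("Euclidean (L2):    ", distance.euclidean(x, y))       # sqrt of sum of squares
print("Squared Euclidean: ", distance.sqeuclidean(x, y))     # sum of squares
print("Minkowski (r = 3): ", distance.minkowski(x, y, p=3))  # general L_r metric
```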
The SPSS Modeler implements two clustering methods in the classical sense: TwoStep and K-Means. Additionally, the Kohonen algorithm, a specific type of neural network, can be used for classification purposes. The Auto Cluster node summarizes all these methods and helps the user to find an optimal solution. Table 7.6 includes an explanation of these methods. To understand the advantages and disadvantages of the methods mentioned, it is helpful to understand, in general, the steps involved in clustering algorithms. Therefore, we will explain the theory of cluster algorithms in the following section. After that, we will come back to TwoStep and K-Means.
Table 7.6 Clustering methods implemented in the SPSS Modeler

TwoStep: The algorithm handles datasets by using a tree in the background. Based on a comparison of each object with the previously inspected objects, TwoStep, in the first step, assigns each object to an initial cluster. The objects are organized in the form of a tree. If this tree exceeds a specific size, it will be reorganized. Due to the step-by-step analysis of each object, the result depends on the order of the records in the dataset. Reordering may lead to other results. See IBM (2015b), p. 201. Next, hierarchical clustering is used to merge the predefined clusters stepwise. We will explain and use the TwoStep algorithm in Sect. 7.3. The technical documentation can be found in IBM (2015b, pp. 201–207) and IBM (2015a, pp. 397–401).

K-Means: The theory of clustering algorithms shows that comparing all objects one-by-one with all other objects is time-consuming and especially hard to handle with large datasets. As a first step, K-Means determines cluster centers within the data. Then each object is assigned to the cluster center with the smallest distance. The cluster centers are recalculated and the clusters are optimized by rearranging some objects. The process ends if an iteration does not improve the quality, e.g., no object is assigned to another cluster. For more details and applications, see Sect. 7.4, as well as IBM (2015b, pp. 199–200).

Kohonen: This is an implementation of a neural network. Based on training data, a self-organizing map is created. This map is used to identify similar objects. New objects presented to the network are compared with the learned pattern. A new object will be assigned to the class whose objects are most similar to it. So an automated classification is implemented. Details of this algorithm are presented in Sect. 7.5.1. An application can be found in Exercise 2 of Sect. 7.5.3. See also IBM (2015b, pp. 196–199).

Auto Cluster: The Auto Cluster node uses TwoStep, K-Means, and the Kohonen algorithm. Here, the SPSS Modeler tries to determine models that can be used for clustering purposes. The theory and application of this node are explained in Sect. 7.5. Further details can also be found in IBM (2015b, pp. 67–69).
Based on this theory, we can understand the difficulties in dealing with clustering algorithms and how to choose the most appropriate approach for a given dataset.
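As an illustration of the K-Means procedure summarized in Table 7.6, the following Python sketch uses scikit-learn's implementation (an external library, not the Modeler's K-Means node) on hypothetical data:

```python
# Sketch: the general K-Means idea (centers, assignment, iteration) with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),   # two artificial groups of objects
               rng.normal(5, 1, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print("Cluster centers:\n", kmeans.cluster_centers_)
print("First ten cluster assignments:", kmeans.labels_[:10])
```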
7.2.1 Exercises
Exercise 1: Recap Cluster Analysis
Please answer the following questions:
1. Explain the difference between hierarchical and partitioning clustering methods. For each method, name one advantage and one disadvantage.
2. Consider a given dataset with six objects, as shown in Fig. 7.7.
(a) How many variables are defined per object?
(b) Show the difference between a divisive and an agglomerative approach.
Fig. 7.7 Given dataset with six objects
Exercise 2: Similarity Measures in Cluster Analysis
1. Define in your own words what is meant by proximity, similarity, and distance measures.
2. Name one example of a measure of similarity, as well as one measure for determining dissimilarity/distances.
3. Explain why we can’t use distance measures for qualitative variables.
4. Similarity measures can typically only deal with binary variables. Consider a nominal or ordinal variable with more than two values. Explain a procedure to recode the variable into a set of binary variables.
5. In the theoretical explanation, we discussed how to deal with binary and with quantitative/metrical variables in cluster analysis. For binary variables, we use similarity measures, as outlined in this section using the Tanimoto coefficient. In the case of metrical variables, we also have a wide range of measures to choose from. See Table 7.5. Consider a dataset with binary and metrical variables that should be the basis of a cluster analysis. Outline at least two possibilities of how to use all of these variables in a cluster analysis. Illustrate your approach using an example.
Exercise 3: Calculating the Tanimoto Similarity Measure
Table 7.2 shows several variables and their given values for three products. In the theory, we explained how to use the Tanimoto similarity measure. In particular, we calculated the similarity coefficient for products A vs. B. Please calculate the Tanimoto similarity coefficient for products A vs. C, and additionally for products B vs. C.
Exercise 4: Distance Measures in Cluster Analysis
1. Figure 7.6 illustrates two objects represented by two variables. Consider a firm with employees P1 and P2. Table 7.7 shows their characteristics: the length of their employment with the company and their monthly net income. Using the formulas in Table 7.5, calculate the City-block metric (L1-metric), the Euclidean distance (L2-metric), and the squared Euclidean distance.
Table 7.7 Employee dataset
Object | Employment with a company x1 [months] | Monthly net income x2 [USD]
P1 | 10 | 2400
P2 | 25 | 3250
2. In the theoretical part, we said that the squared Euclidean distance is more outlier-sensitive than the Euclidean distance itself. Explain!
7.2.2 Solutions
Exercise 1: Recap Cluster Analysis
1. See Sect. 7.1.
2. For each object, two variables are given. We know this because in Fig. 7.7 we can see a two-dimensional plot with one variable on each axis. For a segmentation result using an agglomerative as well as a divisive method, see Fig. 7.8. Here, we will get three clusters, however, because the “star” object has a relatively large distance from all the other objects.
Fig. 7.8 Comparison between agglomerative and divisive clustering procedures
Exercise 2: Similarity Measures in Cluster Analysis
1. A similarity measure helps to quantify the similarity of two objects. The measure is often used for qualitative (nominal or ordinal) variables. A distance measure calculates the geometrical distance between two objects. It can therefore be interpreted as a dissimilarity measure. Similarity and distance measures together are called proximity measures.
2. See Table 7.5 for examples.
3. As explained in Sects. 3.1.1 and 3.1.2, nominally and ordinally scaled variables can be called qualitative. The characteristics of both scales are that the values can’t be ordered (nominal scale) and that we can’t calculate a distance between the values (nominal and ordinal scale). For an example, as well as further theoretical details, see Table 7.8. Due to the missing interpretation of distances, we can’t use distance measures such as the Euclidean distance.
4. Considering an ordinal variable (e.g., hotel categories 2-star, ..., 5-star), Table 7.9 shows how to recode it into a set of binary variables. Given n categories we need n − 1 binary variables.
5. Given a binary variable, we can use, e.g., the Tanimoto coefficient. Unfortunately, this measure can’t be used for metrical values, e.g., the price of a product. Now we have two options:
(a) Recode the metrical variable into a dichotomous variable: By defining a threshold, e.g., the median of all prices, we can define a new binary variable: 0 = below median and 1 = above median. This variable can then be used in the context of the other variables and the similarity measure. A disadvantage is the substantial reduction of information in this process.

Table 7.8 Comparison of the characteristics of qualitative variables
Nominal — Example: Nonperforming loan (yes or no). Comment: Nominally scaled variables have no inherent order that can be used to order the values ascending or descending. Distances can never be measured.
Ordinal — Examples: Categories of hotels (2-star, 3-star, etc.); satisfaction of customers with a product (very satisfied, satisfied, etc.). Comment: Ordinally scaled variables have a naturally defined inherent order. In the case of the mentioned hotel categories, it is clear that there is a difference between a 2-star and a 3-star hotel, but we cannot specify the distance further. Sometimes we can find a scale representation 1, 2, ..., 5 (often called a Likert-type scale). Nevertheless, the distances should be interpreted with care, or not at all. This fact has created controversy in the literature. See Vogt et al. (2014), p. 34. The researcher can’t be sure that the distance between 1 and 2 equals the distance between 2 and 3, etc.
Table 7.9 Scheme for recoding a nominal or ordinal variable into a binary variable
        | Binary variable 1 | Binary variable 2 | Binary variable 3
2-star  | 0 | 0 | 0
3-star  | 1 | 0 | 0
4-star  | 1 | 1 | 0
5-star  | 1 | 1 | 1
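The recoding scheme of Table 7.9 can be sketched in a few lines of Python (our own illustration; the category labels are those of the hotel example):

```python
# Sketch: recoding an ordinal variable into n - 1 binary variables (Table 7.9 scheme).
categories = ["2-star", "3-star", "4-star", "5-star"]

def recode_ordinal(value, categories):
    """The k-th binary indicator is 1 if the value ranks above the k-th category."""
    rank = categories.index(value)
    return [1 if rank > k else 0 for k in range(len(categories) - 1)]

for hotel in categories:
    print(hotel, recode_ordinal(hotel, categories))
# 2-star [0, 0, 0], 3-star [1, 0, 0], 4-star [1, 1, 0], 5-star [1, 1, 1]
```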
(b) Recode the metrical variable into an ordinal variable: In comparison with the approach using only a binary variable, we can reduce the loss of precision by defining more than one threshold. We get more than two price classes (1, 2, 3, etc.) for the products, which represent an interval or ordinally scaled variable. This variable can’t be used with the similarity measure either. As described in the answer to the previous question, we have to recode the variable into a set of binary variables.
Exercise 3: Calculating the Tanimoto Similarity Measure
To calculate the Tanimoto similarity coefficient for products A vs. C, as well as for products B vs. C, the contingency tables must be determined. Then the formula can be used. The solution can also be found in the Microsoft Excel file “cluster dichotomous variables example.xlsx”.
Products A vs. C:
              | Product C = 1 | Product C = 0
Product A = 1 | 1             | 3
Product A = 0 | 1             | 0
The Tanimoto coefficient is
s_AC = 1 / (1 + 3 + 1) = 0.2

Products B vs. C:
              | Product C = 1 | Product C = 0
Product B = 1 | 1             | 2
Product B = 0 | 1             | 0
The Tanimoto coefficient is
s_BC = 1 / (1 + 2 + 1) = 0.25
Table 7.10 Distances based on different metrics
Metric | Distance
City-block metric (L1-metric): d = Σ_{i=1}^{n} |xi − yi| | d = |10 − 25| + |2400 − 3250| = 865
Euclidean distance (L2-metric): d = √( Σ_{i=1}^{n} (xi − yi)² ) | d = √((10 − 25)² + (2400 − 3250)²) = 850.13
Squared Euclidean distance: d = Σ_{i=1}^{n} (xi − yi)² | d = (10 − 25)² + (2400 − 3250)² = 722725
Exercise 4: Distance Measures in Cluster Analysis
1. Table 7.10 shows the solution.
2. The Euclidean distance (L2-metric) can be calculated by
d = √( Σ_{i=1}^{n} (xi − yi)² )
whereas the squared Euclidean distance follows by
d = Σ_{i=1}^{n} (xi − yi)²
Considering the one-dimensional case of two given values 1 and 10, we can see that the Euclidean distance is 9 and the squared Euclidean distance is 81, but the reason for the outlier-sensitivity has nothing to do with simply getting a larger number! If we have two objects with a larger distance, we have to square the difference of their components, but in the case of the Euclidean distance, we reduce this effect by calculating the square root at the end. See also the explanation of the standard deviation. There, we can find the same approach for defining an outlier-sensitive measure of volatility in statistics.
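The distances of Table 7.10 can be verified with a few lines of Python (our own illustration):

```python
# Sketch: verifying the distances of Table 7.10 for the two employees P1 and P2.
import numpy as np

p1 = np.array([10, 2400])      # months of employment, monthly net income in USD
p2 = np.array([25, 3250])

diff = np.abs(p1 - p2)
print("City-block (L1):   ", diff.sum())                            # 865
print("Euclidean (L2):    ", round(np.sqrt((diff ** 2).sum()), 2))  # 850.13
print("Squared Euclidean: ", (diff ** 2).sum())                     # 722725
```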
7.3 TwoStep Hierarchical Agglomerative Clustering
7.3.1 Theory of Hierarchical Clustering
Example vs. Modeler Functionalities
To understand cluster analysis, we think it would be helpful to discuss a simple example. We want to show the several steps involved in finding clusters of objects that are “similar”. Going through each of the steps, parameters and challenges can be identified.
In the end, it will be easier to understand the results of the clustering algorithms that the SPSS Modeler provides. We recommend studying the details presented in the following paragraphs, but if more theoretical information is not of interest, the reader can also proceed directly to Sect. 7.3.2.
Clustering Sample Data
Remark: the data, as well as the distance measures and matrices explained in this section, can also be found in the Microsoft Excel spreadsheet “Car_Simple_Clustering_distance_matrices.xlsx”.
So far, we have discussed the aim and general principle of cluster analysis algorithms. The SPSS Modeler offers different clustering methods. These methods are certainly advanced; however, to understand the advantages and the challenges of using clustering algorithms, we want to explain here the idea of hierarchical clustering in more detail. The explanation is based on an idea presented in Handl (2010, pp. 364–383), but we will use our own dataset. The data represent prices of six cars in different categories. The dataset includes the name of the manufacturer, the type of car, and a price. Formally, we should declare that the prices are not representative for the models and types mentioned. Table 7.11 shows the values. The only variable that can be used for clustering purposes here is the price. The car ID is nominally scaled, and we could use a similarity measure (e.g., the Tanimoto coefficient) for such variables; but remember that the IDs are arbitrarily assigned to the cars. Based on the data given in this example, we can calculate the distance between the objects. We discussed distance measures for metrical variables in the previous section. The Euclidean distance between cars 1 and 2 is
d_Euclidean(1; 2) = √((13 − 19)²) = 6
Due to its increased outlier-sensitivity (see Exercise 4, Question 2, in the previous section), a lot of algorithms use the squared Euclidean distance, which in this case is
Table 7.11 Prices of different cars
ID | Manufacturer | Model | Dealer | (Possible) Price in 1000 USD
1 | Nissan | Versa | ABC motors | 13
2 | Kia | Soul | California motors | 19
3 | Ford | F-150 | Johns test garage | 27.5
4 | Chevrolet | Silverado 1500 | Welcome cars | 28
5 | BMW | 3 series | Four wheels fine | 39
6 | Mercedes-Benz | C-class | Best cars ever | 44
d_squared Euclidean(1; 2) = (13 − 19)² = 36
The K-Means algorithm of the SPSS Modeler is based on this measure. See IBM (2015a, pp. 229–230). We will discuss this procedure later. For now we want to explain with an example how a hierarchical cluster algorithm works in general. The following steps are necessary:
1. Calculating the similarities/distances (also called proximity measures) of the given objects.
2. Arranging the measures in a similarity/distance matrix.
3. Determining the objects/clusters that are “most similar”.
4. Assigning the identified objects to a cluster.
5. Calculating the similarities/distances between the new cluster and all other objects. Updating the similarity/distance matrix.
6. If not all objects are assigned to a cluster, go to step 3, otherwise stop.
Initially, the objects are not assigned to a specific cluster, but mathematically we can say that each of them forms a separate cluster. Table 7.12 shows the initial status of the clustering.
Step 1: Calculating the Similarities/Distances
In the example of the car prices used here, the distance measure is quite simple to determine because we have just one dimension. Figure 7.9 visualizes the original data given.
Table 7.12 Overview of initial clusters and the objects assigned
Cluster description | Objects/cars that belong to the cluster
Cluster 1 | 1
Cluster 2 | 2
Cluster 3 | 3
Cluster 4 | 4
Cluster 5 | 5
Cluster 6 | 6
Fig. 7.9 Visualization of the given data (the six cars plotted along a price axis in 1,000 USD: 13, 19, 27.5, 28, 39, 44)
The squared Euclidean distance for each pair of objects can be calculated easily. Although we have six cars here, we do not have to calculate 36 distances: the distance between each object and itself is zero and the distances are symmetrical, e.g., d(1; 2) = d(2; 1). For the distances we get:
d(1; 2) = (13 − 19)² = 36
d(1; 3) = (13 − 27.5)² = 210.25
d(1; 4) = (13 − 28)² = 225
d(1; 5) = (13 − 39)² = 676
d(1; 6) = (13 − 44)² = 961
d(2; 3) = (19 − 27.5)² = 72.25
d(2; 4) = (19 − 28)² = 81
...
d(6; 3) = (44 − 27.5)² = 272.25
d(6; 4) = (44 − 28)² = 256
d(6; 5) = (44 − 39)² = 25
Step 2: Arranging the Measures in a Similarity/Distance Matrix
The rows of the distance matrix represent the object where to start the distance measurement. The columns stand for the end point in the distance measurement. Using the notation d(1; 2) = 36, the distance will be assigned to the cell in the first row, second column of the distance matrix. Table 7.13 shows the final matrix. All matrices discussed in this section can also be found in the Microsoft Excel spreadsheet “Car_Simple_Clustering_distance_matrices.xlsx”.
Start of Iteration 1
We can see in the steps of the algorithm that more than one iteration will be necessary, most of the time. In the following paragraphs, we extend the description of steps 3 to 6 with the number of the iteration.
Table 7.13 Initial distance matrix of the car data
From \ To | 1 | 2 | 3 | 4 | 5 | 6
1 | 0 | 36 | 210.25 | 225 | 676 | 961
2 |   | 0 | 72.25 | 81 | 400 | 625
3 |   |   | 0 | 0.25 | 132.25 | 272.25
4 |   |   |   | 0 | 121 | 256
5 |   |   |   |   | 0 | 25
6 |   |   |   |   |   | 0

Table 7.14 Overview of clusters and the objects assigned after iteration 1
Cluster description | Objects/cars that belong to the cluster
Cluster 1 | 1
Cluster 2 | 2
Cluster 3 new | 3; 4
Cluster 5 | 5
Cluster 6 | 6
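The initial distance matrix of Table 7.13 can be reproduced with a few lines of Python (our own illustration, using the car prices of Table 7.11):

```python
# Sketch: squared Euclidean distances between the six car prices (Table 7.13).
import numpy as np

prices = np.array([13.0, 19.0, 27.5, 28.0, 39.0, 44.0])    # cars 1-6, in 1,000 USD
D = (prices[:, None] - prices[None, :]) ** 2                # squared differences
print(np.triu(D))    # the upper triangle corresponds to Table 7.13
```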
Iteration 1, Step 3: Determining the Objects/Clusters That Are “Most Similar”
To keep this example simple, we want to call the pair of objects with the smallest distance the “most similar”. We will find out later that this is not always the best definition of “similarity”, but for now it is an appropriate approach. The minimum distance is d(3; 4) = 0.25.
Iteration 1, Step 4: Assigning the Identified Objects to a Cluster
For that reason, the cars with IDs 3 and 4 are assigned to one cluster. We call it “cluster 3 new”. Table 7.14 gives an overview.
Iteration 1, Step 5: Calculating the Similarities/Distances of the New Cluster, Update of the Similarity/Distance Matrix
In the cluster, we now find cars 3 and 4. The distance from each of the other cars 1; 2; 5; 6 to this cluster can be determined pairwise. Based on the distances in Table 7.13, it is
d(3; 1) = 210.25
d(4; 1) = 225
Once more based on the principle that the smallest distance will be used, it is
d({3; 4}; 1) = 210.25
Table 7.15 shows the new distance matrix.
Table 7.15 Distance matrix cluster step 1
From \ To | 1 | 2 | 3; 4 | 5 | 6
1 | 0 | 36 | 210.25 | 676 | 961
2 |   | 0 | 72.25 | 400 | 625
3; 4 |   |   | 0 | 121 | 256
5 |   |   |   | 0 | 25
6 |   |   |   |   | 0

Table 7.16 Overview of clusters and the objects assigned after iteration 2
Cluster description | Objects/cars that belong to the cluster
Cluster 1 | 1
Cluster 2 | 2
Cluster 3 | 3; 4
Cluster 4 new | 5; 6
Iteration 1, Step 6: Check If Algorithm Can Be Finished
The clustering algorithm ends if all objects are assigned to exactly one cluster. Otherwise, we have to repeat steps 3–6. Here, the objects 1; 2; 5; 6 are not assigned to a cluster. So we have to go through the procedure once again.
Start of Iteration 2
Iteration 2, Step 3: Determining the objects/clusters that are “most similar”
The minimum distance is d(5; 6) = 25.
Iteration 2, Step 4: Assigning the identified objects to a cluster
For that reason, the cars with the IDs 5 and 6 are assigned to a new cluster (see Table 7.16).
Iteration 2, Step 5: Calculating the similarities/distances of the new cluster, update of the similarity/distance matrix
In cluster 4, we find cars 5 and 6. The distance from this cluster to cars 1 and 2 can be determined using Table 7.15. For the distance from car 1, we get:
d(5; 1) = 676
d(6; 1) = 961
The minimum distance is
d({5; 6}; 1) = 676
For the distance from the new cluster to car 2, we get:
Table 7.17 Distance matrix cluster iteration 2 (part 1)
From \ To | 1 | 2 | 3; 4 | 5; 6
1 | 0 | 36 | 210.25 | 676
2 |   | 0 | 72.25 | 400
3; 4 |   |   | 0 | ?
5; 6 |   |   |   | 0

Table 7.18 Distance matrix cluster iteration 2 (part 2)
From \ To | 1 | 2 | 3; 4 | 5; 6
1 | 0 | 36 | 210.25 | 676
2 |   | 0 | 72.25 | 400
3; 4 |   |   | 0 | 121
5; 6 |   |   |   | 0
d(5; 2) = 400
d(6; 2) = 625
The minimum distance is
d({5; 6}; 2) = 400
Table 7.17 shows the new distance matrix so far, but we also have to calculate the distances between the cluster with cars 3; 4 and the new cluster with cars 5 and 6. We use Table 7.15 and find that
d({3; 4}; 5) = 121 and d({3; 4}; 6) = 256
So the distance is
d({3; 4}; {5; 6}) = min(121; 256) = 121
See also Table 7.18.
Iteration 2, Step 6: Check if algorithm can be finished
The algorithm does not end because objects 1 and 2 are not assigned to a cluster.
Iteration 3, 4, and 5
The minimum distance in Table 7.18 is 36. So the objects or cars 1 and 2 are assigned to a new cluster. See Table 7.19.
Table 7.19 Overview of clusters and the objects assigned after iteration 3
Cluster description | Objects/cars that belong to the cluster
Cluster 1 new | 1; 2
Cluster 3 | 3; 4
Cluster 4 | 5; 6

Table 7.20 Distance matrix iteration 3
From \ To | 1; 2 | 3; 4 | 5; 6
1; 2 | 0 | 72.25 | 400
3; 4 |   | 0 | 121
5; 6 |   |   | 0
The distances from the new cluster {1; 2} to the other existing clusters can be determined using Table 7.18. They are
d(1; {3; 4}) = 210.25 and d(2; {3; 4}) = 72.25
So the distance is
d({1; 2}; {3; 4}) = min(210.25; 72.25) = 72.25
Furthermore, the distances from the new cluster {1; 2} to the existing cluster {5; 6} are
d(1; {5; 6}) = 676
d(2; {5; 6}) = 400
So the distance is
d({1; 2}; {5; 6}) = min(676; 400) = 400
Table 7.20 shows the new distance matrix. The remaining iterations are very similar to the ones explained in detail. We show the resulting distance matrices in Tables 7.21 and 7.22. This result is typical for a hierarchical clustering algorithm: in the end, all the objects are assigned to one cluster. Statistical software normally does not provide the huge amount of detail we have presented here, but this example should allow us to understand what happens in the background. Furthermore, we can identify the parameters of clustering algorithms, such as the method to determine the clusters that have to be merged in the next step. For now, we want to pay attention to the so-called “dendrogram” in Fig. 7.10. This was produced with SPSS Statistics. Unfortunately, the SPSS Modeler does not offer this helpful diagram type.
Table 7.21 Distance matrix iteration 4
From \ To | 1; 2; 3; 4 | 5; 6
1; 2; 3; 4 | 0 | 121
5; 6 |   | 0

Table 7.22 Distance matrix iteration 5
From \ To | 1; 2; 3; 4; 5; 6
1; 2; 3; 4; 5; 6 | –
Fig. 7.10 Dendrogram for simple clustering example using the car dataset (objects listed by manufacturer, model, and ID; horizontal axis: rescaled distance cluster combine from 0 to 25; a cutoff line marks a two-cluster solution)
The dendrogram shows us the steps of the cluster algorithm in the form of a tree. The horizontal axis, annotated with “Rescaled Distance Cluster Combine”, represents the minimum distances used in each step for identifying which objects or clusters to combine. The minimum distance (measured as the squared Euclidean distance) between objects 3 and 4 was 0.25 (see Table 7.13). The maximum distance was 121 in iteration 5 (see Table 7.21). If the maximum in the dendrogram is 25, then for the smallest distance we get
25 · 0.25 / 121 = 0.0517
Table 7.23 Rescaled distances for drawing a dendrogram
Distances | 0.25 | 25.00 | 36.00 | 72.25 | 121.00
Rescaled distances | 0.05 | 5.17 | 7.44 | 14.93 | 25.00
Table 7.23 shows all the other values that can be found in the dendrogram in Fig. 7.10. This calculation can also be found in the Microsoft Excel spreadsheet “Car_Simple_Clustering_distance_matrices.xlsx”. "
A dendrogram can be used to show the steps of a hierarchical cluster analysis algorithm. It shows the distance between the clusters, as well as the order in which they are joined. Depending on the horizontal distance between the visualized cluster steps, the researcher can decide how many clusters are appropriate for best describing the dataset.
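A dendrogram like Fig. 7.10 can also be created manually outside the Modeler, e.g., with SciPy. The following sketch is our own illustration; SciPy plots the raw merge distances (0.25, 25, 36, 72.25, 121) instead of the rescaled values of Table 7.23, but the merge order and the interpretation of a cutoff are the same:

```python
# Sketch: single-linkage clustering of the car prices and a dendrogram with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

prices = np.array([[13.0], [19.0], [27.5], [28.0], [39.0], [44.0]])
labels = ["Nissan Versa", "Kia Soul", "Ford F-150",
          "Chevrolet Silverado 1500", "BMW 3 series", "Mercedes-Benz C-class"]

# squared Euclidean distances, merged with the single-linkage (nearest neighbor) rule
Z = linkage(pdist(prices, metric="sqeuclidean"), method="single")
print(Z[:, 2])   # merge distances: 0.25, 25, 36, 72.25, 121

dendrogram(Z, labels=labels, orientation="right")
plt.xlabel("Distance (squared Euclidean)")
plt.show()
```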
Single-Linkage and Other Algorithms to Identify Clusters to Merge
To determine the distance between two objects (or cars) or clusters, we used the squared Euclidean distance measure. Other so-called “metrics” or “proximity measures” are shown in Table 7.5. So far, we have defined how to measure the distance between the objects or clusters. The next question to answer with a clustering algorithm is which objects or clusters to merge. In the example presented above, we used the minimum distance. This procedure is called “single-linkage” or the “nearest neighbor” method. In the distance matrices, we determine the minimum distance and merge the corresponding objects by using the row and the column number. This means that the distance from a cluster to another object “A” equals the minimum distance between each object in the cluster and “A”:
d(object A; {object 1; object 2}) = min(d(object A; object 1); d(object A; object 2))
The disadvantage of single-linkage is that poorly separated groups normally cannot be detected. Furthermore, outliers tend to remain isolated points (see Fig. 7.11). Table 7.24 shows a summary of other algorithms and their characteristics.
Determining the Number of Clusters
The general approach for clustering algorithms can be explained using this more or less simple example. For sure, if we analyze the given dataset in Table 7.11, the results are not a surprise. By analyzing the dendrogram in Fig. 7.10 in comparison to the car prices, the steps of the clustering algorithm make sense; for instance, the price difference between the Ford F-150 and the Chevrolet Silverado 1500 in this dataset is very small. Both cars belong to the same cluster. Note, however, that additional information (e.g., the type of car) is not presented to the algorithm. So the decision is made based only on price.
Fig. 7.11 Example for two clusters with an extremely small distance between them
Fig. 7.12 Elbow criterion (II) to identify the number of clusters
In the case of using a hierarchical cluster algorithm, the number of clusters need not be determined in advance. In the case of an agglomerative algorithm, the clusters will be identified and then merged stepwise. We have to decide how many clusters should be used to describe the data best, however. This situation is similar to a PCA or PFA factor analysis. There also, we had to determine the number of factors to use (see Sect. 6.3). In the dendrogram in Fig. 7.10, the cutoff shows an example using two clusters. A lot of different methods exist to determine the appropriate number of clusters. Table 7.25 shows them with a short explanation. For more details, the interested reader is referred to Timm (2002, pp. 531–533).
Fig. 7.13 Elbow criterion (I) to identify the number of clusters

Table 7.24 Hierarchical clustering algorithms
Single-linkage: Smallest distance between objects is used. All proximity measures can be used. Outliers remain separated—identification is possible. Groups that are close to each other are not separated.
Complete-linkage/maximum distance: Largest distance between objects is used. All proximity measures can be used. Outliers are merged to clusters—identification not possible. Tends to build many small clusters.
Average linkage/average distance/within groups linkage: The average of the object distances between two clusters is used. All possible object distances will be taken into account. All proximity measures can be used.
Centroid/average group linkage/between groups linkage: Similar to average linkage, but the distances between cluster centers are taken into account. The compactness of the clusters is also relevant. Squared Euclidean distance is used.
Median: Similar to centroid, but the cluster centers are determined by the average of the centers, taking the number of objects per cluster into account (the number of objects is known as the weight). Only distance measures can be used.
Ward’s method: Cluster centers are determined. The distances between all objects in each cluster and the center are determined and cumulated. If clusters are merged, then the distance from the objects to the new cluster center increases. The algorithm identifies the two clusters where the distance increment is the lowest. See Bühl (2012, p. 650). Cluster sizes are approximately the same. Only distance measures can be used.
Table 7.25 Methods to determine the number of clusters
Rule of thumb: In reference to Mardia et al. (1979), this rule of thumb approximates the number of clusters by √(number of objects / 2).
Elbow criterion: As explained in Sect. 6.3.2 and Fig. 6.18, the dependency of a classification criterion vs. the number of clusters can be visualized in a 2D-chart. In the case of PCA/PFA, this is called a scree plot. In cluster analysis, the sum of squares of distances/errors can be used at the vertical axis. In reverse, the percentage of variance/information explained can be assigned to the vertical axis. See Figs. 7.12 and 7.13. An example can be found in Exercise 5 of Sect. 7.4.3.
Silhouette charts: To measure the goodness of a classification, the silhouette value S can be calculated:
1. Calculation of the average distance from the object to all objects in the nearest cluster.
2. Calculation of the average distance from the object to all objects in the same cluster the object belongs to.
3. Calculation of the difference between the average distances (1)–(2).
4. This difference is then divided by the maximum of both those average distances. This standardizes the result.
S = (avg. dist. to nearest cluster − avg. dist. to objects in same cluster) / (maximum of both averages above)
More details can be found in Struyf et al. (1997, pp. 5–7). S can take on values between −1 and +1. If S = −1, the object is not well classified. If S = 0, the object lies between two clusters, and if S = +1 the object is well classified. The average of the silhouette values of all objects represents a measure of goodness for the overall classification. As outlined in IBM (2015b, p. 77) and IBM (2015b, p. 209), the IBM SPSS Modeler calculates a Silhouette Ranking Measure based on the silhouette value. It is a measure of cohesion in the cluster and separation between the clusters. Additionally, it provides thresholds for poor (up to +0.25), fair (up to +0.5), and good (above +0.5) models.
Information criterion approach: Information criteria such as the Bayesian Information Criterion (BIC) or Akaike’s Information Criterion (AIC) are used to determine the appropriate number of clusters. For more details, see Tavana (2013, pp. 61–63).
In our example, the rule of thumb tells us that we should analyze clustering results with
√(number of objects / 2) = √(6 / 2) ≈ 2 (clusters)
Later on we will also use the Silhouette plot of the SPSS Modeler.
In cases using a hierarchical clustering algorithm, specifying the number of clusters to determine is unnecessary. The maximum compression of the information included in the data can be found, if all objects belong to one cluster. The maximum accuracy can be realized, by assigning each object to a separate cluster.
"
To find the optimal number of clusters, different rules of thumb exist. The SPSS Modeler provides the Silhouette chart as a measure of cohesion in the cluster and a measure of separation between the clusters.
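The average silhouette value described in Table 7.25 can also be computed outside the Modeler, e.g., with scikit-learn. The following sketch is our own illustration on hypothetical data; the Modeler reports its own Silhouette Ranking Measure instead:

```python
# Sketch: average silhouette value for a given cluster assignment with scikit-learn.
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),     # hypothetical data, two groups
               rng.normal(6, 1, size=(50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("Average silhouette:", round(silhouette_score(X, labels), 3))
print("First five object silhouettes:", np.round(silhouette_samples(X, labels)[:5], 3))
```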
Interpretation of Clustering Results
As mentioned above, in the example presented here we should analyze the results for the two clusters determined by the algorithm. It is important to note that the clustering methods do not provide any help with finding an appropriate description for each cluster. The researcher has to figure out by herself/himself how to best describe each cluster. Looking at the results in Table 7.21, or in the dendrogram of Fig. 7.10, we can see that the cars with IDs 1; 2; 3; 4 are assigned to one cluster and cars 5; 6 to another cluster. The former cluster can probably be described as “cheap or moderately priced cars” and the latter cluster with two cars can be described as “luxury cars”.
Disadvantages of Hierarchical Clustering Methods
A lot of distances must be calculated in order to find clusters that can be merged. Furthermore, the distances are not determined globally; if one object is assigned to a cluster, it cannot be removed and assigned to another one. This is because the distances within the clusters are irrelevant.
7.3.2 Characteristics of the TwoStep Algorithm
As mentioned in the introduction to the clustering algorithms in Sect. 7.1, and especially summarized in Table 7.6, TwoStep is an implementation of a hierarchical clustering algorithm. It tries to avoid the difficulties caused by huge datasets and the necessity to compare the objects pairwise. Measuring their similarity, or determining their distance, is time-consuming and requires excellent memory management. In the so-called “pre-clustering”, TwoStep uses a tree to assign each object to a cluster. The tree is built by analyzing the objects one-by-one. If the tree grows and exceeds a specific size, the tree is internally reorganized. So the procedure can handle a huge number of objects. The disadvantage of pre-clustering is that the assignment of the objects to the (smaller) clusters is fixed. Based on the characteristics of hierarchical clustering methods, we can easily understand that the result from TwoStep also depends on the order of the objects in the dataset. In the second step, the predefined clusters are the basis for a hierarchical clustering algorithm. This is feasible because the number of pre-clusters is much smaller than the sample size of the original data. The clusters are merged stepwise. The Modeler offers the Euclidean distance or the log-likelihood distance to determine the dissimilarity of objects or clusters. For the log-likelihood method, continuous variables are assumed to be normally distributed. The Euclidean distance measure often leads to inappropriate clustering results. Details can be found in the solution to Exercise 3 in Sect. 7.4.4.
As explained in Sect. 4.7, “Contingency Tables”, a chi-square distribution can be used to determine the probability that the observed frequencies in a contingency table come from a specific distribution. The log-likelihood distance measure is a probability-based distance, which uses a similar approach. See IBM (2015a, pp. 398–399) for details. The decrease in the likelihood caused by combining two clusters is a measure of the distance between them. Mathematical details can be found in IBM (2015a, pp. 397–401).
Advantages
– TwoStep can deal with categorical and metrical variables at the same time. In these cases, the log-likelihood measure is used.
– By using the Bayesian Information Criterion or Akaike’s Information Criterion, the TwoStep algorithm implemented in the SPSS Modeler determines the “optimal” number of clusters.
– The minimum or maximum number of clusters that come into consideration can be defined.
– The algorithm does not tend to produce clusters of approximately the same size.
Disadvantages
– TwoStep clustering assumes that continuous variables are normally distributed if log-likelihood estimation is used. Transformation of the input variables is necessary. The log transformation can be used if all input variables are continuous.
– The clustering result depends on the order of the objects in the dataset. Reordering might lead to other results.
– TwoStep results are relatively sensitive to mixed-type attributes in the dataset. Different scales or codes for categorical variables can result in different clustering results. See, e.g., Bacher et al. (2004, pp. 1–2).
TwoStep is a hierarchical clustering algorithm that uses pre-clustering to assign objects to one of the smaller clusters. Larger datasets can be used because of this. The pre-assignment of the objects is not revised in the following steps, however, and the order of the objects in the dataset influences the clustering result.
"
Important assumptions when using TwoStep clustering are that continuous variables are assumed to be normally distributed, and categorical variables are multinomially distributed. Therefore, they should be transformed in advance. Often the log transformation of continuous values can be used.
"
Alternatively, the Euclidean distance measure can be used, but it often produces unsatisfying results.
7.3.3 Building a Model in SPSS Modeler
In Sect. 7.3.1, we discussed the theory of agglomerative clustering methods using the single-linkage algorithm. Additionally, we learned in the previous section that the TwoStep algorithm is an improved version of an agglomerative procedure. Here, we want to use TwoStep to demonstrate its usage for clustering objects in the very small dataset “car_simple”. We know the data from the theoretical section. They are shown in Table 7.11. For more details see also Sect. 12.1.5. First we will build the stream and discuss the parameters of the TwoStep algorithm, as well as the results.
Description of the model
Stream name: car_clustering_simple
Based on dataset: car_simple.sav
Stream structure
Related exercises: All exercises in 7.3.4
Creating the Stream
1. We open the “Template-Stream_Car_Simple”. It gives us access to the SPSS dataset “car_simple.sav” (Fig. 7.14).
Fig. 7.14 Stream “Template-Stream_Car_Simple”
Fig. 7.15 Dataset “car_simple.sav” used in “Template-Stream_Car_Simple”
2. Then we check the data by clicking on the Table node. We find the records as shown in Fig. 7.15. These cars will be the basis for our clustering approach.
3. Checking the variables in Fig. 7.15, we can see that the ID can’t be used to cluster the cars. That’s because the ID does not contain any car-related information. The variables “manufacturer”, “model”, and “dealer” are nominally scaled. Theoretically, we could use these variables with the binary coding procedure and the Tanimoto similarity measure explained in Sect. 7.1. So the type of the variables is not the reason to exclude them from clustering, but the variables “manufacturer” as well as “model” and “dealer” do not provide any valuable information. Only a specialist could cluster the cars based on some of these variables, using knowledge about the reputation of the manufacturer and the typical price ranges of the cars. So the only variable that can be used here for clustering is the price. We should now check the scale of measurement.
4. Using the Type node, we can see that all the variables are already defined as nominal, except the price with its continuous scale type (Fig. 7.16). Normally, we could define the role of the variables here and exclude the first three, but we recommend building transparent streams, and so we will use a separate node to filter the variables. We add a Filter node to the stream, right behind the Type node (see Fig. 7.17). We find out that, apart from the variable “price”, no other variable can be used for clustering. As we know from the theoretical discussion in Sect. 7.1 though, a clustering algorithm provides only cluster numbers. The researcher must find useful descriptions for each cluster, based on the characteristics of the assigned objects.
Fig. 7.16 Defined scale types of variables in dataset “car_simple.sav”
Fig. 7.17 A Filter node is added to the template stream
So any variable that helps us to identify the objects should be included in the final table alongside the cluster number determined by the algorithm. In our dataset, the manufacturer and the model of the car are probably helpful. The name of the dealer does not give us any additional input, so we should exclude
Fig. 7.18 Filtered variables
Fig. 7.19 Stream with added TwoStep node
this variable. We exclude it by modifying the Filter node parameter, as shown in Fig. 7.18. 5. For the next step, we add a TwoStep node from the Modeling tab of the SPSS Modeler. We connect this node to the Filter node (see Fig. 7.19). So the only input variable for this stream is the price of the cars. Only this information should be used to find clusters.
Fig. 7.20 Definition of the variables used in the TwoStep node
6. We double-click on the TwoStep node. In the Fields tab, we can choose to use variables based on the settings in the Type node of the stream. Here, we will add them manually. To do this, we use the button on the right marked with an arrow in Fig. 7.20. "
For transparency reasons, we recommend adding a Filter node behind the Type node of the stream, when building a stream to cluster objects. All variables that do not contribute any additional information, in terms of clustering the objects, should generally be excluded.
"
The researcher should keep in mind, however, that the clusters must be described based on the characteristics of the objects assigned to them. It is therefore helpful not to filter out the object IDs and names, even if they are not used in the clustering procedure itself.
"
The variables used for clustering should be determined in the clustering node, rather than using the Type node settings.
7. In the Model tab, we can find other parameters as shown in Fig. 7.21. We will explain them in more detail here. By default, numeric fields are standardized. This is very important for the cluster procedure. Different scales or different coding of input variables result in
Fig. 7.21 Options in the TwoStep node
very different magnitudes of the attribute values. To make the values comparable, they must be standardized. We outlined the z-standardization in Sect. 2.7.6. Here, this method is automatically activated and should be used. Using the option "Cluster label", we can define whether a cluster label is a string or a number. See the arrow in Fig. 7.21. Unfortunately, there is a bug in the Modeler software of version 17 (and probably earlier versions): if the option "Cluster label" is changed to "Number", then the clustering result also changes for no apparent reason. We have reported this bug, and it will be fixed in the following releases. For now, we do not recommend using this option, even though it would make handling cluster numbers easier. "
The option "Cluster label" must be used carefully, since the clustering results can change for no apparent reason when it is set to "Number". We recommend using the option "String" instead.
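For readers who want to reproduce the z-standardization mentioned above outside the Modeler, the following minimal Python sketch shows the idea. The price values are invented for illustration; in the stream, the Modeler performs this step automatically when standardization of numeric fields is active.

```python
# Minimal sketch of z-standardization (Sect. 2.7.6) outside the Modeler.
# The prices are hypothetical example values.
import numpy as np

price = np.array([12000.0, 15500.0, 9900.0, 31000.0, 42000.0, 18500.0])
z = (price - price.mean()) / price.std()  # result has mean 0 and standard deviation 1
print(z.round(2))
```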
The advantage of TwoStep implementation in the Modeler is that it tries to automatically find the optimal number of clusters. For technical details, see IBM (2015b, p. 202). After the first trial, we will probably have to modify the predefined values by using the methods outlined in Table 7.25.
The SPSS Modeler offers the log-likelihood or the Euclidean distance measure. In the case of the log-likelihood measure, the variables have to be assumed to be independent, and continuous variables are additionally assumed to be normally distributed. Therefore, we would have to use an additional Transform node in advance. The Euclidean distance, on the other hand, can only be calculated for continuous variables (see also Table 7.5), so we recommend using the log-likelihood distance here. As also mentioned in Table 7.25, the last option, "Clustering criterion", relates to the automatic detection of the number of clusters. Regarding the cluster labels, numbers would normally be easier to handle, but as noted above, the clustering result changes (or is sometimes wrong) if we use the option "Number", so we suggest keeping "String". We do not have to modify the options in the TwoStep node yet. We neither need an outlier exclusion, nor should we try to determine the number of clusters ourselves. We click on "Run" to start the clustering.
8. Unfortunately, we get the message "Error: Too few valid cases to build required number of clusters". Because of the very small sample size of six records, the automatic detection in the TwoStep node cannot determine the number of clusters. Here, we use the rule of thumb mentioned in Table 7.5 and used in Sect. 7.3.1. The simple calculation

$\sqrt{\frac{\text{number of objects}}{2}} = \sqrt{\frac{6}{2}} \approx 2$

tells us to try a fixed number of two clusters. We modify the options in the TwoStep node as shown in Fig. 7.22 and run the node once again with these new parameters.
9. We get a new model nugget as shown in Fig. 7.23.
10. Before we start to show the details of the analysis, we finish the stream by adding a Table node behind the model nugget node. Figure 7.24 shows the final stream. "
Using the TwoStep algorithm, the following steps are recommended
1. The scale types defined in the Type node are especially important for the cluster algorithms, because the usage of the distance measure (log-likelihood or Euclidean), for example, depends on these definitions.
2. In a Filter node, variables can be excluded that are unnecessary for the clustering itself. The object ID (and other descriptive variables) should not be filtered, because they can help the user to identify the objects.
3. For transparency reasons, it is optimal to select the variables used for clustering directly in the clustering node itself, rather than defining the variable roles in the Type node.
4. Standardization of numerical variables is recommended.
Fig. 7.22 Options in the TwoStep node, with two clusters specified
Fig. 7.23 TwoStep cluster model nugget node is added to the stream
5. An important assumption, when using TwoStep clustering (with the log-likelihood distance measure), is that continuous variables are normally distributed. Therefore, they should be transformed by using a Transform node in advance.
Fig. 7.24 Final TwoStep cluster stream for simple car example
6. The algorithm tries to identify the optimal number of clusters. The user should first run the node with this option enabled. If this fails, the methods outlined in Table 7.25 should be used to determine the number of clusters manually and to set the minimum and the maximum in the TwoStep node options. The option "Clustering criterion" is related to automatic cluster number detection.
7. The log-likelihood distance measure is recommended. It can also be applied if not all variables are continuous.
8. If the results are unsatisfying or hard to interpret, outlier exclusion can be activated.
Interpretation of the Determined Clusters
To analyze the results of the TwoStep clustering method, we open the Table node connected to the model nugget. Figure 7.25 shows the clusters assigned to each of the cars. The labels in the variable "$T-TwoStep" are determined by the options "Cluster label" and "Label prefix", as shown in Fig. 7.22. A double-click on the model nugget opens a new dialog window with many options. In the model summary of the Model Viewer, on the left in Fig. 7.26, we can see the silhouette measure of cohesion and separation. We explained this in Sect. 7.3.1 and in particular in Table 7.25. A value between fair (+0.25) and good (above +0.5) lets us assume that the clustering was successful. If the number of clusters is reduced, the silhouette value will generally (though not always) decrease, and vice versa. For more details, see Exercise 5 in Sect. 7.4.3. "
The silhouette value helps to assess the goodness of a classification. It measures, for each object, the average distance to the other objects belonging to the same cluster, as well as the average distance to the objects of the other clusters. The values range from -1 to +1. A value above +0.5 indicates a (quite) good model.
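As an illustration of this definition, the following Python sketch computes silhouette values for a small, made-up clustering with scikit-learn. The data and labels are invented; the Modeler reports an analogous measure in the Model Viewer.

```python
# Hedged sketch: silhouette values for a given cluster assignment.
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[12.0], [13.5], [11.8], [14.2], [40.0], [43.5]])  # e.g., prices in thousands
labels = np.array([1, 1, 1, 1, 2, 2])                           # assigned cluster per object

print(silhouette_samples(X, labels))  # one value per object, between -1 and +1
print(silhouette_score(X, labels))    # average silhouette of the clustering
```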
Fig. 7.25 Cluster assigned to the cars
Fig. 7.26 Model summary of a TwoStep node
Table 7.26 Silhouette measure with full precision, copied with the option "Copy Visualization Data"
Category: 1
Silhouette measure of cohesion and separation: 0.7411
Fig. 7.27 Detailed cluster profiles in the model viewer
"
The SPSS Modeler shows the silhouette value in the model summary of the Model Viewer on the left. Moving the mouse over the diagram is one way to get the silhouette value. This is depicted in Fig. 7.26 in the middle. To get a more precise result, one can use the option “Copy Visualization Data”. To do this, the second button from left in the upper part of the Model Viewer should be clicked. Then the copied values must be pasted using simple word processing software. Table 7.26 shows the result.
On the right in Fig. 7.26, the Modeler shows us that we have two clusters, with four and two elements, respectively. The ratio of their sizes is therefore two. In the left corner of Fig. 7.26, we select the option "Clusters" from the drop-down list, instead of the "Model Summary". We can then analyze the clusters as shown in Fig. 7.27. By selecting cluster 1 (marked with an arrow in Fig. 7.27) on the left, we will get a more detailed output. In the drop-down list on the right, field "View", we can choose "Cell Distribution".
Fig. 7.28 Input variable distribution per cluster on the left
By selecting the cluster on the left, we can see the frequency distribution of the objects on the right. Another valuable analysis is offered in the model viewer if we use the symbols in the left window below the clustering results. We can activate the "cell distribution" button, marked with an arrow in Fig. 7.28. The distribution of each variable in the clusters then appears above. Focusing once more on Fig. 7.25, we can see that the clustering result is exactly the same as the result achieved in Sect. 7.3.1. Comparing the results of the manual calculation with the visualization in Fig. 7.10, in the form of a dendrogram, we see that here too, cars 1–4 are assigned to one cluster and cars 5 and 6 to another one.
Summary
With the TwoStep algorithm, we used a small dataset to find clusters. We did this in order to show that the result is the same as the one we obtained with the "single-linkage method" in the theory section. Unfortunately, we had to specify the number of clusters in advance; normally, this is not necessary when using the TwoStep algorithm. After determining "segments" of the objects, the silhouette plot helps to assess the quality of the model in general. Furthermore, we learned how to analyze the different clusters step-by-step, using the different options of the model viewer.
7.3.4 Exercises
Exercise 1: Theory of Cluster Algorithms
1. Outline why clustering methods, e.g., TwoStep, belong to the category of "unsupervised learning algorithms".
2. In Sect. 2.7.7 "Partitioning dataset", we explained how to divide a dataset into training and test partitions. In principle, a validation subset could also be separated from the original data before starting the training. Show that it is unnecessary to use this method for the TwoStep algorithm; in fact, it should be avoided.
3. We explained the meaning of the silhouette chart in Sect. 7.3.1, and particularly in Table 7.25. A value between fair (+0.25) and good (above +0.5) lets us assume that the clustering was successful. Explain how this chart, or rather the theory this chart is based on, can be used to determine the optimal number of clusters to use.
Exercise 2: IT User Satisfaction Based on PCA Results
In a survey, the satisfaction of IT users with an IT system was determined. Questions were asked relating to several characteristics, such as "How satisfied are you with the amount of time your IT system takes to be ready to work (from booting the system to the start of daily needed applications)?" The users could rate the aspects using the scale "1 = poor, 3 = fair, 5 = good, 7 = excellent". See also Sect. 12.1.24. You can find the results in the file "IT user satisfaction.sav". In Sect. 6.3.3, Exercise 3, we used the results of a PCA to determine technical and organizational satisfaction indices. Now the users should be divided into groups based on their satisfaction with both aspects.
1. Open the stream "pca_it_user_satisfaction.str" and save it under another name.
2. In the lower sub-stream, the indices for technical and organizational satisfaction are calculated based on the PCA results. Open the 2D-plot created by the Plot node and interpret the chart. Determine the number of possible clusters you expect to find.
3. Using a TwoStep node, determine clusters of users based on the indices. Don't forget to add a Type node first. Explain your findings.
4. Optional: The TwoStep algorithm assumes normally distributed continuous values. We explained in Sect. 3.2.5 how to assess and transform variables to meet this assumption. The stream created above can now be modified so that transformed PCA results are used for clustering. Outline your findings.
Exercise 3: Consumer Segmentation Using Habits
Using the options "vegetarian", "low meat", "fast food", "filling", and "hearty", consumers were asked "Please indicate which of the following dietary characteristics describe your preferences. How often do you eat . . .". The respondents had the chance to rate their preferences on a scale of "(very) often", "sometimes", and "never". The variables are coded as follows: "1 = never", "2 = sometimes", and "3 = (very) often". They are ordinally scaled. See also Sect. 12.1.28.
1. The data can be found in the SPSS Statistics file "nutrition_habites.sav". The Stream "Template-Stream_nutrition_habits" uses this dataset. Please open the
template stream and make sure the data can be loaded. Save the stream with another name. 2. Use a TwoStep clustering algorithm to determine consumer segments. Assess the quality of the clustering, as well as the different groups of consumers identified. Use a table to characterize them. 3. The variables used for clustering are ordinally scaled. Explain how proximity measures are used to deal with such variables in clustering algorithms. Remark Please note that in Exercise 2 in Sect. 7.5.3 and also in the solution in Sect. 7.5.4, an alternative Kohonen and K-Means model will be presented using the Auto Cluster node.
7.3.5 Solutions
Exercise 1: Theory of Cluster Algorithms
1. Unsupervised learning methods do not need a target field to learn how to handle objects. In contrast to the tree methods, for example, these algorithms try to categorize the objects into subgroups by determining their similarity or dissimilarity.
2. First of all, in clustering we find more than one correct solution for the segmentation of the data. So different parameters, e.g., proximity measures or clustering methods, should be used and their results should be compared. This is also a "type of validation", but there are many reasons not to divide the original dataset when using TwoStep:
(a) Partitioning reduces the information presented to the algorithm for finding "correct" subgroups of objects. Reducing the number of records in cases of small samples will often lead to worse clustering results. We recommend using the partitioning option only in cases of huge sample size.
(b) Clusters are defined by inspecting each object and measuring the distance from or similarity to all other objects. If objects are excluded from clustering, by separating them in a test or validation partition, the algorithm will not take them into account and will not assign any cluster number to these objects.
(c) TwoStep does not produce a "formula" for how to assign objects to a cluster. We can of course assign a completely new object to one of the clusters, but this process is based only on our "characterisation" of the clusters, using our knowledge of the background of the data.
3. To measure the goodness of a classification, we determine the average distance from each object to the points of the cluster it is assigned to and the average distance to the other clusters. The silhouette plot shows this measure. If we assume that the silhouette is a measure of the goodness of the clustering, we can compare different models using their silhouette values. For instance, we can modify the number of clusters to determine in a TwoStep model, and then use the model with the highest silhouette value. This method is used by the Auto Cluster node, as we will see later.
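A rough sketch of this idea in Python, using scikit-learn's K-Means on synthetic data (not the Modeler's TwoStep implementation), could look as follows. With two well-separated point clouds, k = 2 should come out on top, mirroring the logic the Auto Cluster node applies.

```python
# Fit K-Means for several candidate numbers of clusters and keep the
# solution with the highest average silhouette value (synthetic data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```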
Exercise 2: IT User Satisfaction Based on PCA Results
Name of the solution stream: clustering_pca_it_user_satisfaction
Theory discussed in: Sect. 6.3 for the PCA theory; Sect. 6.3.3, Exercise 3 for the calculation of the indices based on the PCA results; Sect. 7.3.2 for the TwoStep algorithm; Sect. 3.2.5 for the transformation toward normal distribution
1. We opened the stream "pca_it_user_satisfaction.str" and saved it under the name "clustering_pca_it_user_satisfaction.str". Figure 7.29 shows the stream.
2. We open the 2D-plot created by the Plot node and marked with an arrow in Fig. 7.29. As we can see in Fig. 7.30, the scatterplot shows a cloud with only a few clearly separated points. The points are arranged along the bisecting line of the diagram. So in the lower left-hand corner, we can find the users who are less satisfied, and in the upper right-hand corner, the users who are satisfied. In this exercise, we use a clustering algorithm to assign the users automatically to these "satisfaction groups". We expect to have two or three groups. Outliers on the left and on the right will probably be separated.
3. We must add a Type node before we add a TwoStep node to the stream, to determine clusters of users based on the indices. Figure 7.31 shows the settings in the Fields tab of the TwoStep node. Only the technical and organizational satisfaction indices are used to determine the clusters. We do not modify other parameters in this node. We run the TwoStep node and get the model nugget. Figure 7.32 shows the last sub-stream. Finally, we add a Plot node to visualize the result of the TwoStep clustering. Figure 7.33 shows the parameters of
Fig. 7.29 Stream "clustering_pca_it_user_satisfaction"
Fig. 7.30 Organizational vs. technical satisfaction indices
Fig. 7.31 TwoStep node settings
Fig. 7.32 Clustering sub-stream is added
Fig. 7.33 Parameter of Plot node shown in Fig. 7.32
Fig. 7.34 Clustered users by their satisfaction indices
the Plot node. Here, we used the option "Size" to show the cluster number. Of course, the option "Color" can also be used. As expected, Fig. 7.34 shows two subgroups of users, and each user is assigned to exactly one "satisfaction group". A more detailed analysis in the Model Viewer tells us that the two segments have exactly the same size, each containing 50% of the users. If we needed to generate a unique user ID for each record, we could do that by adding a Derive node with the @INDEX function.
4. The solution for this part of the exercise can be found in the stream "cluster_it_user_satisfaction_transformed". The stream will not be explained in detail here. The reader is referred to Sect. 3.2.5 "Transform node and SuperNode". As explained in Sect. 3.2.5, the Transform node should be used to assess the normality of the data and to generate Derive nodes to transform the values. We add a Transform node. Assessing the data, we find that using a Log transformation could help to move the distributions toward normality. We generate a SuperNode for these transformations. A Shapiro-Wilk test would show whether or not the
Fig. 7.35 Clustered users by their satisfaction indices, based on transformed data
original, or rather the transformed, variables are normally distributed. The transformation helps to improve the quality though, as we will see when assessing the clustering results. We add a TwoStep node and use the two transformed variables to cluster the users according to their level of satisfaction. Plotting the original (and not the transformed) variables against each other in a Plot node, we get the result shown in Fig. 7.35. The algorithm can separate the groups, but the result is unsatisfying. The two users with technical satisfaction = 5 at the bottom belong more to the cluster in the middle than to the cluster on the left. If we restrict the number of clusters produced by TwoStep to exactly two, we get the solution depicted in Fig. 7.36. Here too, 50% of the users are assigned to each of the clusters. This solution shows that, for the given dataset, the automatic identification of the number of clusters produced more segments than appropriate. Generally though, the algorithm works fine on these skewed distributions.
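The transformation step can be sketched outside the Modeler as follows; the skewed index values are simulated, and the Shapiro-Wilk test from SciPy stands in for the normality check discussed above (in the stream, this is done by the Transform node and the generated Derive nodes).

```python
# Log-transform a right-skewed variable and check normality with a
# Shapiro-Wilk test; the satisfaction index values are simulated.
import numpy as np
from scipy import stats

index = np.random.default_rng(7).lognormal(mean=1.0, sigma=0.6, size=200)
log_index = np.log(index)

print("raw:        W=%.3f, p=%.4f" % stats.shapiro(index))
print("log-scaled: W=%.3f, p=%.4f" % stats.shapiro(log_index))
```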
Fig. 7.36 Two users clusters, determined from transformed data
Exercise 3: Consumer Segmentation Using Dietary Habits
Name of the solution stream: cluster_nutrition_habits
Theory discussed in: Sect. 7.3.2
Remarks
The TwoStep algorithm assumes that non-continuous variables are multinomially distributed. We do not verify this assumption here. For the dependency between the silhouette measure and the number of clusters determined by TwoStep or K-Means, see also the solution to Exercise 5 in Sect. 7.4.4. The solution can be found in the Microsoft Excel file "kmeans_cluster_nutrition_habits.xlsx". Please note that in Exercise 2 in Sect. 7.5.3, and also in the solution in Sect. 7.5.4, an alternative Kohonen and K-Means model will be presented using the Auto Cluster node.
Fig. 7.37 Parameters of the Filter node
1. We open the stream "Template-Stream_nutrition_habits" and save it under the new name "cluster_nutrition_habits.str".
2. Before we start to cluster, we have to understand the settings in the stream. For this, we open the Filter node and the Type node to show the scale types (see Figs. 7.37 and 7.38). Here, it can be helpful to enable the ID in the Filter node, so that the consumers can later be identified together with their assigned cluster numbers. We add a TwoStep node to the stream. To be sure the correct variables are used for the cluster analysis, we add them in the Fields tab of the TwoStep node (see Fig. 7.39). Additionally, we must make sure that the ordinal variables are standardized in the Model tab, as shown in Fig. 7.40. Running the TwoStep node, we get the final stream as shown in Fig. 7.41. The advantage of using a TwoStep node is that it determines the number of clusters automatically. Double-clicking the model nugget, we get the model summary in the Model Viewer, as in Fig. 7.42. The quality of the clustering is good, based on the silhouette plot. As described in Sect. 7.3.3, we get a more precise silhouette value by using the option "Copy Visualization Data", also highlighted with an arrow in Fig. 7.42. Here, the silhouette value is 0.7201.
Fig. 7.38 Scale type settings in the Type node
Fig. 7.39 Fields tab in the TwoStep node
Fig. 7.40 Model tab in the TwoStep node
Fig. 7.41 Final stream “cluster_nutrition_habits”
Using the option “clusters” from the drop-down list in the left-hand corner of the Model Viewer in Fig. 7.42 and then the option “cells show absolute distribution”, we get the frequency distribution per variable and cluster, as shown in Fig. 7.43. Table 7.27 shows a short assessment of the different clusters. Cluster 4 is particularly hard to characterize. The TwoStep algorithm should probably be used to determine only four clusters.
Fig. 7.42 Summary of TwoStep clustering in the Model Viewer
Fig. 7.43 Detailed assessment of the clusters
Table 7.27 Characterization of clusters
Cluster-1: Avoids vegetarian food but does not prefer lots of meat
Cluster-2: Respondents who eat hearty, filling, and fast food, but sometimes vegetarian
Cluster-3: Mainly vegetarian food preferred, but also eats "low meat"
Cluster-4: Avoids vegetarian, meat, and fast food. Sometimes eats "low meat"
Cluster-5: Sometimes vegetarian and sometimes also "low meat"
3. In this case, the proximity measure is a similarity measure. Ordinal variables are internally recoded into several binary variables. For details, see Exercise 2 in Sect. 7.2.2. To determine the similarity between these binary variables, the Tanimoto coefficient can be used. For details see Exercise 3 in Sect. 7.2.2, as well as the explanation in Sect. 7.1.
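For illustration, a small Python sketch of the Tanimoto coefficient for two binary-coded respondents is shown below. It assumes the usual definition of the coefficient for binary vectors: the number of shared 1s divided by the number of attributes present in either vector; the example answers are invented.

```python
# Tanimoto (Jaccard-type) similarity of two binary vectors: shared 1s
# divided by (1s in x + 1s in y - shared 1s).
import numpy as np

def tanimoto(x, y):
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    shared = np.sum(x & y)
    return shared / (x.sum() + y.sum() - shared)

a = [1, 0, 0, 0, 1, 0, 1, 0, 0]  # consumer A, binary-coded answer categories
b = [1, 0, 0, 0, 0, 1, 1, 0, 0]  # consumer B, binary-coded answer categories
print(tanimoto(a, b))            # 2 / (3 + 3 - 2) = 0.5
```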
7.4 K-Means Partitioning Clustering
7.4.1 Theory
In hierarchical clustering approaches, the distance between all objects must be determined to find the clusters. As described in Sect. 7.3.2, TwoStep tries to avoid the difficulties of handling a large number of values and objects by using a tree to organize the objects based on their distances. Additionally, TwoStep results are relatively sensitive to mixed-type attributes in the dataset. Dealing with large datasets in cluster analysis is challenging. With the K-Means algorithm, it is not necessary to analyze the distances between all objects. For this reason, the algorithm is often used. In this section, we will outline the theory K-Means is based upon, as well as the advantages and disadvantages that follow from it. The steps of the K-Means algorithm can be described as follows (see also IBM (2015a), pp. 227–232):
1. The user specifies the number of clusters k.
2. Metrical variables are transformed to have values between 0 and 1, using the following formula:

$x_{i,\text{new}} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$
Nominal or ordinal variables (also called symbolic fields) are recoded, using binary coding as outlined in Sect. 7.2.1, Exercise 2 and especially in Table 7.9. Additionally, the SPSS Modeler uses a scaling factor to avoid having the variables overweighed in the following steps. For details see also IBM (2015a),
p. 227–228. Normally, the factor equals the square root of 0.5, i.e., approximately 0.70711, but the user can define his/her own values in the Expert tab of the K-Means node.
3. The k cluster centers are defined as follows (see IBM 2015a, p. 229):
(a) The values of the first record in the dataset are used as the initial cluster center.
(b) Distances are calculated from all records to the cluster centers defined so far.
(c) The values from the record with the largest distance to all existing cluster centers are used as a new cluster center.
(d) The process stops when the number of clusters equals the number predefined by the user, i.e., when k cluster centers are defined.
4. The squared Euclidean distance (see Tables 7.5 and 7.10) between each record (object) and each cluster center is calculated. The object is assigned to the cluster center with the minimal distance.
5. The cluster centers are updated, using the "average" of the objects assigned to each cluster.
6. This process stops if either a maximum number of iterations is reached or there is no change in the recalculated cluster centers. Instead of "no change in the cluster centers", the user can define another threshold for the change that will stop the iterations.
"
K-Means is a clustering algorithm suitable for large datasets in particular. First, k cluster centers are determined and then the objects are assigned to the nearest cluster center. The number of clusters k must be defined by the user. In the following iterations, the clustering is improved by assigning objects to other clusters and updating the cluster centers. The process stops if a maximum number of iterations is reached, there is no change in the clusters, or the change is smaller than a user-defined threshold.
"
The Auto Cluster node also uses K-Means clustering. Here, the number of clusters does not have to be determined in advance. Based on several goodness criteria, as defined by the user, "the best" model will be determined.
Advantages
– Each cluster consists of at least one item.
– The clusters do not overlap.
– Using large datasets, K-Means is faster than hierarchical algorithms. K-Means may still be usable when other algorithms crash because of insufficient memory.
Disadvantages
– The number of clusters k must be defined by the user.
– It tends to produce clusters of approximately the same size.
– Finding an appropriate number of clusters is difficult.
– The result depends to some extent on the order of the objects in the dataset. This is also because the first record is used as the initial cluster center.
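To make the procedure more concrete, the following Python sketch rescales two metrical fields to [0, 1] and fits k cluster centers with scikit-learn. Note that scikit-learn's default k-means++ initialization differs from the farthest-point scheme described above, so the resulting cluster numbering need not match the Modeler's; the data is synthetic.

```python
# Min-max rescaling followed by K-Means with k = 4 (synthetic data).
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(20, 65, 300),   # e.g., age
                     rng.uniform(0, 40, 300)])   # e.g., years employed

X01 = MinMaxScaler().fit_transform(X)            # (x - x_min) / (x_max - x_min)

km = KMeans(n_clusters=4, n_init=10, max_iter=100, random_state=0).fit(X01)
print(km.cluster_centers_)       # centers in the rescaled [0, 1] space
print(np.bincount(km.labels_))   # number of objects per cluster
```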
7.4.2 Building a Model in SPSS Modeler
Customer segmentation, or market segmentation, is one of the main fields where cluster analysis is often used. A market is divided into subsets of customers who have typical characteristics. The aim of this segmentation is to identify target customers or to reduce risks. In the banking sector, this technique is used to improve the profitability of the business and to avoid risks. If a bank can identify customer groups with lower or higher risk of default, then it can define better rules for money lending or credit card offers. We want to apply the K-Means algorithm for customer segmentation purposes here too. The dataset comes from the IBM Website (2014). See Sect. 12.1.9.
Description of the model
Stream name: customer_bank_segmentation_K_means
Based on dataset: customer_bank_data.csv
Stream structure: see the stream built in this section (final version in Fig. 7.66)
Related exercises: All exercises in Sect. 7.4.3
Creating the Stream 1. We open the “Template-Stream_Customer_Bank”. It includes a Variable File node, a Type node to define the scale types, and a Table node to show the records (Fig. 7.44). 2. To check if the records are imported and to understand the data, let’s open the Table node first (see Fig. 7.45). 3. Now we have to be sure the correct scale types are assigned to the variables, so we open the Type node as shown in Fig. 7.46. There is no need to define the role of the variables here. That’s because we will add the variables we want to use later one-by-one to the clustering node. We think this method is more transparent than using options that have side effects on other nodes. 4. Interpreting the variables shown in Fig. 7.46 and additionally described in detail in Sect. 12.1.9, we can select variables that help us to find subgroups of customers, in terms of risk of default. We assume: (a) “ADDRESS” and “CUSTOMERID” are not helpful and can be excluded.
Fig. 7.44 Stream “Template-Stream_Customer_Bank”
Fig. 7.45 Records of “customer_bank_data.csv”
(b) “DEFAULTED” should be excluded too, because it is the result, and the clustering algorithm is an unsupervised method, which should identify the pattern by itself and should not “learn” or “remember” given facts. (c) “AGE”, “CARDDEBT”, “EDUCATION”, “INCOME”, “OTHERDEBT”, and “YEARSEMPLOYED” are relevant variables for the segmentation. 5. The “EDUCATION” variable is defined as nominal because it is a string, but in fact it is ordinal. To define an order, we use a Reclassify node as described in Sect. 3.2.6. After adding this node from the “Field Ops” tab of the Modeler to the stream (see Fig. 7.47), we can define its parameters as shown in Fig. 7.48. We define here a new variable “EDUCATIONReclassified”. Later in this stream, we have to add another Type node, for assigning the correct scale type to this new variable. For now, we want to check all variables to see if any of them can be used in clustering.
Fig. 7.46 Scale types of variables defined for “customer_bank_data.csv”
Fig. 7.47 Added Reclassify node to the stream
6. Interpreting the potentially useful variables, we can see that some of them are correlated. A customer with a high income can afford to have higher credit card and other debt. Using these original variables does not add much input to the model. Let us first check the correlation coefficients of the variables by using a Sim Fit node, however. We add the node to the stream and execute it. We explained how to use this node in Sect. 4.4. Figure 7.49 shows the current stream. Figure 7.50 shows the correlations.
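A rough pandas equivalent of this correlation check is sketched below; the column names follow the text, but the file path and delimiter may need to be adapted to the local copy of the data.

```python
# Correlation matrix of the debt and income fields (pandas stand-in for
# the Sim Fit node); adjust the path/separator to the local data file.
import pandas as pd

df = pd.read_csv("customer_bank_data.csv")
print(df[["INCOME", "CARDDEBT", "OTHERDEBT"]].corr())
```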
Fig. 7.48 Assigning ordinal scaled values to the variable “EDUCATION”
We can see that the variables "CARDDEBT" and "OTHERDEBT", as well as "INCOME" and "CARDDEBT", and "INCOME" and "OTHERDEBT", are correlated.
7. So it is definitely necessary to calculate a new variable in the form of the ratio of credit card debt and other debt to income. To do so, we add a Derive node from the "Field Ops" tab of the Modeler. Using its expression builder, as outlined in Sect. 2.7.2, we define the parameters as shown in Fig. 7.51. The name of the new variable is "DEBTINCOMERATIO" and the formula is "(CARDDEBT + OTHERDEBT)/INCOME * 100". So "DEBTINCOMERATIO" equals the customer's total debt as a percentage of income.
8. To assign the correct scale types to the reclassified educational characteristics in "EDUCATIONReclassified" and the new derived variable "DEBTINCOMERATIO", we must add another Type node at the end of the stream. We define "DEBTINCOMERATIO" as continuous and "EDUCATIONReclassified" as ordinal (see Figs. 7.52 and 7.53).
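The same derivation can be sketched in pandas as follows. The DEBTINCOMERATIO formula is taken from the stream, while the ordinal mapping for "EDUCATION" is purely hypothetical, since the actual category labels are defined in the Reclassify node (Fig. 7.48) and are not listed here.

```python
# Derive the debt-income ratio and a hypothetical ordinal recoding of EDUCATION.
import pandas as pd

df = pd.read_csv("customer_bank_data.csv")
df["DEBTINCOMERATIO"] = (df["CARDDEBT"] + df["OTHERDEBT"]) / df["INCOME"] * 100

# hypothetical order - replace with the labels actually present in the data
education_order = {"basic": 1, "secondary": 2, "bachelor": 3, "master": 4}
df["EDUCATIONReclassified"] = df["EDUCATION"].map(education_order)

print(df[["DEBTINCOMERATIO", "EDUCATIONReclassified"]].head())
```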
Fig. 7.49 Reclassify and Sim Fit nodes are added to the stream
Fig. 7.50 Correlation of metrical variables included in “customer_bank_data.csv”
9. Now we have finished the preliminary work and are ready to cluster the customers. We add a K-Means node to the stream from the "Modeling" tab and open its parameter dialog window (Fig. 7.54).
Fig. 7.51 Parameters of the Derive node to calculate the debt-income ratio
Fig. 7.52 Stream with another added Type node at the end
Fig. 7.53 Assigning scale types to “DEBTINCOMERATIO” and “EDUCATIONReclassified”
Fig. 7.54 Variables used in K-Means node
In the Fields tab, we add the variables previously discussed. To do so, we enable the option “Use custom settings” and click the variable selection button on the right. Both are marked with an arrow in Fig. 7.54.
Using the K-Means algorithm, we have to define the number of clusters in the tab "Model" of the node. The rule of thumb explained in Table 7.25 of Sect. 7.3 was

$\sqrt{\frac{\text{number of objects}}{2}} = \sqrt{\frac{850}{2}} \approx 20.62$

The derived 21 clusters are definitely too many, because we have to describe the characteristics of each customer cluster based on our knowledge, and the four variables used here do not allow us that precision. We should therefore start with a lower number and decide to use five clusters first (see Fig. 7.55). The option "Generate distance field" would give us the opportunity to calculate the Euclidean distance between each record and its assigned cluster center. These values are assigned to each record and appear in a Table node attached to the model nugget, in a variable called "$KMD-K-Means". We don't want to use this option here. For details, see IBM (2015a, pp. 231–232).
10. We can start the K-Means clustering with the "Run" button. A model nugget will be added to the stream (Fig. 7.56).
11. Double-clicking on the node, we can assess the clustering results (see Fig. 7.57). We can see on the left in the silhouette plot that the clustering quality is just "fair". On the right, the Modeler shows that there is a cluster 3 containing only 5.8% of the records.
Fig. 7.55 Cluster parameters used in K-Means node
Fig. 7.56 K-Means node and Model nugget in the stream
Fig. 7.57 Model summary for the five-cluster solution
To assess the model details, we choose the option "Clusters" in the left-hand corner of Fig. 7.57. In the left part of the window in the Model Viewer, we can find the details, as shown in Fig. 7.58. Obviously, the differences between the clusters are not that remarkable. So the first conclusion of our analysis is that we have to reduce the number of clusters. We also want to assess the quality of the predictors used to build the model, however. The SPSS Modeler offers this option in the drop-down list on the right-hand side of the window. It is marked with an arrow in Fig. 7.57. Figure 7.59 shows us that the clustering is dominated by the education of the customer. This is not very surprising, so the practical value of this finding is limited. Furthermore, in terms of clustering quality, we want more than one variable to have a significant influence in the process. The second conclusion of our analysis is that we should try to exclude the variable "EDUCATIONReclassified" from the K-Means clustering.
Fig. 7.58 Cluster details for the five-cluster solution
Fig. 7.59 Predictor importance in the five-cluster solution (EDUCATIONReclassified is the most important predictor, followed by AGE and YEARSEMPLOYED; DEBTINCOMERATIO is the least important)
We can close the Model viewer with “OK”.
"
Predictor importance, determined by the SPSS Modeler, can help identify the best variables for clustering purposes, but it is not in itself proof of the importance of a variable, in terms of improving the model accuracy. The importance value is just rescaled so that the sum of all predictors is one. A larger importance means one variable is more appropriate than another.
"
For categorical variables, the importance is calculated using Pearson’s chi-square. If the variables are continuous, an F-test is used. For calculation details, see IBM (2015a, pp. 89–91).
12. In the K-Means node, we remove the "EDUCATIONReclassified" variable from the variable list and define the number of clusters the algorithm should identify as four. The final settings are shown in Figs. 7.60 and 7.61. We click "Run" to start the modeling process again.
13. Figure 7.62 shows the summary of the model in the Model Viewer. The cluster quality has improved, as we can see in the silhouette plot; the smallest cluster now represents 15.5% of the 850 records.
14. If we use the drop-down list on the left for an assessment of the clusters, we get the values shown in Fig. 7.63. No cluster has characteristics similar to another one. Also, the importance of the predictors in Fig. 7.64 is well balanced.
15. The Model Viewer offers a wide range of options for analyzing the clusters. If we click on the button for the absolute distribution in the middle of the left window, marked with an arrow in Fig. 7.65, we can analyze the distribution of each variable per cluster on the right. To do so, we have to click one of the table cells on the left.
Fig. 7.60 Modified list of variables in the K-Means node
Fig. 7.61 Final model tab options in the K-Means node
Fig. 7.62 Model summary of the four-cluster solution
Fig. 7.63 Cluster details from the four-cluster solution. Cluster sizes and mean input values: cluster-3: 32.6% (277 records), AGE 26.66, YEARSEMPLOYED 3.30, DEBTINCOMERATIO 9.58; cluster-1: 31.5% (268), AGE 36.57, YEARSEMPLOYED 8.05, DEBTINCOMERATIO 6.40; cluster-4: 20.4% (173), AGE 44.25, YEARSEMPLOYED 18.46, DEBTINCOMERATIO 9.01; cluster-2: 15.5% (132), AGE 37.36, YEARSEMPLOYED 7.71, DEBTINCOMERATIO 20.60
Fig. 7.64 Predictor importance in the four-cluster solution (AGE is the most important predictor, followed by YEARSEMPLOYED; DEBTINCOMERATIO is the least important)
Fig. 7.65 Details of Predictor importance
Interpretation of the Determined Clusters
The clustering algorithms do not produce any description of the identified clusters, but in our example, Figs. 7.63 and 7.65 help us to characterize them very well. Table 7.28 summarizes the findings. Drawing practical conclusions is more difficult, however, as is often the case. Based on these results, we could think that customers assigned to cluster 4 are uninteresting to the bank. A more detailed analysis can help us be more precise. To get more information, we add a Table node and a Data Audit node to the stream and connect them with the model nugget (see Fig. 7.66). In the Table node of Fig. 7.67, we can find a column with the assigned cluster number. We remember that in the original dataset, a variable "Defaulted" was included (see Fig. 7.45). We should use these past defaults to find out more details related to our segmentation, which is based on the logic of the K-Means algorithm. Finally, we add a Matrix node and connect it with the model nugget; this node is shown at the end of the stream in Fig. 7.66. We discussed how to use a Matrix node in Sect. 4.7. There we also explained how to conduct a chi-square test of independence. Here, we assign the variable "DEFAULTED" to the rows and the cluster number to the columns (see Fig. 7.68).
Table 7.28 Cluster description of the customer segmentation (cluster numbers as shown in Fig. 7.63)
Cluster 3: Young customers with an average age of 27 years and therefore a relatively low number of years of employment (average 3.3 years). Remarkable debt-income ratio of 10%, meaning that 10% of the income is required to pay credit card or other debt.
Cluster 1: Middle-aged customers of approximately 36–37 years with a very low debt-income ratio of 6% on average.
Cluster 4: Older customers of about 44 years with a long employment history. Average debt-income ratio of 9%.
Cluster 2: Middle-aged customers, slightly older than those in cluster 1. In contrast, the number of years of employment is lower than in cluster 1. The average debt-income ratio is above 20%.
Fig. 7.66 Final K-Means customer segmentation stream
Fig. 7.67 Records with the assigned cluster number
Fig. 7.68 Variables are assigned to columns and rows in the Matrix node
To find out details of the dependency, we enable the option “Percentage in column” in the “Appearance” tab of the Matrix node (see Fig. 7.69). As shown in Fig. 7.70, the relative frequency of customers for whom we have no information regarding their default is approximately the same in each cluster. So this information gap does not affect our judgement very much.
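The same kind of analysis can be sketched with pandas and SciPy. The toy data below merely stands in for the scored records (default flag and assigned cluster); it is not the bank dataset.

```python
# Cross table of past defaults vs. assigned cluster, column percentages,
# and a chi-square test of independence (toy stand-in data).
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "DEFAULTED": [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    "CLUSTER":   [2, 1, 1, 2, 3, 2, 4, 3, 2, 1, 3, 4],
})
ct = pd.crosstab(df["DEFAULTED"], df["CLUSTER"])
print(ct / ct.sum(axis=0) * 100)          # percentage per column, as in Fig. 7.69
chi2, p, dof, _ = chi2_contingency(ct)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```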
Fig. 7.69 Percentage per column should be calculated in the Matrix node
Fig. 7.70 Result of Default analysis per cluster
More surprisingly, we can see that the default rates in cluster 1 and cluster 4 are relatively low. Because of this, customers in cluster 1 are a good target group for the bank: private banking services may generate profit here first, and the number of loans given to them can be increased. We described cluster 2 as the class of middle-aged customers with an average debt-income ratio of above 20%. These may be customers who bought a house or a flat and have large debts besides their credit card debt. The loss in case of a default is high, but on the other hand the bank would be able to generate high profit. Because of the very high default rate of above 43%, however, we recommend treating these customers separately and paying close attention to them. "
Cluster analysis identifies groups in data. To analyze the characteristics of these groups, or the differences between groups, the following approaches can be used:
1. If the variables that should be used for further analysis are nominal or ordinal, a Matrix node must be used. A chi-square test of independence can also be performed here.
2. Often it is also necessary to calculate several additional measures, e.g., the mean of the credit card debts per cluster. Here, the Means node can be used to determine the averages and additionally to perform a t-test or a one-way ANOVA. See Exercise 1 in Sect. 7.4.3 for details.
The characteristics of cluster 3, with its young customers, its lower debt-income ratio, and the default rate of 32%, are different from cluster 2, where in some specific cases we found it better to discontinue the business relationship. These customers may default very often, but the loss to the bank is relatively low; probably, payments for loans with lower rates are only delayed. The bank would do very well to support this group because of the potential to generate future profit.
Summary
We separated groups of objects in a dataset using the K-Means algorithm. To do this, we assessed the variables regarding their adequacy for clustering purposes. This process is more or less a decision of the researcher, based on knowledge of the practical background. New measures must be calculated, however, to condense information from different variables into one indicator. The disadvantage of the algorithm is that the number of clusters must be defined in advance. Using statistical measures, e.g., the silhouette, as well as practical knowledge or expertise, we found an appropriate solution. In the end, the clusters must be described by assessing their characteristics in the model viewer of the K-Means nugget node. Knowledge of the practical background is critically important for finding an appropriate description for each cluster.
7.4.3 Exercises
Exercise 1: Calculating Means Per Cluster
Using the K-Means algorithm, we identified four groups of customers. Assessing the results and characterizing the clusters, we found that customers in cluster 2 are middle-aged with an average debt-income ratio of above 20%. We speculated about credit card and other debts. Probably, these customers bought a house or a flat and have large "other" debts in comparison to their credit card debt.
1. Open the stream "customer_bank_segmentation_K_means.str" we created in Sect. 7.4.2. Determine the average credit card debt and the average other debt, using a Means node from the Output tab of the Modeler.
2. Assess the result and explain whether cluster 2 does indeed have remarkably higher other debts.
Exercise 2: Improving Clustering by Calculating Additional Measures
In Sect. 7.4.2, we created a stream for customer segmentation purposes. We assessed all variables and defined a new debt-income ratio. The idea behind this measure is that a customer with a higher income is able to service higher debts. In the end, we can interpret the customer segmentation result; however, the silhouette plot in Fig. 7.62 lets us assume that the fair model quality can be improved. Here, we want to show how additional well-defined measures can improve the clustering quality.
1. Open the stream "customer_bank_segmentation_K_means.str" we created in Sect. 7.4.2. Assess the variables included in the original dataset.
2. Obviously, older customers normally have a longer employment history and therefore a higher number of years in employment. So the idea is to calculate the ratio of years employed to the age of the customer. Add this new variable (name "EMPLOY_RATIO") to the stream.
3. Now update the clustering, using the new variable instead of the variables "AGE" and "YEARSEMPLOYED" separately. Explain your findings using the new variable.
Exercise 3: Comparing K-Means and TwoStep Results
In the previous section, we used the K-Means algorithm to create a stream for customer segmentation purposes. In this exercise, we want to examine whether the TwoStep algorithm discussed in Sect. 7.3 leads to the same results.
1. Open the stream "customer_bank_segmentation_K_means.str" we created in Sect. 7.4.2. Modify the stream so that a cluster model based on TwoStep will also be calculated. Bear in mind that the TwoStep algorithm assumes normally distributed values.
2. Now consolidate the results of both models so they can be analyzed together. Use a Merge node.
3. Now add the necessary nodes to analyze the results. Also add nodes to show the default rates per cluster, depending on the cluster method used.
4. Assess the clustering calculated by the TwoStep algorithm and compare the results with the findings presented in Sect. 7.4.2, using K-Means. Exercise 4: Clustering on PCA Results In Chap. 6, we discussed principal component analysis (PCA) and principal factor analysis (PFA), as different types of factor analyses. PCA is used more often. The factors determined by a principal component analysis (PCA) can be described as “a general description of the common variance”. If the fluctuation of a set of variables is somehow similar, then behind these variables a common “factor” can be assumed. The factors are the explanation and/or reason for the original fluctuation of the input variables and can be used to represent or substitute them in further analyses. In this exercise, PCA and K-Means should be combined to find clusters of people with homogeneous dietary characteristics. 1. Open the stream “pca_nutrition_habits.str”, created and discussed intensively in Sect. 6.3. Save it under another name. 2. Recap the aim of factor analysis and especially describe the meaning of the factor scores. Furthermore, outline the aim of cluster analysis in your own words and explain the difference from factor analysis in general. 3. Based on the factor scores, a cluster analysis should now be performed. In the 2D plot of factor scores in Fig. 6.43, we found out that three clusters are most appropriate for describing the structure of the customers based on their dietary characteristics. Now please extend the stream to perform the cluster analysis using K-Means. Show the result in an appropriate diagram. 4. The factor scores, or more accurately the principal component scores, express the input variables in terms of the determined factors, by reducing the amount of information represented. The loss of information depends on the number of factors extracted and used in the formula. The factor scores are standardized. So they have a mean of zero and a standard deviation of one. Other multivariate methods, such as cluster analysis, can be used based on the factor scores. The reduced information and the reduced number of variables (factors) can help more complex algorithms to converge or to converge faster. Outline why it could be helpful and sometimes necessary to use PCA, and then based on the factor scores, a clustering algorithm. Create a “big picture” to visualize the process of how to decide when to combine PCA and K-Means or TwoStep. Exercise 5: Determining the Optimal Number of Clusters K-Means does not determine the number of clusters to use for an optimal model. In Sect. 7.3.1, we explained the theory of clustering and discussed several methods for determining rules or criterions that can help to solve this problem. Table 7.25 shows different approaches.
To measure the goodness of a classification, we can determine the average distance of each object from the other objects in the same cluster and the average distance from the other clusters. Based on these values, the silhouette S can be calculated. More details can be found in Struyf et al. (1997, pp. 5–7). As outlined in IBM (2015b, p. 77) and IBM (2015b, p. 209), the IBM SPSS Modeler calculates a Silhouette Ranking Measure based on this silhouette value. It is a measure of cohesion in the clusters and separation between the clusters. Additionally, it provides thresholds for poor (up to +0.25), fair (+0.25 to +0.5), and good (above +0.5) models. See also Table 7.26. The SPSS Modeler shows the silhouette value in the model summary of the Model Viewer on the left. To get the precise silhouette value, the option "Copy Visualization Data" can be used. To do this, the second button from the left in the upper part of the Model Viewer should be clicked. Then the copied values must be pasted into simple word processing software. Table 7.26 shows the result. The aim of this exercise is to assess the dependency between the silhouette value and the number of clusters, and thereby to find an appropriate number of clusters when using the K-Means algorithm.
1. The data can be found in the SPSS Statistics file "nutrition_habites.sav". The Stream "Template-Stream_nutrition_habits" uses this dataset. Please open the template stream. Save the stream under another name. Using the dietary types "vegetarian", "low meat", "fast food", "filling", and "hearty", the consumers were asked "Please indicate which of the following dietary characteristics describes your preferences. How often do you eat . . .". The respondents had the chance to rate their preferences on a scale of "(very) often", "sometimes", and "never". The variables are coded as follows: "1 = never", "2 = sometimes", and "3 = (very) often". See also Sect. 12.1.28. The K-Means clustering algorithm should be used to determine consumer segments, based on this data. Add a K-Means node to the stream.
2. Now the dependency between the silhouette value and the number of clusters should be determined. Create a table and a diagram that show this dependency, using spreadsheet software, e.g., Microsoft Excel. Start with two clusters.
3. Explain your findings and determine an appropriate number of clusters, also keeping the background of the data in mind.
4. OPTIONAL: Repeat the steps by using the TwoStep algorithm and explain your findings.
7.4.4 Solutions
Exercise 1: Calculating Means Per Cluster
Name of the solution streams: customer_bank_segmentation_K_means_extended_1 and the Microsoft Excel file with an ANOVA in "customer_bank_data_ANOVA.xlsx"
Theory discussed in: Sect. 7.4.2
1. As described in the exercise, we add a Means node to the existing stream and connect it with the Model nugget node. Figure 7.71 shows the extended stream, and Fig. 7.72 shows the parameters of the Means node. Running the node, we get a result as shown in Fig. 7.73.
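A rough outside-the-Modeler equivalent of what the Means node computes, per-cluster averages plus the one-way ANOVA discussed below, might look like this; the data is synthetic and merely stands in for the records scored by the K-Means nugget.

```python
# Per-cluster means of OTHERDEBT and a one-way ANOVA across the clusters.
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "CLUSTER":   rng.integers(1, 5, 400),   # cluster labels 1..4
    "OTHERDEBT": rng.gamma(2.0, 2.0, 400),  # skewed debt amounts
})
print(df.groupby("CLUSTER")["OTHERDEBT"].mean())

groups = [g["OTHERDEBT"].to_numpy() for _, g in df.groupby("CLUSTER")]
F, p = f_oneway(*groups)
print(f"F={F:.2f}, p={p:.4f}")
```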
Fig. 7.71 Means node is added to the stream
Fig. 7.72 Parameters of the Means node
Fig. 7.73 Averages between clusters in the Means node
2. The results in Fig. 7.73 confirm our suspicion that customers in cluster 2 have high other debts. The Means node performs a t-test in the case of two different clusters and a one-way ANOVA if there are more than two clusters or groups. See IBM (2015c, pp. 298–300). The test tries to determine if there are differences between the means of the groups. In our case, the reported probability that the means are the same is either very high or practically zero, depending on the variable. The significance levels can be defined in the Options tab of the Means node. More detailed statistics can also be found in the Microsoft Excel file "customer_bank_data_ANOVA.xlsx".
Exercise 2: Improving Clustering by Calculating Additional Measures
Name of the solution stream: customer_bank_segmentation_K_means_extended_2
Theory discussed in: Sect. 7.4.2
1. Using the right Table node in the final stream, depicted in Fig. 7.66, we get the variables included in the original dataset and the cluster numbers, as shown in Fig. 7.74. The variables of interest here are "AGE" and "YEARSEMPLOYED".
2. To add a Derive node to the stream, we remove the connection between the Derive node for the "DEBTINCOMERATIO" and the second Type node. The name of the new variable is "EMPLOY_RATIO" and the formula is "YEARSEMPLOYED/AGE". This is also shown in Fig. 7.75. We connect the new node with the rest of the stream. Figure 7.76 shows the final stream with the new Derive node in its middle.
Fig. 7.74 Variables and cluster numbers
Fig. 7.75 Parameters of the added Derive node
Fig. 7.76 Stream is extended with the added variable “EMPLOY_RATIO”
Fig. 7.77 Parameters of the updated K-Means node
3. Now we update the clustering node by removing the variables "AGE" and "YEARSEMPLOYED". Subsequently, we add the new variable "EMPLOY_RATIO" (see Fig. 7.77). Running the stream, we can see in the model viewer that the quality of the clustering has improved, based on an assessment of the silhouette plot in Fig. 7.78. In comparison to the previous results presented in Fig. 7.62, cluster 1 is larger with 37.4% (previously 31.5%), whereas all other clusters are smaller. Using ratios or calculated measures can improve the quality of clustering, but the result may be harder to interpret because of the increased complexity of the newly calculated variable. Additionally, the segmentation can be totally different, as we can see here: although the percentages of records per cluster in Fig. 7.78 suggest that only some records have been reassigned to other clusters, a detailed analysis shows another picture. Summarizing the results in Figs. 7.79 and 7.80, we can find two interesting groups in terms of risk management. The younger customers in cluster 2 have a remarkable debt-income ratio and a default rate of 28%. Every second customer assigned to cluster 4 defaulted in the past. The customers here are older and have a lower debt-income ratio of 17%. Clusters 1 and 3 consist of customers that are good targets for new bank promotions.
Fig. 7.78 New cluster result overview
Fig. 7.79 Details per cluster in the Model Viewer
Fig. 7.80 Default rate per cluster
Exercise 3: Comparing K-Means and TwoStep Results
Name of the solution stream: customer_bank_segmentation_clustering_comparison
Theory discussed in: Sect. 7.3 and Sect. 7.4
In the given stream, we used "AGE", "YEARSEMPLOYED", and "DEBTINCOMERATIO" as input variables. The TwoStep algorithm expects these variables to be normally distributed. That is why we have to assess and transform them using a Transform node. The detailed procedure is described in Sect. 3.2.5 as well as in Exercise 9 in Sects. 3.2.8 and 3.2.9. At the end of the exercise, we can verify whether using the Euclidean distance measure helps us to produce better clustering results, since it avoids the normality assumption required for the log-likelihood distance measure. We did not modify the variables mentioned above for the K-Means node, because this makes interpreting the clustering results much easier.
Fig. 7.81 Original K-Means stream
Fig. 7.82 Transform and Data Audit nodes are added
2. As described in detail in Sect. 3.2.5, we assess the distribution of the variables "AGE", "YEARSEMPLOYED", and "DEBTINCOMERATIO" in the Transform node and their skewness in the Data Audit node. To do this, we add the three variables in the Transform node (see Fig. 7.83).
3. The assessment and the steps for adding a SuperNode for the transformation of "AGE" and "DEBTINCOMERATIO" can be found in Sect. 3.2.5 as well as in Exercise 9 in Sect. 3.2.7. Figure 7.84 shows the main aspects. We use a non-standardized transformation and create the SuperNode.
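Conceptually, the Transform node assesses the skewness of each variable and offers transformations toward normality. Below is a minimal Python sketch of this assessment, reusing the customers DataFrame assumed in the sketch of the previous exercise and applying the non-standardized square-root transformation used here for AGE and DEBTINCOMERATIO.

```python
# Minimal sketch: check skewness and apply a square-root transformation
import numpy as np

for col in ["AGE", "YEARSEMPLOYED", "DEBTINCOMERATIO"]:
    print(col, "skewness:", round(customers[col].skew(), 3))

# Non-standardized square-root transformation, as created in the SuperNode above
customers["AGE_SquareRoot"] = np.sqrt(customers["AGE"])
customers["DEBTINCOMERATIO_SquareRoot"] = np.sqrt(customers["DEBTINCOMERATIO"])

# Skewness after the transformation should be closer to zero
print(customers[["AGE_SquareRoot", "DEBTINCOMERATIO_SquareRoot"]].skew())
```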
Fig. 7.83 Parameters of the Transform node
4. Now we are able to use the transformed variables for clustering purposes in a TwoStep node, so we connect the SuperNode with the original Type node. To make sure the scale definitions of the transformed variables are correct, we must add another Type node after the SuperNode. Finally, we can add a TwoStep node. All these steps are shown in Fig. 7.85.
5. Additionally, we have to change the number of clusters the TwoStep node should determine. The number of clusters should be four, as shown in Fig. 7.86. By default, TwoStep would offer only three clusters here.
6. We run the TwoStep node to get its model nugget.
7. To analyze the results of both streams, we must merge the results of both sub-streams. We add a Merge node to the stream from the Record Ops tab (see Fig. 7.87).
8. As outlined in Sect. 2.7.9, it is important to remove duplicated variables in the settings of this node. Otherwise the stream will no longer work correctly. Figure 7.88 shows that we decided to remove the variables coming from the K-Means sub-stream.
9. Theoretically, we could now start to add the typical nodes for analyzing the data, but as we have a lot of different variables, reordering them is helpful. Therefore, we add a Field Reorder node from the Field Ops tab (see Fig. 7.89). Figure 7.90 shows the parameters of this node.
10. Finally, we add a Table node, a Data Audit node, and two Matrix nodes. The parameters of the Matrix nodes correspond to those shown in Figs. 7.68 and 7.69; the second Matrix node uses the TwoStep cluster assignments saved in the variable "$T-TwoStep".
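The comparison of the two segmentations can also be illustrated in Python. The sketch below assumes the two cluster assignments have been added to the customers DataFrame as columns cluster_kmeans and cluster_twostep (hypothetical names; in the Modeler the TwoStep assignments appear in "$T-TwoStep").

```python
# Minimal sketch: cross-tabulate and score the agreement of two segmentations
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Cross tabulation of the two cluster assignments, analogous to inspecting the merged table
print(pd.crosstab(customers["cluster_kmeans"], customers["cluster_twostep"]))

# A single agreement measure between the two segmentations (1.0 = identical partitions)
print(adjusted_rand_score(customers["cluster_kmeans"], customers["cluster_twostep"]))
```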
Fig. 7.84 Distribution analysis in the Transform node
Fig. 7.85 Clustering stream is extended
Fig. 7.86 Parameters of the TwoStep node are modified
Fig. 7.87 Stream with added Merge node for comparing K-Means and TwoStep results
Fig. 7.88 Settings of the Merge node for removing duplicated variables
Fig. 7.89 Final Stream to compare K-Means and TwoStep results
Fig. 7.90 Parameters of the Field Reorder node
11. The Table node in Fig. 7.91 suggests that probably only the cluster names have been rearranged. Using the Data Audit node, however, we find that the frequency distributions per cluster number are slightly different for the two methods. Comparing the distributions in Fig. 7.92 for K-Means and Fig. 7.93 for TwoStep, we can see differences in the sizes of the clusters. Figure 7.94 shows once more the default rates for K-Means in a Matrix node, as previously analyzed in Fig. 7.70. As we explained in Sect. 7.4.2, analyzing the percentage per column is useful for checking whether or not the default rates are independent of the cluster numbers. In Figs. 7.94 and 7.95, we can see that the default rates depend on the cluster membership for both methods.
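The column-percentage analysis and the test of independence can be reproduced with a few lines of Python. The sketch below assumes the default indicator is stored in a column named DEFAULTED and the K-Means assignments in cluster_kmeans (both column names are assumptions); a standard chi-square test of independence, as also reported by the Matrix node, can be computed with scipy's chi2_contingency.

```python
# Minimal sketch: default rate per cluster and chi-square test of independence
import pandas as pd
from scipy.stats import chi2_contingency

table = pd.crosstab(customers["DEFAULTED"], customers["cluster_kmeans"])
print(table)

# Column percentages, i.e., the default rate per cluster
print(table.div(table.sum(axis=0), axis=1).round(3))

chi2, p_value, dof, expected = chi2_contingency(table)
print("chi2 =", round(chi2, 2), "p-value =", round(p_value, 4))
```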
Fig. 7.91 Cluster numbers in the Table node
Fig. 7.92 Frequency per cluster determined by K-Means
Fig. 7.93 Frequency per cluster determined by TwoStep
Fig. 7.94 Default analysis per cluster, based on K-Means
Inspecting the distribution details for each variable, Fig. 7.96 shows, in comparison with Fig. 7.64, that the importance of the variable "YEARSEMPLOYED" has increased; previously, this variable was ranked second.
Fig. 7.95 Default analysis per cluster, based on TwoStep, with log-likelihood distance measure
Fig. 7.96 Predictor importance in the TwoStep model viewer (bar chart of the predictors "YEARSEMPLOYED", "AGE_Square Root", and "DEBTINCOMERATIO_Square Root" on a scale from least important to most important)
Summarizing all our findings, which are also shown in Fig. 7.97 for the TwoStep algorithm, we can say that the use of different methods leads to significantly different clustering results. To meet the assumption of normally distributed variables, TwoStep requires us to transform the input variables, but even then they do not exactly meet this assumption.
Fig. 7.97 Detailed analysis of the TwoStep result in the Model Viewer
As outlined in Sect. 7.3.2, we can use the Euclidean distance measure instead. Activating this distance measure in the TwoStep node, however, dramatically decreases the quality of the clustering result. Here, we get a silhouette measure of 0.2 and one very small cluster that represents only 3.9% of the customers. These results are unsatisfying. We therefore decided to use the log-likelihood distance measure, even though we cannot completely meet the assumption of normally distributed variables here. To decide which model is the best, we would need to assess the clusters by inspecting the assigned records in detail; this is beyond the scope of this book. The higher importance of "YEARSEMPLOYED" seems to address the risk aspect of the model better than the higher-ranked "AGE" in K-Means, but the smaller differences in "DEBTINCOMERATIO" per cluster for TwoStep separate the subgroups less well.
Exercise 4: Clustering Based on PCA Results
Name of the solution streams: k_means_nutrition_habits
Theory discussed in section: Sect. 6.3, Sect. 7.4.2
Fig. 7.98 Factor Scores defined by the PCA
1. The solution can be found in the stream "k_means_nutrition_habits".
2. The user tries to determine factors that explain the common variance of several subsets of variables. Factor analysis can thus be used to reduce the number of variables. The factor scores, or more accurately the principal component scores, shown in Fig. 7.98, express the input variables in terms of the determined factors, thereby reducing the amount of information represented. The loss of information depends on the number of factors extracted and used in the formula. The reduced information and the reduced number of variables (factors) can help more complex algorithms, such as cluster analysis, to converge or to converge faster. Cluster analysis represents a class of multivariate statistical methods. The aim is to identify subgroups/clusters of objects in the data. Each object is assigned to a cluster based on similarity or dissimilarity/distance measures. So the difference between, let's say, PCA and K-Means is that PCA helps us to reduce the number of variables, whereas K-Means reduces the number of objects we have to look at, here by defining consumer segments based on their nutrition habits.
3. Figure 7.99 shows the extended stream. First we added a Type node on the right. This is to ensure the correct scale type is assigned to the factor scores defined by the PCA/Factor node. Figures 7.100 and 7.101 show the parameters of the K-Means node. First we defined the factor scores to be used for clustering purposes, and then we set the number of clusters to three. This is explained in Sect. 6.3.2 and shown in Fig. 6.43. Figure 7.102 shows the results of the K-Means clustering. Judging from the silhouette plot, the model is of good quality. The determined clusters are shown in a 2D-plot using a Plot node. Figure 7.103 shows the parameters of this node. Here, we used the size of the bubbles to visualize the cluster number, owing to the printing restrictions of this book; of course, different colors could be used instead.
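Steps 2 and 3 can also be illustrated outside the Modeler. The following minimal Python sketch assumes a DataFrame survey with the nutrition variables; scikit-learn's principal component scores will not be numerically identical to the scores produced by the PCA/Factor node, but they play the same role, and colors are used in the plot instead of bubble sizes.

```python
# Minimal sketch: component ("factor") scores, K-Means on the scores, 2D plot
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(survey)
scores = PCA(n_components=2).fit_transform(X_std)            # component scores
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(scores)

plt.scatter(scores[:, 0], scores[:, 1], c=labels, s=40)      # color = cluster number
plt.xlabel("Factor score 1")
plt.ylabel("Factor score 2")
plt.show()
```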
Fig. 7.99 Extended stream “pca_nutrition_habits.str”
Fig. 7.100 Parameters in the Field tab of the K-Means node
As expected, and as previously defined in Fig. 6.43, K-Means finds three clusters of respondents (see Fig. 7.104). The cluster descriptions can be found in Table 6.1.
Fig. 7.101 Parameters in the Model tab of the K-Means node
Fig. 7.102 Results of the K-Means algorithm shown in the model viewer
Fig. 7.103 Parameters of the Plot node to visualize the results of K-Means
Fig. 7.104 Clusters of respondents based on their dietary habits
Fig. 7.105 Process for combining PCA and cluster algorithms (decision flow: starting from a dataset with many variables, check for performance problems when using clustering methods; if yes, reduce the number of variables with PCA and cluster based on the standardized factor scores; if no, cluster based on the standardized variables)
4. Clustering data is complex. The TwoStep algorithm is based on a tree structure, to manage the complexity of the huge number of proximity measures and comparisons between the different objects. K-Means is based on a more pragmatic approach that determines cluster centers before starting to assign the records. With either approach, however, cluster analysis can often run into performance problems. Here, PCA can help to reduce the number of variables or, more to the point, to determine the variables that are most important. Figure 7.105 shows the process for combining PCA and clustering algorithms. Using the original data to identify clusters leads to the most precise results, but in the case of large datasets, PCA can make these algorithms easier to apply. Details can be found in Ding and He (2004).
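A minimal Python sketch of the decision in Fig. 7.105 follows, assuming X_std is a (large) standardized data matrix; the timing comparison simply illustrates why replacing many variables by a few component scores can make clustering feasible, and the numbers will of course depend on the data and hardware.

```python
# Minimal sketch: K-Means on all standardized variables vs. on PCA scores
import time
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def timed_kmeans(data):
    start = time.time()
    KMeans(n_clusters=3, n_init=10, random_state=1).fit(data)
    return round(time.time() - start, 2)

print("all variables:", timed_kmeans(X_std), "s")

scores = PCA(n_components=2).fit_transform(X_std)   # complexity reduction with PCA
print("PCA scores:   ", timed_kmeans(scores), "s")
```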
Exercise 5: Determining the Optimal Number of Clusters
Name of the solution file: Stream: "kmeans_cluster_nutrition_habits.str"; Microsoft Excel: "kmeans_cluster_nutrition_habits.xlsx"
Theory discussed in section: Sect. 7.3.1, Table 7.25; Sect. 7.4.2
Remark In this exercise, we manually specify the number of clusters to determine and fit a model for each cluster number. In the solution to Exercise 1 in Sect. 7.5.4, we demonstrate how to use the Auto Cluster node for the same procedure.
1. We open the stream "Template-Stream_nutrition_habits" and save it under the new name "kmeans_cluster_nutrition_habits.str". Before we start to add the
Fig. 7.106 Scale type settings in the Type node
K-Means node, we should check the scale type defined in the Type node (see Fig. 7.106). Now we can add a K-Means node to the stream. To be sure the correct variables are used for the cluster analysis, we add them in the Fields tab of the K-Means node (see Fig. 7.107).
2. We should determine the dependency between the number of clusters and the silhouette value. To do that, we start with two clusters, as shown in Fig. 7.108. Running the K-Means node, we get the final stream with the model nugget, as depicted in Fig. 7.109. By double-clicking on the model nugget, we get a model summary in the Model Viewer, as in Fig. 7.110. The quality of the clustering is fair, judging from the silhouette plot. As described in Sect. 7.3.3, we get the precise silhouette value by using the button "Copy Visualization Data", highlighted with an arrow in Fig. 7.110. We paste the copied data into a text processing application, e.g., Microsoft Word. Here, the silhouette value is 0.4395. There is no need to assess the clusters in detail here; we are only interested in the silhouette values depending on the number of clusters. So we repeat the procedure for all other cluster numbers from 3 to 13. Table 7.29 shows the silhouette measure vs. the number of clusters determined, and Fig. 7.111 shows a 2D-plot of the data.
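The same experiment can be scripted outside the Modeler: fit K-Means for k = 2 to 13 and record a silhouette value for each k. In the minimal sketch below, X is assumed to hold the input variables defined in the Fields tab; scikit-learn's silhouette is computed from pairwise distances, so the values will not reproduce Table 7.29 exactly, although the overall tendency should be comparable.

```python
# Minimal sketch: silhouette value as a function of the number of clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

results = {}
for k in range(2, 14):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    results[k] = round(silhouette_score(X, labels), 4)

print(results)   # dictionary: number of clusters -> silhouette value
```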
Fig. 7.107 Fields tab of the K-Means node
Fig. 7.108 Model tab of the K-Means node
Fig. 7.109 Final stream “kmeans_cluster_nutrition_habits.str”
Fig. 7.110 Summary of K-Means clustering in the Model Viewer
3. In the solution to Exercise 3 in Sect. 7.3.5, we found that five clusters are difficult to characterize. The graph for K-Means tells us that there is an approximately linear dependency between the number of clusters and the silhouette value. Using a simple regression function, we find that with each additional cluster, the quality of the clustering improves by 0.0494, in terms of the silhouette measure. Reducing the number of clusters from five to four, however, results in a very low silhouette measure of 0.5337. So when using K-Means, the better option is to determine three clusters.
Table 7.29 Dependency of silhouette measure and the number of clusters determined by K-Means

Number of clusters | Silhouette measure of cohesion and separation
2  | 0.4395
3  | 0.5773
4  | 0.5337
5  | 0.7203
6  | 0.7278
7  | 0.8096
8  | 0.8494
9  | 0.8614
10 | 0.9116
11 | 0.9658
12 | 0.9708
13 | 1.0000
Fig. 7.111 Graph of silhouette measure vs. number of clusters determined by K-Means
Fig. 7.112 Graph of silhouette measure vs. number of clusters determined by TwoStep
4. Figure 7.112 shows results from using the TwoStep algorithm for clustering the data. The average difference in the silhouette value is 0.0525 when increasing the number of clusters by one. The linear character of the curve is clear. Here, we can
modify the number of clusters based also on the application background of the clustering algorithm. As suggested in Exercise 3 of Sect. 7.3.5, we can also use three or four clusters, if we think this is more appropriate.
7.5 Auto Clustering
7.5.1 Motivation and Implementation of the Auto Cluster Node
General Motivation
In the previous sections, we intensively discussed using the TwoStep and K-Means algorithms to cluster data. An advantage of the TwoStep implementation in the SPSS Modeler is its ability to identify the optimal number of clusters to use. The user will get an idea of which number will probably fit the data best. Although K-Means is widely used, it does not provide this convenient option.
Based on practical experience, we believe the decision on how many clusters to determine should be made first. Additionally, statistical measures such as the silhouette value give the user the chance to assess the goodness of fit of the model. We discussed the dependency between the number of clusters and the clustering quality, in terms of the silhouette value, in more detail in Exercise 5 of Sect. 7.4.3. Determining different models with different cluster numbers, and assessing the distribution of the variables within the clusters, or profiling the clusters, eventually leads to an appropriate solution. In practice, realizing this process takes a lot of experience and time. Here, the idea of supporting the user by offering an Auto Cluster node seems to be a good one, especially if different clustering algorithms are to be tested. We will show how to apply this node here and then summarize our findings.
Implementation Details
The Auto Cluster node combines the functionalities of the TwoStep node, the K-Means node, and a node called Kohonen. The functionality of TwoStep as a hierarchical agglomerative algorithm is intensively discussed in Sect. 7.3. The K-Means algorithm and its implementation are also explained in Sect. 7.4. The Auto Cluster node also uses the partitioning K-Means clustering algorithm; here too, the user must define the number of clusters in advance. Models can be selected based on several goodness criteria.
Kohonen is the only algorithm that we have not discussed in detail so far. Table 7.6 outlines some details. In this special type of neural network, an unsupervised learning procedure is performed, so no target variable is necessary. The input variables defined by the user build an input vector. This input vector is presented to the input layer of a neural network. This layer is connected to a second, output layer. The parameters in this output layer are then adjusted using a learning procedure, so that they learn the different patterns included in the input data. Neurons in the output layer that are unnecessary are removed from the network.
Fig. 7.113 Visualization of Kohonen’s SOM algorithm used for clustering purposes
After this learning procedure, new input vectors are presented to the model. The output layer tries to determine a winning neuron that represents the most similar pattern previously learned. The procedure is also depicted in Fig. 7.113. The output layer is a two-dimensional map. Here, the winning neuron is represented by its coordinates X and Y. The different combinations of the coordinates of the winning neuron are the categories or the clusters recognized by the algorithm in the input data. Interested readers are referred to Kohonen (2001) for more details. "
The Kohonen network is the implementation of an unsupervised learning algorithm, in the form of a neural network. Input vectors of an n-dimensional space are mapped onto a two-dimensional output space. This is called a “self-organizing” map.
"
The network tries to learn patterns included in the input data. Afterward, new vectors can be presented to the algorithm, and the network determines a winning neuron that represents the most similar pattern learned. The different combinations of the coordinates of the winning neuron equal the number or the name of the cluster. The number of neurons can be determined by restricting the width and the length of the output layer.
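To make the idea of the self-organizing map more concrete, the following is a minimal NumPy sketch of the learning rule described above. It is an illustration only and not IBM's Kohonen implementation; the grid size, learning rate, neighborhood width, and random input data are arbitrary assumptions.

```python
# Minimal sketch of a self-organizing map: a small 2D grid of neurons is pulled
# toward the input vectors; a record is assigned to the coordinates of its winner.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))            # assumed: 200 records with 5 inputs
grid_w, grid_h, n_iter, lr, sigma = 4, 3, 2000, 0.5, 1.0

# One weight vector per output neuron, addressed by its (x, y) grid coordinate
weights = rng.normal(size=(grid_w, grid_h, X.shape[1]))
coords = np.array([[x, y] for x in range(grid_w) for y in range(grid_h)])

for t in range(n_iter):
    v = X[rng.integers(len(X))]
    # Winning neuron = neuron with the closest weight vector
    dists = np.linalg.norm(weights.reshape(-1, X.shape[1]) - v, axis=1)
    winner = coords[dists.argmin()]
    # Neighborhood function: neurons close to the winner on the grid learn more
    grid_dist = np.linalg.norm(coords - winner, axis=1)
    h = np.exp(-grid_dist**2 / (2 * sigma**2)).reshape(grid_w, grid_h, 1)
    decay = 1 - t / n_iter
    weights += decay * lr * h * (v - weights)

def winning_neuron(v):
    # "Cluster" of a record = grid coordinates of its winning neuron
    d = np.linalg.norm(weights.reshape(-1, X.shape[1]) - v, axis=1)
    return tuple(coords[d.argmin()])

print(winning_neuron(X[0]))
```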
"
The Auto Cluster node offers the application of TwoStep, K-Means, and Kohonen node functionalities for use with data. TwoStep and the Kohonen implementation in the SPSS Modeler determine the “optimal” number of clusters automatically. The user must use K-Means to choose the number of clusters to determine.
"
Using the Auto Cluster node allows the user to steer three algorithms at the same time. Several options allow the user to define selection criteria for the models tested and presented.
7.5.2 Building a Model in SPSS Modeler
We would like to show the Auto Cluster node in use, by clustering a dataset representing diabetes data from a Pima Indian population near Phoenix, Arizona. A detailed description of the variables included in the dataset can be found in Sect. 12.1.10. Using several variables from this dataset, we should be able to make predictions about the individual risk of suffering from diabetes. Here, we would like to cluster the population, so we look for typical characteristics. It is not the aim of this section to go into the medical details. Rather, we would like to identify the best possible algorithm for clustering the data and identify the advantages and disadvantages of the SPSS Modeler's Auto Cluster node.
As we extensively discussed in Sect. 7.3.2, and in the solution to Exercise 3 in Sect. 7.4.4, the TwoStep algorithm, which is also implemented in the Auto Cluster node, needs normally distributed variables. Using the Euclidean distance measure instead of the log-likelihood measure seems to be a bad choice too, as shown at the end of Exercise 3 in Sect. 7.4.4. Therefore, we want to create a stream for clustering the data based on the findings in Sect. 3.2.5 "Concept of 'SuperNodes' and Transforming Variable to Normality". There, we assessed the variables "glucose_concentration", "blood_pressure", "serum_insulin", "BMI", and "diabetes_pedigree" and transformed them toward a normal distribution.
Fig. 7.114 Initial template stream “transform_diabetes”
Fig. 7.115 Sample records in the Table node
Fig. 7.116 Frequency distributions in the Data Audit node
we can verify the cluster quality by determining the frequency distribution of class_variable/test result in the different clusters. 3. To use the transformed variables in the Auto Cluster node later, we must add a Type node behind the SuperNode (see Fig. 7.117). In the Type node, no modifications are necessary. 4. We now have to define the variables used for clustering purposes, however. The class_variable that represents the test result as the target variable does not have to be included as an input variable in the cluster node. That’s because cluster analysis is an unsupervised learning procedure. We add an Auto Cluster node from the Modeling tab of the SPSS Modeler and connect it with the Type node (see Fig. 7.117). As we do not want to use the settings of the Type node for the role of the variable, we open the Auto Cluster node by double-clicking on it.
Fig. 7.117 An Auto Cluster node is added to the stream
Fig. 7.118 Fields tab in the Auto Cluster node
5. In the Fields tab of the Auto Cluster node, we activate the option "Use custom settings" and determine the "class_variable" as the evaluation variable (see Fig. 7.118).
6. By analyzing the meaning of the variables described in Sect. 12.1.10, we can state that the following variables should be helpful for determining segments of patients according to their risk of suffering from diabetes:
– transformed plasma glucose concentration in an oral glucose tolerance test ("glucose_concentration_Log10"),
– diastolic blood pressure (mm Hg, "blood_pressure"),
– transformed 2-hour serum insulin (mu U/ml, "serum_insulin_Log10"),
– transformed body mass index (weight in kg/(height in m)^2, "BMI_Log10"), and
– transformed diabetes pedigree function (DBF, "diabetes_pedigree_Log10").
This selection is based on different pre-tests. We can also imagine using the other variables in the dataset. The variable settings in the Auto Cluster node in Fig. 7.118, however, should be a good starting point.
7. We do not have to modify any option in the Model tab of the Auto Cluster node (see Fig. 7.119). It is important to note, however, that the option "Number of models to keep" determines the number of models that will be saved and presented to the user later in the model nugget.
Fig. 7.119 Model tab in the Auto Cluster node
"
The option “number of models to keep” determines the number of models that will be saved. Depending on the criteria for ranking the models, only this number of models will be presented to the user.
8. The Expert tab (Fig. 7.120) gives us the chance to disable the usage of individual algorithms and to define stopping rules in the case of huge datasets. We want to test all three algorithms, TwoStep, K-Means, and Kohonen, as outlined in the introduction to this section. We must pay attention to the column "Model parameters", though. For the K-Means algorithm, we need to define the number of clusters to identify. We click on the "default" text in the second row, which represents the K-Means algorithm (see Fig. 7.120). Then we use the "Specify" option, and a new dialog window opens (see Fig. 7.121). Here, we can determine the number of clusters. As we have patients that tested both positive and negative, we try to determine two clusters in the data. We define the correct number as shown in Fig. 7.122. After that, we can close all dialog windows.
9. We can start the clustering by clicking on "Run". We then get a model nugget, as shown in the bottom right corner of Fig. 7.123.
10. We double-click on the model nugget. As we can see in Fig. 7.124, three models are offered as long as we do not exclude models with more than two clusters—
Fig. 7.120 Expert tab in the Auto Cluster node
Fig. 7.121 Parameters of the K-Means algorithm in the Auto Cluster node
Fig. 7.122 Number of clusters that should be determined by K-Means
which we are going to enable in Fig. 7.131 later. The K-Means algorithm determines the model with the best silhouette value of 0.373. The model has two clusters, matching the number of different values of the "class_variable", which represents the test result for diabetes ("0 = negative test result" and "1 = positive test result"). This is also true for the model determined by TwoStep.
11. We can now decide which model should be assessed. We discussed the consequences of the variable transformation toward normality for the clustering algorithms in Exercise 3 of Sect. 7.4.4. To outline here the consequences once
Fig. 7.123 Model nugget of the Auto Cluster node
Fig. 7.124 Model results in the Auto Cluster node
more in detail, we will discuss the TwoStep model. We enable it in the first column of the dialog window shown in Fig. 7.124. As we know from discussing TwoStep and K-Means in Sects. 7.3 and 7.4, we can assess the clusters by double-clicking on the symbol in the second column, as in Fig. 7.124. Figures 7.125 and 7.126 show details of the TwoStep model. The model quality is fair, and the averages of the different variables differ between the clusters. Using the mouse, we can see that this is also true for "diabetes_pedigree", with 0.42 in cluster 1 on the left and 0.29 in cluster 2 on the right. The importance of the predictors is also good.
Fig. 7.125 Overview of results for TwoStep clustering in the Model Viewer
Fig. 7.126 Model results from TwoStep clustering in the Model Viewer
"
In the Auto Cluster nugget node, the determined models will be listed if they meet the conditions defined in the “Discard” dialog of the Auto Cluster node.
"
There are many dependencies between the parameters regarding the models to keep, to determine, and to discard; for instance, when the determined number of models to keep is three and the number of models defined to calculate in K-Means is larger. Another example is when models with more than three clusters should be discarded, but in the K-Means node, the user defines models with four or more clusters to determine. So, if not all expected models are presented in the model nugget, the user is advised to verify the restrictions “models to keep” vs. the options in the “discard section”.
"
A model can be selected for usage in the following calculations of the stream, by activating it in the first column of the nugget node.
12. The variable "test_result", also shown as a predictor on the right of Fig. 7.126, is definitely not a predictor. This is obviously an incorrect representation of the settings in the Modeler and not caused by inappropriate parameters in Fig. 7.118; there, we excluded the variable "test_result", which in our model is the evaluation variable. If we close the model viewer and click on the first column of Fig. 7.124, we get a bar chart as shown in Fig. 7.127. Obviously, the patients that tested positive are assigned to cluster 2. So the classification is not very good, but satisfactory.
13. Finally, we can verify the result by adding a Matrix node to the stream and applying a chi-square test of independence to the result. Figure 7.128 shows the final stream. Figure 7.129 shows the parameters of the Matrix node, using a cross tabulation of the cluster result vs. the clinical test result. The chi-square test of independence in Fig. 7.130 confirms our finding: the cluster results are not independent of the clinical test result.
14. So far, we have found that the Auto Cluster node offers a good way to apply different clustering algorithms in one step. The K-Means and TwoStep algorithms seem to produce fair clustering results for the given dataset. By modifying the Auto Cluster node parameters, we now want to hide models, such as the Kohonen model, that are not appropriate. We double-click on the Auto Cluster node and open the Discard tab, as shown in Fig. 7.131. As we know, we want to produce a model that distinguishes two groups of patients, so we set the parameter "Number of clusters is greater than" to the value 2.
15. We run the Auto Cluster node with the new settings and get the results shown in Fig. 7.132.
The Auto Cluster node can help to apply the TwoStep, K-Means, and Kohonen algorithms to the same variables at the same time. Furthermore, it allows testing of
Fig. 7.127 Distribution of the evaluation variable “class_variable”, depending on the cluster
Fig. 7.128 Final stream with an added Matrix node
different model parameters. For instance, in Figs. 7.120, 7.121, and especially Fig. 7.122, multiple cluster numbers can be defined for testing with the K-Means algorithm. The user can determine the parameters of all these models at the same time. So using the Auto Cluster node can help to determine many models and select the best one. The Auto Cluster node can also be used to produce a series of models with different cluster numbers at the same time. Then the user can compare the models
Fig. 7.129 Matrix node settings for producing a contingency table
Fig. 7.130 Chi-square test of independence
Fig. 7.131 Discard tab in the Auto Cluster node
Fig. 7.132 Model results in the Auto Cluster node
and select the most appropriate one. This functionality will be demonstrated in Exercise 1. It is important to note here, however, that all the algorithms must use the same input variables. This is a disadvantage, because the TwoStep algorithm needs transformed variables to meet the assumption of normally distributed values. Using the Euclidean distance measure for TwoStep instead produces a bad model; interested users can test this. We came to the same conclusion in Exercise 3 of Sect. 7.4.4. In summary, the user has to decide whether he/she should avoid the assumption of normally distributed values for TwoStep or produce a model
based on this assumption. The consequence is that the K-Means algorithm, which uses the same variables in the Auto Cluster node, often cannot perform well. We will show in Exercise 1 that using untransformed data will lead to better results for K-Means. "
The Auto Cluster node gives access to details inside the data by determining a set of models, all with different parameters, at the same time. When applying this node to data, the following aspects should be kept in mind:
Advantages
– The node allows applying the K-Means, TwoStep, and Kohonen algorithms at the same time to the same selected variables.
– The user gets a set of possibly appropriate models for the selected variables.
– Several options can be defined for each algorithm individually.
– Restrictions for the identified models, such as the number of clusters and the size of clusters, can be defined in the Discard tab of the Auto Cluster node.
– The Auto Cluster node can be used to produce a series of models with different cluster numbers at once. The user can then compare the models and select the most appropriate one.
Disadvantages
– To meet the TwoStep algorithm's assumption of normally distributed input variables, the variables must be transformed. As all algorithms must deal with the same input variables, the K-Means node then does not perform very well. So the user must often deal with the trade-off between avoiding the normality assumption for TwoStep and producing probably better results with K-Means.
– Usage of the Auto Cluster node needs experience, because many parameters, such as "different numbers of clusters to test", can be defined separately for each algorithm. For experienced users, it is a good option for ranking models. Nevertheless, caution is needed to keep track of all parameters and discard unhelpful options.
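The idea of fitting several clustering algorithms to the same inputs and ranking them can also be sketched outside the Modeler. In the following minimal Python sketch, X is assumed to hold the prepared input variables; Birch and agglomerative clustering merely stand in for hierarchical approaches, since the Modeler's TwoStep and Kohonen implementations are not available in scikit-learn, and the silhouette values are computed from pairwise distances rather than from cluster centers.

```python
# Minimal sketch: fit several clustering algorithms and rank them by silhouette
from sklearn.cluster import KMeans, Birch, AgglomerativeClustering
from sklearn.metrics import silhouette_score

candidates = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=1),
    "Birch": Birch(n_clusters=2),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
}

ranking = []
for name, model in candidates.items():
    labels = model.fit_predict(X)
    ranking.append((round(silhouette_score(X, labels), 3), name))

for score, name in sorted(ranking, reverse=True):
    print(name, score)
```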
7.5.3 Exercises
Exercise 1: K-Means for Diabetes Dataset
In this section, we used the Auto Cluster node to find an appropriate algorithm that determines two segments of patients in the diabetes dataset. Judging from these results, the K-Means algorithm produces the best result in terms of the silhouette value. In this exercise, you should assess this model. Furthermore, you should examine the dependency between the silhouette value and the number of clusters to separate.
1. Open the stream "cluster_diabetes_auto.str" and save it under another name.
2. Using the Auto Cluster node, you should determine the best K-Means model with two clusters and assess the result. Bear in mind the findings related to the transformed variables in Sect. 7.5.2.
3. Based on the model identified in step 2, now determine K-Means models with between 2 and 10 clusters, and compare their quality in terms of the silhouette value.
Exercise 2: Auto Clustering the Diet Dataset
In this exercise, the functionality of the Auto Cluster node should be applied to the diet dataset (see Sect. 12.1.28). We discussed TwoStep clustering in great detail in Exercise 3 of Sect. 7.3.4. The solution details for the application of TwoStep to the diet dataset can be found in Sect. 7.3.5. Using the diet types "vegetarian", "low meat", "fast food", "filling", and "hearty", consumers were asked "Please indicate which of the following dietary characteristics describe your preferences. How often do you eat . . .". The respondents had the chance to rate their preferences on a scale of "(very) often", "sometimes", and "never". The variables are coded as follows: "1 = never", "2 = sometimes", and "3 = (very) often". They are ordinally scaled.
1. The data can be found in the SPSS Statistics file "nutrition_habites.sav". The stream "Template-Stream_nutrition_habits" uses this dataset. Save the stream under another name.
2. Use an Auto Cluster node to determine useful algorithms for determining consumer segments. Assess the quality of the different cluster algorithms.
7.5.4 Solutions
Exercise 1: K-Means for Diabetes Dataset
Name of the solution streams: Stream: cluster_diabetes_K_means; Microsoft Excel: "cluster_diabetes_K_means.xlsx"
Theory discussed in section: Sect. 7.4.2
1. The name of the solution stream is "cluster_diabetes_K_means". The parameters of the Auto Cluster node included in the stream can also be used here, but first we should disable the TwoStep and the Kohonen models (Fig. 7.133).
2. In Sect. 7.5.2, we assessed the TwoStep model. The quality was fair and could be improved by using other input variables than those determined in Fig. 7.118, because the K-Means algorithm does not need normally distributed variables. So we replace the transformed variables with the original, untransformed ones (see Fig. 7.134). With only K-Means enabled, we run the stream to determine the segmentation. As shown in Fig. 7.135, we get just one model to assess. The quality of the model based on the untransformed variables is better. The silhouette value
Fig. 7.133 Modified parameters in the Auto Cluster node
Fig. 7.134 Modified variable selection in the Auto Cluster node
Fig. 7.135 K-Means model with two clusters in the model nugget
Fig. 7.136 Model summary in the K-Means node
increases from 0.373, found in Fig. 7.132, to 0.431. To start the model viewer, we double-click on the model nugget in the second column. More detailed K-Means results are shown in Fig. 7.136. Based on the silhouette value of 0.431, the clustering quality is assessed as good. The importance
Fig. 7.137 Model details from K-Means clustering in the Model Viewer
of the variables in Fig. 7.137 changes in comparison to the results found in Fig. 7.126. We can now investigate the model quality using the Matrix node. In Fig. 7.138, we illustrate the frequency distribution of the diabetes test results per cluster. In comparison with the results of the Auto Cluster node shown in Fig. 7.130, we can state that here, too, cluster 1 represents the patients that tested negative. With its frequency of 212 patients, against 196 in the TwoStep model, the K-Means model is of better quality in terms of test results correctly assigned to the clusters.
3. To make the stream easier to understand, we can add a comment to the Auto Cluster node (see Fig. 7.139). Then we copy the Auto Cluster node and paste it. After that, we connect the new node to the Type node. Figure 7.139 shows the actual status of the stream. We double-click on the pasted Auto Cluster node and activate the "Model" tab. It is important to define the number of models to keep. Here, we want to determine nine models, so this number must be set to at least nine (see Fig. 7.140).
Fig. 7.138 Frequency distribution of “class_variable” per cluster
Fig. 7.139 Stream with copied Auto Cluster node
In the original stream, we only determined models with two clusters. Here, we have to remove this restriction in the Discard tab of the Auto Cluster node. Otherwise, the other models will not be saved (see Fig. 7.141). To determine models with different cluster numbers, we activate the Expert tab (see Fig. 7.142). Then we click on "Specify . . .". As shown in Fig. 7.143, this opens the dialog window in Fig. 7.144. Here, we define models with between 2 and 10 clusters to determine. This option has an advantage over the possibilities offered by the standalone K-Means node, where we can specify only one number of clusters at a time. In
Fig. 7.140 Model dialog in the Auto Cluster node
Fig. 7.141 Modified Discard options in the Auto cluster node
Fig. 7.142 Expert tab in the Auto Cluster node
Fig. 7.143 Defining the number of clusters to determine in the K-Means part of the Auto Cluster node
that way, the Auto Cluster node is more convenient for comparing different models with different cluster numbers. We can close all dialog windows. We do not need to add a Matrix node, because we only want to compare the different models based on their silhouette values. Figure 7.145 shows the final
Fig. 7.144 Number of clusters to determine
Fig. 7.145 Final stream with two Auto Cluster nodes
stream, with an additional comment added. We can now run the second Auto Cluster node. Assessing the results in Fig. 7.146, we can see that the model quality decreases if we try to determine models with more than two clusters. This confirms the model
Fig. 7.146 Auto Cluster node results for different numbers of clusters
Fig. 7.147 Silhouette value vs. number of clusters
quality determined in part 2 of the exercise. Furthermore, we can say that the algorithm indeed tries to determine segments that describe patients suffering or not suffering from diabetes. Finally, Fig. 7.147 illustrates the dependency of the clustering quality, in terms of the silhouette value, on the number of clusters determined. We explained in Exercise 5 of Sect. 7.4.4 how to get the precise silhouette value in the model viewer. The data can be found in the Excel file "cluster_diabetes_K_means.xlsx". In the chart, we can see that the quality decreases dramatically as more clusters are determined.
Exercise 2: Auto Clustering the Nutrition Dataset
Name of the solution streams: cluster_nutrition_habits_auto
Theory discussed in section: Sect. 7.5 on Auto Cluster node usage; Sect. 7.3.4, Exercise 3, clustering with TwoStep; Sect. 7.3.5, solution to Exercise 3 (characteristics of clusters determined with the TwoStep algorithm)
Remark The TwoStep node is implemented in the Auto Cluster node. The log-likelihood distance measure requires normally distributed continuous variables and multinomially distributed categorical variables. In Sect. 7.5.2, we found that using the Euclidean distance measure does not result in better models. Therefore, we use the untransformed data in combination with the log-likelihood measure here.
1. We open the template stream "Template-Stream_nutrition_habits" and save it under another name. The solution has the name "cluster_nutrition_habits_auto".
2. We add an Auto Cluster node to the stream from the Modeling tab and connect it with the Type node. In the Fields tab of the node, we define all five available variables as input variables. Figure 7.148 shows this too. No other modifications in the Auto Cluster node are necessary, so we can start the clustering process.
3. Double-clicking on the model nugget shown in Fig. 7.149, we get the results. The three models presented in Fig. 7.150 have five or six clusters. The model determined with TwoStep equals the solution that we extensively discussed in Exercise 3 of Sect. 7.3.5; an assessment of the clustering can be found there. Table 7.27 shows a summary of the cluster descriptions.
Fig. 7.148 Fields tab in the Auto Cluster node
Fig. 7.149 An Auto Cluster node is added to the template stream
Fig. 7.150 Auto Cluster results
Of special interest here is the result produced with the Kohonen algorithm. Its frequency distribution is shown in the first column of Fig. 7.150, and its details in the model viewer in Fig. 7.151. Normally, the Kohonen algorithm tends to produce models with too many clusters. Here, with its six clusters and a silhouette value of 0.794, it is a good and useful model. The predictor importance on the right of Fig. 7.151, and the characteristics of the different clusters shown on the left of Fig. 7.151, can also be used for consumer segmentation purposes. Based on this short assessment, we have found an alternative model to the TwoStep clustering presented in the solution to Exercise 3 of Sect. 7.3.5.
Fig. 7.151 Results of Kohonen clustering in the model viewer
7.6 Summary
Cluster algorithms represent unsupervised learning procedures for identifying segments in data with objects that have similar characteristics. In this section, we explained in detail how the similarity or dissimilarity of two objects can be determined; we called the measures used "proximity measures". Based on a relatively simple example of car prices, we saw how clustering works in general. The TwoStep algorithm and the K-Means algorithm were then used to identify clusters in different datasets. TwoStep represents the hierarchical agglomerative, and K-Means the partitioning, clustering algorithms. Using both procedures, and the Kohonen procedure, we introduced the usage of the Auto Cluster node in the Modeler. Summarizing all our findings, we can state that clustering data often entails more than one attempt at finding an appropriate model. Not least, the number of clusters must be determined based on practical knowledge. "
The Modeler offers the K-Means, TwoStep, and Kohonen nodes for identifying clusters in data. K-Means is widely used and partitions the dataset by assigning the records to identified cluster centers. The user has to define the number of clusters in advance.
"
This is also necessary when using the Kohonen node. This node is an implementation of a two-layer neural network. The Kohonen algorithm transforms a multidimensional input vector into a two-dimensional space. Vectors presented to the network are assigned to a pattern previously recognized in the data during a learning period.
"
The TwoStep implementation in the SPSS Modeler is the easiest-to-use clustering algorithm. It can deal with all scale types and standardizes the input values. Moreover, the TwoStep node identifies the most appropriate number of clusters that represents the data structure best. This recommended number is generated automatically and can be used as a good starting point for finding the optimal clustering solution. The user must bear in mind, however, that the log-likelihood distance measure assumes normally distributed variables, but transforming variables often leads to unsatisfying results.
"
The Auto Cluster node can be used with the TwoStep, K-Means, and Kohonen algorithms all running in the background. This gives the user a chance to find the best algorithm with reference to the structure of the given data. This node is not as easy to use as it seems, however. Often not all appropriate models are identified, so we recommend using the separate TwoStep, K-Means, and Kohonen model nodes instead.
References
Bacher, J., Wenzig, K., & Vogler, M. (2004). SPSS TwoStep Cluster—A first evaluation. Accessed May 07, 2015, from http://www.statisticalinnovations.com/products/twostep.pdf
Backhaus, K. (2011). Multivariate Analysemethoden: Eine anwendungsorientierte Einführung (13th ed.). Berlin: Springer.
Bühl, A. (2012). SPSS 20: Einführung in die moderne Datenanalyse, Scientific tools (13th ed.). München: Pearson.
Ding, C., & He, X. (2004). K-means Clustering via Principal Component Analysis. Accessed May 18, 2015, from http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf
Handl, A. (2010). Multivariate Analysemethoden: Theorie und Praxis multivariater Verfahren unter besonderer Berücksichtigung von S-PLUS, Statistik und ihre Anwendungen (2nd ed.). Heidelberg: Springer.
IBM. (2015a). SPSS Modeler 17 Algorithms Guide. Accessed September 18, 2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/AlgorithmsGuide.pdf
IBM. (2015b). SPSS Modeler 17 Modeling Nodes. Accessed September 18, 2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerModelingNodes.pdf
IBM. (2015c). SPSS Modeler 17 Source, Process, and Output Nodes. Accessed March 19, 2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerSPOnodes.pdf
IBM Website. (2014). Customer segmentation analytics with IBM SPSS. Accessed May 08, 2015, from http://www.ibm.com/developerworks/library/ba-spss-pds-db2luw/index.html
Kohonen, T. (2001). Self-organizing maps (Springer Series in Information Sciences) (Vol. 30, 3rd ed.). Berlin: Springer.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. London: Academic Press.
Murty, M. N., & Devi, V. S. (2011). Pattern recognition: An algorithmic approach, undergraduate topics in computer science. London: Springer.
Struyf, A., Hubert, M., & Rousseeuw, P. J. (1997). Integrating robust clustering techniques in S-PLUS.
Tavana, M. (2013). Management theories and strategic practices for decision making. Hershey, PA: Information Science Reference.
Timm, N. H. (2002). Applied multivariate analysis, Springer texts in statistics. New York: Springer.
Vogt, W. P., Vogt, E. R., Gardner, D. C., & Haeffele, L. M. (2014). Selecting the right analyses for your data: Quantitative, qualitative, and mixed methods. New York: The Guilford Press.
8 Classification Models
In Chap. 5, we dealt with regression models, the first group of so-called supervised models, and applied them to datasets with continuous, numeric target variables, the common data for these kinds of models. We then dedicated Chap. 7 to unsupervised data and described various cluster algorithms. In this chapter, we return to the supervised methods and attend to the third big group of data mining methods, the classification algorithms. Here, we are confronted with the problem of assigning a category to each input variable vector. As the name suggests, the target variable is categorical, e.g., "Patient is ill" or "Patient is healthy". As in this example, the possible values of the target variable of a classification problem often cannot be ordered. These kinds of problems are very common in all kinds of areas and fields, such as Biology, Social Science, Economics, Medicine, and Computer Science; almost any problem you can think of that involves deciding between different kinds of discrete outcome. Figure 8.1 shows the outline of the chapter structure. In the first section, we describe in more detail some real-life classification problems where data mining and classification methods are useful. After these motivating examples, we explain in brief the concept of classification in data mining, using the basic mathematical theory that all classification algorithms have in common. Then, in the remaining sections, we focus individually on the most famous classification methods provided by the SPSS Modeler and describe their usage on data examples. A detailed list of the classification algorithms discussed here is displayed in the subsequent Fig. 8.6. Besides the algorithms presented in this book, further classification models are included in the SPSS Modeler. However, a description of these models and their corresponding nodes in the Modeler is omitted, as this would exceed the scope of this book. See Sect. 8.2.2 for more details on these left-out methods and references to the literature. After finishing this chapter, the reader . . .
1. is familiar with the most common challenges when dealing with a classification problem and knows how to handle them.
Fig. 8.1 Outline of the chapter structure
2. possesses a large toolbox of different classification methods and knows their advantages and disadvantages.
3. is able to build various classification models with the SPSS Modeler and is able to apply them to new data for prediction.
4. knows various validation methods and criteria and can evaluate the quality of the trained classification models within the SPSS Modeler stream.
8.1 Motivating Examples
Classification methods are needed in a variety of real-world applications and fields, and in many cases they are already utilized. In this chapter, we present some of these applications as motivating examples.
Example 1 Diagnosis of Breast Cancer
To diagnose breast cancer, breast mass is extracted from numerous patients by fine needle aspiration. Each sample is then digitalized into an image from which different features can be extracted; among others, these include the size of the cell nuclei or the number of mutated cells. The feature records of the patients, together with the target variable (cancer or no cancer), are then used to build a classification model. In other words, the model learns the configuration and structure of the features for tumorous and non-tumorous tissue samples. Now, for a new patient, doctors can easily decide whether or not breast cancer is present, by establishing the abovementioned features and putting these into the trained classifier. This is a classical application of a classification algorithm, and logistic regression is a standard method used for this problem. See Sect. 8.3 for a description of logistic regression. An example dataset is the Wisconsin Breast Cancer Data (see Sect. 12.1.40). Details on the data, as well as the research study, can be found in Wolberg and Mangasarian (1990).
Example 2 Credit Scoring of Bank Customers
When applying for a loan at a bank, your creditworthiness is calculated based on some personal and financial characteristics, such as age, family status, income, number of credit cards, or amount of existing debt. These variables are used to estimate a personal credit score, which indicates the risk to the bank when giving you a loan.
An example of credit scoring data is the "tree_credit" dataset. For details, see Sect. 12.1.38.
Example 3 Mathematical Component of a Sleep Detector (Prediction of Sleepiness in EEG Signals)
Drowsiness brings with it a lesser ability to concentrate, which can be dangerous and should be avoided in some everyday situations. For example, when driving a car, severe drowsiness is the precursor of microsleep, which can be life-threatening. Moreover, we can think of jobs where high concentration is essential and a lack of concentration can lead to catastrophes, for example, the work of plane pilots, surgeons, or technical observers at nuclear reactors. This is one reason why scientists are interested in detecting different states in the brain, to understand its functionality. For the purpose of sleep detection, EEG signals are recorded in different sleep states, drowsiness, and full consciousness, and these signals are analyzed to identify patterns that indicate when drowsiness may occur. The EEG_Sleep_Signals.csv dataset (see Sect. 12.1.12) is a good example, which is analyzed with an SVM in Sect. 8.5.2.
Example 4 Handwritten Digits and Letter Recognition
When we send a letter through the mail, this letter is gathered together with a pile of other letters. The challenge for the post office is to sort this huge mass of letters by their destination (country, city, zip code, street). In former days, this was done by hand, but nowadays computers do this automatically. These machines scan the address on each letter and allocate it to the right destination. The main challenge is that handwriting differs noticeably between different people. Today's sorting machines use an algorithm that is able to recognize alphabetical characters and numbers from individual people, even if it has never seen the handwriting before. This is a fine example of a machine learning model, trained on a small subset and able to generalize to the entire collective. Other examples where automatic letter and digit identification can be relevant are signature verification systems or bank-check processing. The problem of handwritten character recognition falls into the more general area of pattern recognition. This is a huge research area with many applications, e.g., automatic identification of the correct image of a product in an online shop or analysis of satellite images. The optical recognition of handwritten digits data obtained from the UCI Machine Learning Repository (1998) is a good example. See also Sect. 12.1.29.
Other Examples and Areas Where Classification Is Used
• Sports betting. Prediction of the outcome of a sports match, based on the results of the past.
• Determining churn probabilities. For example, a telecommunication company wants to know if a customer has a high risk of switching to a rival company. In this case, they could try to avoid this with an individual offer to keep the customer.
• In the marketing area: to decide if a customer in your database has a high potential of responding to an e-mail marketing campaign. Only customers with a high response probability are contacted.
8.2 General Theory of Classification Models
As described in the introduction to this chapter, classification models are dedicated to categorizing samples into exactly one category. More precisely, let y be the observation or target variable that can take one of a finite number of values A, B, C, . . .. Recalling the motivating examples in the previous section, these values can, for example, describe a medical diagnosis, indicate whether an e-mail is spam, or discern the possible meaning of a handwritten letter. As these examples show, the observation values do not have to be in any kind of order or even numeric. Based on some input variables x_i1, . . ., x_ip of a sample i, a classification model now tries to determine the value of the observation y_i and thus categorize the data record. For example, if some breast tissue sample has unusual cell sizes, this could likely be cancer. Or, if the subject of an e-mail contains some words that are common in spam, this e-mail is probably a spam mail too. In this chapter, we give a brief introduction to the general theory of classification modeling.
8.2.1 Process of Training and Using a Classification Model
The procedure for building a classification model follows the same concept as regression modeling (see Chap. 5). As described in the regression chapter, the original dataset is split into two independent subsets, the training and the test set. A typical partition is 70% training data and 30% test data. The training data is used to build the classifier, which is then applied to the test set for performance evaluation on unknown data. Using this separate test dataset is the most common way to verify the universality of the classifier and its prediction ability in practice. This process is called cross-validation (see Sect. 5.1.2 or James et al. (2013)). We thus recommend always using this training and test set framework when building a classifier. Often, some model parameters have to be predefined. These, however, are not always naturally given and have to be chosen by the data miner. To find the optimal parameter, a third independent dataset is used, the validation set. A typical partition of the original data is then 60% training, 20% validation, and 20% test set. After training several classifiers with different parameters, the validation set is used to evaluate these models in order to find the one with the best fit (see Sect. 8.2.5 for evaluation measures). Afterwards, the winner of the previous validation is applied to the test set, to measure its predicting performance on new data. This last step is necessary to eliminate biases that might have occurred in the training and validation steps. For example, it is possible that a particular model performs very well on the
Fig. 8.2 Overview of the steps when building a classification model
validation set, but is not very good on other datasets. In Fig. 8.2, the process of cross-validation and building a prediction model is illustrated. We would further like to mention that, in the area of big data, where datasets contain millions of samples, the size of the test and validation set can be chosen much smaller. In this situation, a size of 10% or lower for the validation and test set is not unusual, as these partitions still contain enough data to evaluate the model. Even a size of 1% is possible if sufficient data are available. However, the test and validation sets have to be chosen large enough so that the model is properly validated.

The general idea behind a classification model is pretty simple. When training a classification model, the classification algorithm inspects the training data and tries to find a common structure in data records with the same target value and differences between data records of different target values. In the simplest case, the algorithm converts these findings into a set of rules, such that the target classes are characterized in the best possible way through these “if . . . then . . .” statements. In Fig. 8.3, this is demonstrated with a simple data example. There, we want to predict if we can play tennis this afternoon based on some weather data. The classification algorithm transforms these data into a set of rules that define the classifier. For example, if the outlook is “Sunny” and the temperature is greater than 15 °C, we will play tennis after work. If otherwise the outlook is “Rain” and the wind prognosis is “strong”, we are not going to play tennis.

Applying a classifier to new and unseen data now simply becomes the assignment of a class (target value) to each data record using the set of rules. For example, in Fig. 8.4, the classifier that suggests whether we can play tennis is applied to new data. For day 10, where the Outlook = “Sunny”, Temperature = 18 °C, and
Fig. 8.3 Training of a classification model
Fig. 8.4 Prediction with a classification model
Wind = “strong”, the classifier predicts that we can play tennis in the afternoon, since these variable values fulfill the first of its rules (see Fig. 8.3). Many classification models preprocess the input data via a numeric function, which is used to transform and quantify each data record. Each data record is then assigned a so-called score, which can be interpreted as the probability of each class, or as some distance. When training the model, this scoring function and its parameters are determined, and a set of decision rules for the function’s value is generated. The class of unseen data is now predicted by calculating the score with the scoring function and assigning the target class suggested by the score. See Fig. 8.5 for an illustration of a classification model with an internal scoring function.
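To make the rule-based view concrete, the two tennis rules of Figs. 8.3 and 8.4 can be written down directly as code. The following is a minimal sketch in Python, not the Modeler's internal representation; the field names and the default class for records covered by no rule are assumptions made purely for illustration.

```python
# Minimal sketch of a rule-based classifier, mirroring the tennis rules of
# Figs. 8.3 and 8.4. Field names and the default class are illustrative assumptions.
def play_tennis(record):
    """Apply the 'if ... then ...' rules to one data record."""
    if record["Outlook"] == "Sunny" and record["Temperature"] > 15:
        return "yes"   # rule 1: sunny and warmer than 15 degrees -> play tennis
    if record["Outlook"] == "Rain" and record["Wind"] == "strong":
        return "no"    # rule 2: rainy and strong wind -> do not play
    return "no"        # assumed default class for records no rule covers

# Day 10 of Fig. 8.4: Outlook = "Sunny", Temperature = 18, Wind = "strong"
print(play_tennis({"Outlook": "Sunny", "Temperature": 18, "Wind": "strong"}))  # -> "yes"
```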
8.2.2 Classification Algorithms
A variety of classification algorithms and concepts exist, with different approaches and techniques to handle the data and find classification rules. These methods can be roughly grouped into three types: linear, nonlinear, and rule-based algorithms. The models of the first two types follow a mainly mathematical approach and generate
Fig. 8.5 Classification model with scoring function
Fig. 8.6 Overview of the classification algorithms presented in this book
functions for scoring and separation of the data and classes. While linear methods try to separate the different classes with linear functions, nonlinear classifiers can construct more complex scoring and separating functions. In contrast to these mathematical approaches, the rule-based models search the input data for structures and commonalities without transforming the data. These models generate “if . . . then . . .” clauses on the raw data itself. Figure 8.6 lists the most important and favored classification algorithms that are also implemented in the SPSS Modeler and discussed in this book. Chapters where the particular models are discussed are shown in brackets. In addition to the models presented in this book, the SPSS Modeler provides further classification models and nodes, e.g., Random Tree and Bayesian Network. However, a description of these models would exceed the scope of this book and is therefore omitted. We refer the interested reader to Tuffery (2011), Han et al. (2012), and IBM (2019b) resp. IBM (2019a) for information on the nodes in the SPSS Modeler and algorithm details. Among these algorithms, the random forest is the most important one, with very powerful prediction abilities. It is highly popular in the data mining community, and thus often used for classification problems. A
description of the method and how a random forest model can be trained with the SPSS Modeler is provided through the corresponding website of this book, “http://www.statistical-analytics.net”. See Chap. 1 for details on how to access the content on the website related to this book. The SPSS Modeler further comprises nodes for building boosting models, e.g., XGBoost Tree. Ensemble methods, and in particular boosting, are a common concept in classification (see Sect. 8.8.1). Since many of the regular nodes, which are discussed in this book, themselves provide options to build ensemble and boosting models, we skip a description of the nodes explicitly dedicated to boosting and refer to IBM (2019b) for details on these nodes, and Zhou (2012) for the boosting concept. Moreover, some classification algorithms are implemented in the SPSS Modeler 18.2 to run on the SPSS Analytics Server (see IBM (2019d)) and provided by the Modeler in separate nodes, like the Tree-AS node. Since these nodes are more or less an adaptation of the nodes described in this chapter, we will ignore them here and point to IBM (2019b) for details.

Each classification model has its advantages and disadvantages. There is no method that outperforms every other model for every kind of data and classification problem. The right choice of classifier strongly depends on the data type and size. For example, some classifiers are more applicable to small datasets, while others are more accurate on large data. Other considerations when choosing a classifier are the properties of the model, such as accuracy, robustness, speed, and interpretability of the results. A very accurate model is never bad, but sometimes a robust model, which can be easily updated and is insensitive to strange and unusual data, is more important than accuracy. In Table 8.1, some advantages and disadvantages of the classification algorithms are listed to give the reader guidelines for the selection of an appropriate method. See also Tuffery (2011). However, we strongly recommend trying different approaches and methods on each dataset in order to see what will work best for the particular problem and data. The SPSS Modeler provides an easy way for this task of building and comparing different models on one dataset, the Auto Classifier node, which will be explained in detail in Sect. 8.9.2.
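Inside the Modeler, this comparison is automated by the Auto Classifier node. Outside of it, the same "train several methods and compare" idea can be sketched, for example, with scikit-learn; the library, the synthetic dataset, and the chosen models below are assumptions for illustration only.

```python
# Sketch: compare several classifiers on one dataset, in the spirit of the
# Auto Classifier node. Requires scikit-learn; the synthetic data is illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)      # 70% training, 30% test partition

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    "k-nearest neighbor": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)                               # build the classifier
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```

Which model "wins" depends entirely on the data at hand, which is exactly the point of trying several approaches.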
8.2.3 Classification Versus Clustering
Recalling the discussion in Sect. 7.2, a clustering algorithm is an unsupervised learning method. This means that the data points are not labeled, and hence, there is no target variable present. Clustering is therefore used to group the data points, based on their similarities and dissimilarities. The purpose is to find new patterns and structures within the data. Recall the motivating example of Customer segmentation in Sect. 7.1; customer segmentation is often used by companies to identify high potential and low potential customers. For example, banks cluster their customers into groups to identify the customers with high or low risk of loan default. As described in the previous sections, classification has a different purpose. When using a classifier, the data points are labeled. In the previous example of loan default risk, the bank already has a dataset of customers, each of them labeled “risky” or
Table 8.1 Overview of advantages and disadvantages of the classification algorithms

Logistic regression
+ Models are often very accurate • Works well on small datasets • Predicts probabilities • Easy to interpret, in particular, the influence of each input variable • Typically, very fast building the model
− It can only provide linear solutions • Problems with high collinearity of the input variables

Linear discriminant analysis
+ Works well on small datasets • Optimal if data assumptions are fulfilled
− More restrictive assumptions than other methods (e.g., logistic regression) • Usually, needs data preparation • Sensitive to outliers • Only applicable to linear problems

Support vector machine
+ It can easily handle complex (nonlinear) data and a large feature space • Typically a high accuracy • Less susceptible to overfitting than other methods
− Long training time for large datasets • Model is hard to interpret (black box algorithm) • Sensitive with regard to the choice of kernel function and parameter • Choosing the wrong kernel and parameters can risk overfitting • Sensitive to outliers

Neural network
+ Good performance on large datasets • Very good at allowing nonlinear relations and can generate very complex decision boundaries • Nonparametric. No distribution assumptions needed • Can handle noisy data • Often outperforms other methods
− Training can be computationally expensive • Results and effects of input variables are hard to interpret (black box algorithm) • Tends to overfitting • Does not always find the optimal solution • Can only process continuous input variables

K-Nearest neighbor
+ Very fast training • Complex concepts can be learned through simple procedures • Intuitive algorithm • Only few parameters to tune
− Requires large storage space • Performance depends on the number of dimensions (curse of dimensionality) • Problem with highly imbalanced data • No interpretation of results possible • Slow in prediction for large datasets • Sensitive to noisy data. Typically, needs scaling and feature selection preprocessing

Decision trees
+ Robust to outliers • Model and decision rules are easy to understand • Can handle different data types, i.e., numerical and categorical variables • Fast in prediction • No assumptions on variable distributions needed. Thus, less effort for data preparation • Can handle missing values
− Can be computationally expensive to train • Large trees tend to overfitting • Most of the time it does not find the optimal solution • Prefers variables with many categories or numerical data

Random Forest
+ It has the same advantages as a decision tree since it comprises multiple trees • Typically results in a better model than one decision tree • Good performance on small datasets • Works great with high dimensionality (many input variables) • Fast training and prediction
− Overfitting is more likely than with other methods • The model is hard to interpret (black box algorithm) • Can take up a lot of memory for large datasets
Fig. 8.7 Clustering vs. classification
“non-risky”, based on each customer’s ability to repay his/her loan. Based on this training data of bank customers, a classification model learns to separate the “risky” from the “non-risky” customers. Afterwards, new customers can be categorized into risky or non-risky, so the bank has a better basis on which to decide whether to give a customer a loan. Due to this labeled training data, classification models belong to the supervised learning algorithms. The model learns from some labeled training data and then predicts the outcome of the target variable based on this training data. In Fig. 8.7, the difference between classification and clustering is visualized.
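A short sketch may make the contrast tangible: a clustering algorithm only sees the input variables and groups the records by similarity, whereas a classifier is trained on the labeled records and can then label new ones. scikit-learn and the synthetic two-group data are assumptions used purely for illustration.

```python
# Sketch of the difference: clustering ignores the labels, classification uses them.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=2, random_state=0)   # y = known labels

# Unsupervised: KMeans only sees X and groups the points by similarity.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: the classifier is trained on (X, y) and can label new customers.
classifier = LogisticRegression().fit(X, y)
print(classifier.predict(X[:5]))   # predicted classes for the first records
print(clusters[:5])                # cluster assignments found without any labels
```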
8.2.4 Decision Boundary and the Problem with Over- and Underfitting
The task (mathematically speaking) of a classification model is basically the following: find a function describing a line (or curve) that separates the different classes from each other in the best possible way. This separating function is called the decision boundary. The classifier then simply assigns a new data point to a class by looking at which side of the decision boundary it is located. For example, in the right graph in Fig. 8.7, the black circles are separated from the white circles by a straight line. A new data point is now classified as a black circle if it lies underneath the line; otherwise, it is labeled as a white circle. This is an easy example of a linear classifier, meaning the decision boundary is a straight line, in a binary classification problem. However, the decision boundary can be nonlinear and highly complex, separating multiple classes from each other (see Fig. 8.8).

Fig. 8.8 Decision boundary of a three class classifier

With the discussion of the decision boundary in mind, a perfect classifier would be one separating the classes and making no misclassification, i.e., all data points lie on the correct side of the decision boundary. Unfortunately, a perfect classifier with no misclassification is nearly impossible. Now, one can justifiably interject that nearly every dataset of differently labeled points can be separated perfectly using a highly complex function. This is undoubtedly true. However, a complex decision boundary might perfectly classify the existing data, but if a new and unknown data point has to be classified, it can easily lie on the wrong side of the decision boundary, and thus, be misclassified. The perfect classifier no longer exists. The problem here is that, instead of describing the underlying structure of the data, the classifier models the noise and thus suffers from overfitting. The model is inappropriate to apply to new data, even though it perfectly classified the training data. Thus, we have to reduce our expectations and tolerate a few misclassifications in favor of a simple decision boundary and a more universal model; see Fig. 8.8 for an example. See Kuhn and Johnson (2013) for more information on the decision boundary and overfitting, as well as Sect. 5.1.2.

These considerations of overfitting are illustrated in Fig. 8.9. There, the decision boundary in the graph on the left separates the squares from the circles perfectly. In the middle graph, a new data point is added, the filled square, but it lies on the circle side of the boundary. Hence, the decision boundary is no longer optimal. The model
Fig. 8.9 Illustration of the problem of overfitting, when choosing a decision boundary
has not identified the circle in the middle as noise, and thus suffers from overfitting. So, the linear boundary in the graph on the right is as good as the boundary in the middle graph, but describes the underlying structure of the data in a more suitable and, moreover, simpler way.

The common way to detect overfitting is cross-validation. If the model performs much better on the training set than on the test set, overfitting might be the problem (see Sect. 8.2.5 for a list of evaluation measures). So, in order to avoid overfitting, we recommend always using a training/testing setting and potentially using a separate validation set too. If overfitting occurs, there are some techniques that can be utilized to approach this problem:

• Reduce the number of input variables. This leads to a simpler decision boundary, and thus, a more general model.
• Reduce the number of iterations during training. If the algorithm is using an iterative approach during training, it improves the fit of the model to the training data in each iteration step. Stopping this process in an early iteration run can reduce the complexity of the resulting model and thus prevent overfitting.
• Use ensemble methods. Ensemble methods combine multiple separate models and approaches. Hence, the prediction and the fit do not depend on only one model. The most famous ensemble techniques are Boosting and Bagging (see Sect. 5.3.6).

The opposite of a too complex model, which can result in overfitting, is a too simple model that does not capture the underlying data structure sufficiently. This is called underfitting, which is illustrated in Fig. 8.10. There, in the left graph, the squares are separated from the circles by a straight line. However, the squares are more concentrated in a cloudy formation and a circle-shaped decision boundary is more appropriate here, as it better describes the underlying structure of the data (see right graph). An indicator for models that suffer from underfitting is a bad performance on the training set. In this case, the model is probably too simple, and a switch to a more complex algorithm, e.g., nonlinear instead of a linear approach, can improve the prediction performance and overcome underfitting.
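The training/test comparison used to detect over- and underfitting can be sketched as follows: the complexity of a model (here, the depth of a decision tree) is increased step by step, and the accuracy on the training and test partitions is compared. scikit-learn and the synthetic, noisy dataset are assumptions for illustration only.

```python
# Sketch: detect under- and overfitting by comparing training and test accuracy
# while the model complexity (here, the tree depth) grows.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, flip_y=0.2,
                           random_state=0)            # noisy two-class data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for depth in (1, 3, 5, 10, None):                      # None = grow the full tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    # A shallow tree scoring badly on both sets hints at underfitting; a deep tree
    # with a large gap between training and test accuracy hints at overfitting.
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```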
Fig. 8.10 Illustration of the problem of underfitting

Table 8.2 A confusion matrix gives a good overview of the possible classification results

                         Predicted category
Actual category          Yes                   No
Yes                      True positive (TP)    False negative (FN)
No                       False positive (FP)   True negative (TN)

8.2.5 Performance Measures of Classification Models
Various measures and methods exist for the evaluation of a classification model. Here, we discuss the most common ones implemented in the SPSS Modeler.

Confusion Matrix
In the case of a binary target variable, i.e., a “yes” or “no” decision, four possible events can occur when assigning a category to the target variable via classification.

• True positive (TP). The true value is “yes” and the classifier predicts “yes”. A patient has cancer and cancer is diagnosed.
• True negative (TN). The true value is “no” and the classifier predicts “no”. A patient is healthy and no cancer is diagnosed.
• False positive (FP). The true value is “no” and the classifier predicts “yes”. A patient is healthy, but cancer is diagnosed.
• False negative (FN). The true value is “yes” and the classifier predicts “no”. A patient has cancer, but no cancer is diagnosed.
These four possible classification results are typically structured in a confusion matrix, which displays the number of TP, TN, FP, and FN for the training resp. test set (see Table 8.2). When predicting the target variable, two errors can be made: FP and FN prediction. These errors are also often referred to as type I and type II error, in relation to classical hypothesis tests and their nomenclature, where these two errors play an
important role for the design of a statistical test. We refer the interested reader to Dekking et al. (2005) for further information.

Accuracy and Metrics Derived from the Confusion Matrix
A model is said to be a good classifier if it predicts the true value of the outcome variable with high accuracy, that is, if the proportion of TPs and TNs is high, and the number of FPs and FNs is very low. So, the obvious and most common measure for evaluating a classifier is the ratio of correctly identified data points, the accuracy

ACC = (TP + TN)/(P + N) = (TP + TN)/(TP + FN + TN + FP),

with P = TP + FN, the number of actual positive (“yes” or 1) values, and N = TN + FP, the number of actual negative (“no” or 0) values. Its complement is the rate of misclassified instances, called the classification error, in formula 1 − ACC, which should be small for a well-fitted model. Further measures indicating the quality of the model and its ability to classify the individual classes correctly are the true positive rate (TPR), false negative rate (FNR), true negative rate (TNR), and false positive rate (FPR):

TPR = TP/P = TP/(TP + FN),   FNR = FN/P = FN/(TP + FN)

and

TNR = TN/N = TN/(TN + FP),   FPR = FP/N = FP/(TN + FP).
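Computed from the four confusion matrix counts, these rates take only a few lines of code; a minimal sketch with hypothetical counts:

```python
# Minimal sketch: compute the rates defined above from the four confusion
# matrix counts. The example counts are hypothetical.
def rates(tp, fn, tn, fp):
    p, n = tp + fn, tn + fp                 # actual positives and negatives
    return {
        "ACC": (tp + tn) / (p + n),
        "TPR": tp / p,                      # recall / sensitivity / hit rate
        "FNR": fn / p,
        "TNR": tn / n,                      # specificity
        "FPR": fp / n,                      # false alarm rate
    }

print(rates(tp=40, fn=10, tn=45, fp=5))
# {'ACC': 0.85, 'TPR': 0.8, 'FNR': 0.2, 'TNR': 0.9, 'FPR': 0.1}
```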
The TPR, also called recall, hit rate, or sensitivity, is defined as the ratio of positive values (or 1s) that are correctly classified by the model from all actual 1s. In other words, the TPR describes the probability that a positive value data record is classified as such. So, a model with a high TPR, i.e., near 1, has a high chance of predicting a positive value record correctly. On the other hand, if the TPR is low, i.e., near 0, then the model is inadequate at finding the class 1 data, and further parameter tuning or data preparation needs to be considered in order to improve the TPR of the model.

The TNR, also called specificity, is the same metric as the TPR, but for the negative value class (or 0s). More precisely, it is defined as the ratio of negative values (or 0s) that are correctly classified by the model from all actual 0s, and so, describes the probability that a data record labeled as 0 is categorized as 0 by the model.

We would like to mention that a single one of the four measures is unable to describe the model entirely, and thus, multiple rates should be included in the evaluation process. A high TPR, for instance, doesn’t say anything about the capability of the model to detect the negative class, i.e., the 0s. Let us consider the following example: The goal of a model is to classify a tissue sample as tumorous or healthy. But,
because of our fear of missing a cancer, we declare every sample as tumorous, including the healthy ones. This procedure leads to a TPR of exactly 1 (all actual positive values are correctly classified), but simultaneously, it generates a lot of FPs or false alarms, i.e., all actual negative valued samples are classified as tumorous. This results in a TNR of 0 resp. FPR of 1. The model is unable to detect any healthy patients and separate the two classes properly. This is an example of the “cry wolf” expression, which refers to a shepherd who cries out “Wolf!” just to entertain himself, even though no wolf is in sight. But, on the day a real wolf is approaching, no one believes him, because of his many false alarms in the past. The above example describes a poor model, as it completely ignores one class (TPR = 1 and FPR = 1), whereas a good classifier, on the other hand, has a high TPR as well as a low FPR. In practice, however, increasing the TPR often goes along with increasing the FPR as well, and vice versa. So, one typically has to settle for a trade-off between the hit rate and the false alarm rate. As an example, we refer to Sect. 8.2.8, where the relationship between the TPR and FPR is discussed in more detail.

ROC and AUC
One of the most popular methods and performance measures that takes the TPR and FPR into account simultaneously, and thus can quantify the trade-off between hit rate and false alarm, is the Receiver Operating Characteristic curve (ROC curve) and, derived from that, the Area Under this Curve (AUC); the latter is one of the most important performance measures for classifiers. The ROC curve visualizes the false positive rate (FPR) against the true positive rate (TPR). More precisely, consider a binary classifier (0 and 1 as the target categories) with a score function that calculates the probability of the data point belonging to class 1. By default, a model classifies a data point to class 1 if the score is higher than 0.5, and to class 0 if the score is lower than 0.5. This is the standard prediction which is returned by the model. However, for each threshold t between 0 and 1, a decision function can be constructed that assigns a data point to class 1 if and only if the predicted probability is greater than or equal to the threshold, i.e.,

if score ≥ t, then class = 1; if score < t, then class = 0.

For each of these thresholds t, the target variables are predicted via the above decision rule and the confusion matrix is calculated, from which the TPR(t) and the FPR(t), which depend on t, can be derived. Visualizing these values TPR(t) and FPR(t), 0 ≤ t ≤ 1, in a graph, they form a curve, the above-mentioned ROC curve. See Fig. 8.14 as an example of a typical ROC curve. Of course, this procedure works analogously with every other form of score. To get a better understanding, let us take a quick look at a toy example. For a more detailed explanation of how the ROC curve is created, we refer to Sect. 8.2.8. Consider the following probability predictions (Fig. 8.11):
Fig. 8.11 Example predictions for ROC
Fig. 8.12 TPR and FPR for different thresholds
Fig. 8.13 Example predictions to calculate the ROC curve for the threshold 0.5
To calculate the ROC curve, the FPR and TPR are determined for the thresholds 1, 0.8, 0.6, 0.5, 0.2, 0. We only have to consider the thresholds that correspond to the predicted probabilities, since these are the only values where the confusion matrix and its entries change. The results are displayed in Fig. 8.12, including the confusion matrix that contains the TP, FP, FN, and TN values in the different cases. As an example, let us take a look at t = 0.5. There are three data points with a probability for class 1 of at least 0.5; hence, they are assumed to belong to class 1. Two of them are actually of class 1, while one is misclassified (the one with probability 0.6) (see Fig. 8.13). The last data point with probability 0.2 is assigned to class 0, since its predicted probability is smaller than the threshold. With these predicted classes, compared with the true classes, we can construct the confusion matrix and easily calculate the TPR and FPR. See the third column in Fig. 8.12.

The optimal classifier has TPR = 1 and FPR = 0, which means that a good classification model should have an ROC curve that goes from the origin in a relatively straight line to the top left corner, and then to the point (1,1) on the top right. Figure 8.14 shows an ROC curve of an optimal classifier (right graph) and a typical ROC curve (left graph). The diagonal symbolizes the random model, and if
Fig. 8.14 A typical ROC curve is illustrated in the first graph and an optimal ROC curve in the second graph
the curve coincides with it, it doesn’t perform any better than guessing. In cases where the ROC curve is under the diagonal, it is most likely that the model classifies the inverse, i.e., 0 is classified as 1 and vice versa.

From the ROC curve, one of the most commonly used goodness-of-fit measures for classification models can be derived: the Area Under the Curve (AUC). It is simply the area under the ROC curve and can take values between 0 and 1. With the AUC, two models can be compared, and a higher value indicates a more precise classifier and thus a more suitable model for the data. For more detailed information on the ROC curve and AUC, we recommend Kuhn and Johnson (2013) and Tuffery (2011), as well as the explanations in Sect. 8.2.8. In the SPSS Modeler, the ROC curve can be drawn with the Evaluation node, which is explained in Sect. 8.2.7.

Gini Index
Another important and popular goodness-of-fit measure for a classification model is the Gini index, which is just a transformation of the AUC, i.e.,

Gini = 2 · AUC − 1.

As with the AUC, a higher Gini index indicates a better classification model. The shaded area in the first graph in Fig. 8.14 visualizes the Gini index and shows the relationship with the AUC.
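In the SPSS Modeler these quantities are produced by the Evaluation and Analysis nodes. Outside the Modeler, the same ROC points, AUC, and Gini index can be sketched, e.g., with scikit-learn; the true classes and scores below are hypothetical.

```python
# Sketch: ROC curve points, AUC, and Gini index computed with scikit-learn.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                     # actual classes
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]    # predicted P(class = 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # points of the ROC curve
auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1                                     # Gini = 2 * AUC - 1
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
# Plotting fpr against tpr reproduces a curve like the one in Fig. 8.14.
```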
8.2.6 The Analysis Node
The confusion matrix and the performance measures presented in the previous section are implemented in the SPSS Modeler and can be displayed via the Analysis node, along with additional statistics. For this purpose, the Analysis node has to be
Fig. 8.15 Options in the Analysis node for a classification model
added to the model nugget (the part in the stream that contains the trained model, see, e.g., Chap. 5 or Sect. 8.3.2) and then executed. In Fig. 8.15, the options of the Analysis node are shown. There, the three most important statistics and measures that are available for a classification model are marked: Coincidence matrices, Evaluation metric, and Confidence figures. The output of the Analysis node for each of these metrics is described in Table 8.3. In addition to the mentioned statistics, a Performance Evaluation can be performed by the Analysis node, which calculates Kononenko’s and Bratko’s information-based evaluation criterion. We will not further address this criterion in this book since it is rarely used in practice. We refer the interested reader to Kononenko and Bratko (1991). The output of the Analysis node for a binary target variable can be viewed in Fig. 8.16, which shows the statistics for a classifier on the Wisconsin breast cancer data (see stream in Sect. 8.3.6) for the training and the test set separately. This is due to the option “Separate by partition”, which is enabled by default. The accuracy is
Table 8.3 List of the available statistics for a classification model in the Analysis node

Coincidence matrices: Display of the confusion matrix that shows the predicted target values against the true target categories for each data record. This helps to identify systematic prediction errors. See Sect. 8.2.5 for more details on the confusion matrix.

Evaluation metric (AUC & Gini): Display of the AUC and Gini index. This option can only be chosen for a binary classifier. See Sect. 8.2.5 for more details on the measures. To calculate the values of these measures, the target variable has to be set to flag, as a measurement in the Type node. See Fig. 8.17.

Confidence figures: When enabled, some confidence values and statistics describing the distribution of the predicted probabilities are calculated. This option is only available if the model generates a class probability or score as prediction, e.g., logistic regression. For a detailed description of these statistics, see IBM (2019c).
shown at the top of the output, followed by the coincidence resp. confusion matrix and a variety of confidence statistics, which describe the distribution of the predicted scores resp. probabilities of the model. The Gini index and AUC are displayed at the bottom of the Analysis node output. All statistics are individually calculated for all data partitions, i.e., the training and test set, respectively. The output in the multiple classification case is basically the same, except for the missing AUC and Gini values, which are only implemented for binary target variables. For an example of such a multiple classifier output, see Fig. 8.118.
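For readers who want to reproduce these per-partition statistics outside the Modeler, the following sketch computes accuracy, the coincidence matrix, the AUC, and the Gini index separately for a training and a test partition. pandas, scikit-learn, and the column names used here are assumptions for illustration.

```python
# Sketch: per-partition statistics similar to the Analysis node output.
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

df = pd.DataFrame({
    "partition":  ["train"] * 4 + ["test"] * 4,
    "target":     [1, 0, 1, 0, 1, 0, 1, 1],
    "prediction": [1, 0, 1, 1, 1, 0, 0, 1],
    "score":      [0.9, 0.2, 0.8, 0.6, 0.7, 0.4, 0.3, 0.9],
})

for name, part in df.groupby("partition"):
    acc = accuracy_score(part["target"], part["prediction"])
    cm = confusion_matrix(part["target"], part["prediction"])
    auc = roc_auc_score(part["target"], part["score"])
    print(name, "accuracy:", round(acc, 3), "AUC:", round(auc, 3),
          "Gini:", round(2 * auc - 1, 3))
    print(cm)   # coincidence matrix: rows = actual, columns = predicted
```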
8.2.7 The Evaluation Node
With the Evaluation node, the SPSS Modeler provides an easy way to visualize the ROC curve, along with other evaluation curves. These evaluation charts show the prediction performance of models and are commonly used to compare different models with each other in order to find the best-fitted one. To plot such an evaluation chart, one has to connect the Evaluation node to the model nugget and afterwards run it. In Fig. 8.18, the options of the Evaluation node in the Plot tab are shown. In particular, the figure shows the setting for a ROC curve; hence, the Chart type is set to Receiver Operating Characteristic (ROC). The Evaluation node also provides other graphs that can be selected as Chart type: Gains, Response, Lift, Profit, Return on Investment (ROI). The description of these graphs is postponed until the end of this section. In the following, we list the most important options available in the Evaluation node to configure the graphs. We refer to IBM (2019c) for further information.

• Cumulative plot. If this option is enabled, the cumulative chart is plotted instead of a graph showing the increments. A cumulative chart is usually smoother, as it neglects the increment fluctuation. For this reason, this option is enabled by default. This option is not available for ROC curves.
Fig. 8.16 Output of the Analysis node of a classifier with two target categories, here, a logistic regression classifier on the Wisconsin breast cancer data (see Sect. 8.3.6)
• Include baseline. Select this option if a reference line of the naïve random model should be included in the graph. For example, for a ROC chart, this is the diagonal, which indicates the model that randomly classifies the data. This option is not available for Profit and ROI charts. We recommend always drawing this line to visualize the improvement of the prediction by the model.
Fig. 8.17 Defining the target variable measurement as “Flag” in the Type node. This ensures correct calculation of the AUC and Gini
• Include best line. This option adds a line representing the optimal model, i.e., no misclassifications, to the chart. This option is not available for ROC charts.
• Separate by partition. For each of the partitions (training, test, validation), a separate graph is plotted if a partition field is used to split the data, e.g., with a Partition node.
• Plot (Percentiles). This option is available for all graphs except the ROC. Prior to the calculation of these other evaluation charts, the dataset is split into quantiles, i.e., the data are sorted by the predicted score in descending order and then split into equal parts (quantiles). So, the data records with the highest scores are in the first part, followed by the second highest scored data in the second part, and so on. Then the required calculations are done on each quantile and the results are plotted as the desired graph. Possible options are: Quartiles, Quintiles, Deciles, Vingtiles, Percentiles, and 1000-tiles. Needless to say, smaller quantiles (Percentiles or 1000-tiles) ensure a smoother graph, whereas large quantiles (Quartiles or Quintiles) generate a rougher graph.
• Style. With this option the style of the graph is defined. The graph can be drawn as points or as a line.
• Profit settings. With this option, costs, revenues, and weights associated with each record can be specified. So, instead of the default counts of hits and misses, the values defined here are summed up. The defined values can be fixed or
Fig. 8.18 Options of the Evaluation node, here, for the visualization of the ROC curve
variable, whereby, for the latter a variable field of the dataset has to be selected. The profit settings option is mandatory for Profit and ROI chart, but can also be activated for all other graphs, except the ROC, via the ‘Use profit criteria for all chart types’ option at the top of the Evaluation node. In the Option tab, the user is able to define a hit, i.e., a positive case, and a scoring criterion. This ensures a flexibility to overrule the default values, which are automatically extracted from the information reported by the model. This might be beneficial if, for example, lower scores are better than higher. In this case, the hit criteria can be selected as ‘@TARGET T, and ‘no purchase’ otherwise, for a given threshold T. The density functions with the threshold are visualized in Fig. 8.29, which is based on Wikipedia (2019). The TPR describes the probability of a correct classified ‘purchase’ data record, i.e., the probability that X > T if a ‘purchase’ instance is given. But, this is nothing other than the area under the curve f_1(x) right of the threshold, i.e., the integral of f_1(x) from T to infinity resp. 1 since X predicts a probability,
Fig. 8.29 Probability density curves of ‘purchases’ and ‘no purchases’ and visualization of the TPR
TPR = ∫_T^{+∞} f_1(x) dx,

which is also highlighted by the shaded area in Fig. 8.29. In the same manner, the TNR, which describes the probability that X ≤ T if a ‘no purchase’ instance is given, equals the area under the curve f_0(x) from minus infinity resp. 0 to T, i.e.,

TNR = ∫_{−∞}^{T} f_0(x) dx.

So, we deduce

FPR = 1 − TNR = 1 − ∫_{−∞}^{T} f_0(x) dx = ∫_T^{+∞} f_0(x) dx,
which is visualized in Fig. 8.30. We see that the FPs are located in the overlapping area of the densities. This should not be surprising, since in this overlapping interval the model is unsure of the true class, hence, leading to misclassifications of the model, which can only make decisions of the form ‘greater than T’ or ‘smaller than T’. For example, in Figs. 8.29 and 8.30, the threshold T separating ‘purchase’ and ‘no purchase’ equals the standard threshold used by a classifier, i.e., T = 0.5. Let’s assume a score is just a bit larger than 0.5, e.g., 0.51,
Fig. 8.30 Visualization of the FPR in the overlapping area of the probability density curves of ‘purchases’ and ‘no purchases’
then the decision that a purchase happens is not assured; in fact, it is almost a coin flip decision. Therefore, a classifier is good if the density curves do not overlap too much and the centers of the distributions are far away from each other. With more separated density curves, fewer misclassifications can occur, and the classifier provides a higher accuracy. See Fig. 8.31 for density curves with little overlap, belonging to a model with higher accuracy.

Sliding the threshold to the left and right in the graphic of the density curves, we can get a better understanding of the behavior and the dependencies between TPR and FPR as well as the quality of a classifier. Recall from Sect. 8.2.5 that for a model having strong classification power, the TPR has to be high and the FPR low. So, moving the threshold to the left would increase the TPR, but at the same time, increase the FPR as well, because a larger area of the density f_0(x) is now located to the right of T. This is illustrated in Fig. 8.32. Thus, moving the threshold to the left generates more FPs and we are in the crying wolf situation (see once again Sect. 8.2.5). On the other hand, sliding T to the right reduces the FPR as well as the TPR simultaneously. See Fig. 8.33.

After these preliminaries, we can now describe how the ROC curve is calculated using the same procedure. As the ROC curve describes the relationship of the TPR and FPR, and with it, the ability of the model to separate the two target classes, here, ‘purchase’ and ‘no purchase’, we have to determine these two measures for different thresholds T. So, starting at the right, i.e., T = ∞ resp. T = 1, we move the threshold to the left and calculate the TPR and FPR for each T. However, we only have to consider the thresholds that correspond to the predicted probabilities ‘$LP-1’ since
Fig. 8.31 Visualization of the density curves with little overlap belonging to a classifier that separates the ‘purchases’ from the ‘no purchases’ very accurately
Fig. 8.32 A threshold further to the left increases the TPR as well as the FPR simultaneously
these are the only values where the classification of a data record might change, and with it, the values of the TPR and FPR. For T = ∞ resp. T = 1, we see that all scores, and thus, both density curves, lie completely left of T, which leads to a TPR and FPR of 0. As mentioned above, we
Fig. 8.33 A threshold further to the right reduces the TPR as well as the FPR simultaneously
now start moving T to the left, stopping at every predicted probability ‘$LP-1’ for new calculations of TPR and FPR. We recommend always starting at the right and moving from the highest to the lowest score of a purchase. For this purpose, we sort the rows of the output table of the model by the scores ‘$LP-1’ in descending order (see Table 8.5). Now, we can perform our calculations successively from the top to the bottom of the table. So, we start our calculations at the first row of Table 8.5 with a score of 0.988. We find the following: Threshold T = 0.988
Conclusion Each row with a score above or equal to 0.988 will be classified as ‘purchase’, which is only one data record (the one in row 1) in this case. All other rows are assigned to the class ‘no purchase’. Since row 1 is an actual purchase, we conclude that TP = 1 and FP = 0. There are a total of eight actual purchases in the dataset, and thus, TPR = 1/8 = 0.125. Furthermore, FPR = 0. In Fig. 8.34, these calculations are visualized.
Moving our threshold line further to the left, we find in row 2 of Table 8.5: Threshold T = 0.975
Conclusion Each row with a score above or equal to 0.975 will be classified as ‘purchase’, which are two (row 1 and row 2) in this case. Since both of these rows are actual purchases, we get that TP = 2 and FP = 0. With the same calculations as above, we get TPR = 2/8 = 0.250 and FPR = 0.
Fig. 8.34 Probability density curve and threshold for row 1 of Table 8.5
Following the same procedure as for the first two rows, we can calculate the TPR and FPR for each threshold of the first seven rows. Thus, in every step, the TPR increases by 0.125 up to TPR = 7/8 = 0.875 for row 7, where T = 0.793. Meanwhile, the FPR stays at 0 for all of these rows. So next, let us have a look at row 8. We get: Threshold T = 0.527
Conclusion Each row with a score above or equal to 0.527 will be classified as ‘purchase’, i.e., the first eight rows. The first seven rows are actual purchases, whereas row 8 belongs to a ‘no purchase’ data record. So, we have our first false positive classification, leading to TP = 7 and FP = 1. Still, TPR = 7/8 = 0.875, but, for the first time, the FPR is positive. Since there are a total of four ‘no purchase’ cases in the data, we get FPR = 1/4 = 0.250. See Fig. 8.35 for a visualization of this case.
We continue with this procedure for every following row in Table 8.5 until we arrive at the last row, row 12, where we determine the final values for the ROC curve: Threshold T = 0.114
Conclusion Each row with a score above or equal to 0.114 will be classified as ‘purchase’, which are all records in the dataset. So, we obtain TP = 8 and FP = 4, and we get that TPR = 8/8 = 1 and FPR = 4/4 = 1.
All the values of the above calculations are listed in Table 8.6. Besides the output of the classifier, the TP, FP, TPR, and FPR are listed for each row, which means that the threshold in this case is the score ‘$LP-1’ of that particular row. Furthermore, the
Fig. 8.35 Probability density curve and threshold for row 8 of Table 8.5, where we find TPR ¼ 0.875 and FPR ¼ 0.250
column ‘AUC’ contains the increment of the area under the ROC curve from the previous row to the particular one.

Finished with our manual calculations, we can finally visualize the ROC curve, which is displayed in Fig. 8.36, by plotting the values of the columns ‘False Positive Rate’ against ‘True Positive Rate’. When comparing the shape of the ROC curve with the process of moving the threshold through Table 8.6 from top to bottom, we observe that if a hit (‘purchase’) is seen, the ROC curve goes up, and if a miss (‘no purchase’) is seen, the ROC curve goes one step to the right. So, the (empirical) ROC curve has the form of a step function in this simple case. The reason is that each predicted score is unique in the dataset. If, otherwise, ties occur, that means, numerous samples in the dataset have the same predicted score, multiple steps are done at once, and in this region, the ROC curve is a line with positive and finite slope.

The calculation of the AUC (area under the curve) is pretty simple since the ROC curve is a step function in our example. The AUC is just the sum of all rectangles generated by two successive rows. Thus, the width of the rectangle is the difference between the FPR and the FPR of the previous row, whereas the height is calculated with the formula

m = (TPR_previous + TPR)/2.

In the Microsoft Excel file ‘ROC_manual_calculation.xlsx’, we can now implement the formula as shown in Fig. 8.37. By summing up all increments, we get an AUC of 0.875, which is the exact value calculated by the Analysis node.
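The same step-by-step procedure can also be written down compactly in a few lines of code instead of a spreadsheet. The sketch below uses the actual classes and scores of Table 8.5 resp. 8.6 and reproduces the AUC of 0.875; Python is an assumption here, the book's worked example uses Microsoft Excel.

```python
# Sketch of the manual ROC/AUC calculation above, using the classes and
# scores of the example (Table 8.5/8.6).
actual = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1]          # 1 = purchase
scores = [0.988, 0.975, 0.961, 0.954, 0.929, 0.839,
          0.793, 0.527, 0.340, 0.306, 0.274, 0.114]    # '$LP-1'

pairs = sorted(zip(scores, actual), reverse=True)      # highest score first
P = sum(actual)                                        # 8 actual purchases
N = len(actual) - P                                    # 4 actual 'no purchase'
tp = fp = 0
tpr_prev = fpr_prev = 0.0
auc = 0.0
for score, y in pairs:                 # threshold moves from the highest score down
    if y == 1:
        tp += 1                        # hit: the ROC curve goes one step up
    else:
        fp += 1                        # miss: the ROC curve goes one step to the right
    tpr, fpr = tp / P, fp / N
    auc += (fpr - fpr_prev) * (tpr_prev + tpr) / 2     # increment, as in Fig. 8.37
    tpr_prev, fpr_prev = tpr, fpr

print(auc)   # 0.875, the value reported by the Analysis node
```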
Table 8.6 The output of the classifier extended by the values of the above calculations. In each row, the TP, FP, TPR, FPR, and AUC increments are listed for the threshold being the score ‘$LP-1’ of the particular row

Row  Purchase  $LP-1   TP  FP  TPR = TP/P  FPR = FP/N  AUC
1    1         0.988   1   0   0.1250      0.0000      0.0000
2    1         0.975   2   0   0.2500      0.0000      0.0000
3    1         0.961   3   0   0.3750      0.0000      0.0000
4    1         0.954   4   0   0.5000      0.0000      0.0000
5    1         0.929   5   0   0.6250      0.0000      0.0000
6    1         0.839   6   0   0.7500      0.0000      0.0000
7    1         0.793   7   0   0.8750      0.0000      0.0000
8    0         0.527   7   1   0.8750      0.2500      0.2188
9    0         0.340   7   2   0.8750      0.5000      0.2188
10   0         0.306   7   3   0.8750      0.7500      0.2188
11   0         0.274   7   4   0.8750      1.0000      0.2188
12   1         0.114   8   4   1.0000      1.0000      0.0000

(Purchase: actual value, 1 = purchase / 0 = no purchase; $LP-1: score of a purchase calculated by the model; TP: number of true positives; FP: number of false positives; AUC: area under the curve increment)
Fig. 8.36 Graph of the manually calculated ROC curve, which displays the TPR and FPR of Table 8.6
Fig. 8.37 AUC calculation of the example in Microsoft Excel
Let us pause here for a moment and have a look at the ROC curve that is generated by the SPSS Modeler for this example. The ROC curve plotted by the Evaluation node can be seen in Fig. 8.38. We notice that the curve looks different at the right side than our manually calculated graph in Fig. 8.36. The data point of row 11 in Table 8.6, i.e., (1, 0.875), seems to be missing. Also, the area under the curve of this ROC curve is obviously larger than 0.875, which does not match the AUC calculated by the Analysis node. This is an example of the inaccuracy of the ROC curve generated by the Evaluation node, which was mentioned in Sect. 8.2.7. We would like to mention that the calculations of the ROC curve and the AUC are simple in our example, but can be a bit more complicated in practice, for example,
Fig. 8.38 ROC curve generated by the Evaluation node of the SPSS Modeler
when ties occur and multiple data records have the same scores. In this case, the algorithm described here has to be slightly modified. For the interested reader, we refer to Fawcett (2003) for a description of modifications of the algorithm that calculates the ROC curve.
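As a rough illustration of this modification (not the Modeler's implementation), records that share the same score can be processed as one block, so that the ROC curve moves diagonally across the block instead of in unit steps:

```python
# Sketch of a tie-aware computation of the ROC points: all records sharing a
# score are handled as one block. Illustrative only; see Fawcett (2003).
from itertools import groupby

def roc_points_with_ties(actual, scores):
    P, N = sum(actual), len(actual) - sum(actual)
    pairs = sorted(zip(scores, actual), reverse=True)       # highest score first
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, block in groupby(pairs, key=lambda p: p[0]):     # one block per unique score
        labels = [y for _, y in block]
        tp += sum(labels)                  # all hits of the block at once
        fp += len(labels) - sum(labels)    # all misses of the block at once
        points.append((fp / N, tp / P))
    return points                          # (FPR, TPR) pairs of the ROC curve

print(roc_points_with_ties([1, 0, 1, 0], [0.8, 0.8, 0.6, 0.2]))
```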
8.2.9 Exercises
Exercise 1: Building a Classifier and Prediction for New Data
1. Explain in your own words the steps for building a classifier.
2. Consider a classifier for the assignment of a “good” or “bad” credit rating to bank customers. The classifier has the following set of rules:
Use this classifier to predict the credit rating of the following customers:
Exercise 2: Calculation of Goodness of Fit Measures
Recall the example of credit card rating from Exercise 1. Assume now that the classifier uses a scoring function that calculates the probability that a customer has a “good” credit rating. The probabilities of the customers and the true credit ratings are as follows:
1. Predict the credit rating, where the classifier suggests a “good” rating if the probability is greater than 0.5, and a “bad” rating otherwise. What are the numbers of TP, FP, TN, and FN and the accuracy in this case?
2. Draw the ROC curve and calculate the AUC from the probabilities of a “good” credit rating and the true credit ratings.

Exercise 3: Pitfalls in Classification
When building a classification model, there are many factors you have to keep in mind, which can negatively influence the ability of the prediction model and lead to worse models. These include:
• Imbalanced data: The target variable is very unevenly distributed in your training data. For example, in a binary target variable there is only 5% of 1 and 95% of 0. This can happen in fraud data, for example, since fraud is rare compared to normal use.
• Outliers: These are data points or variable values that are located far away from all the others. These can result from measurement errors, for example.
• Missing values: Data may not always be complete, and some data records contain missing values. That means, not every attribute of a data record has a proper value, i.e., the data of this variable is missing. This incompleteness can occur if, for example, the data are unavailable, like customer information for sales transaction data, or the variable wasn’t considered to be important in the past.
• A huge number of predictor variables: The number of input variables is pretty high compared to the number of observations. One example is gene data, which comprise a huge number of measured values, but samples and tissues of a particular disease are often rare, so the number of observations is very small.
Give reasons why these effects can worsen your model and prevent a good prediction. Explain the term “overfitting” in your own words.
1. Describe the different table cells in terms of: True Positives, False Positives, True Negatives, and False Negatives.
2. Which cell represents the Type I resp. the Type II error?
3. Recall from Sect. 8.2.5 that the measure recall is calculated as

Recall = True Positives / (True Positives + False Negatives)

and describes the proportion of correctly identified positive cases to all actual positive cases, e.g., the number of customers who actually bought a product and(!) were also predicted as such, divided by all customers who actually bought it. A similar measure is the precision, which is calculated as

Precision = True Positives / (True Positives + False Positives)

and describes the proportion of correctly identified positive cases to all positively predicted cases, e.g., the number of customers who actually bought a product and(!) were also predicted as such, divided by all customers who are predicted as buyers. Precision is also referred to as the positive predictive value (PPV).
Mark the relevant cells in the confusion matrix which are used for the calculation of recall and precision. Give two examples, one of a prediction having a high recall and low precision, and one with a low recall and high precision.
Fig. 8.39 Predicted credit ratings
Table 8.7 Values of TP, FP, TN, FN for the credit card rating

                            Predicted credit card rating
True credit card rating     Yes       No
Yes                         TP = 6    FN = 2
No                          FP = 1    TN = 4
8.2.10 Solutions

Exercise 1: Building a Classifier and Prediction for New Data
Theory discussed in section: Sect. 8.2.1
1. This has been explained in Sect. 8.2.1, and we refer to this section for the answer to this first question.
2. The predicted credit ratings are listed in Fig. 8.39.

Exercise 2: Calculation of Goodness of Fit Measures
Theory discussed in section: Sect. 8.2.4, Sect. 8.2.5
1. The predicted credit ratings are the same as in Exercise 1 and can be viewed in Fig. 8.39. Table 8.7 shows the confusion matrix with TP, FP, FN, TN. With these values, we obtain the accuracy
Fig. 8.40 FPR and TPR for the relevant threshold of the ROC curve
Fig. 8.41 ROC curve of the credit rating prediction
Accuracy = (TP + TN)/(TP + FP + FN + TN) = (6 + 4)/13 ≈ 0.769.
2. The values of the TPR and FPR for the relevant thresholds are listed in Fig. 8.40, and they are calculated as described in Sect. 8.2.8. The ROC curve then looks like the one shown in Fig. 8.41.
Fig. 8.42 Visualization of the imbalanced data problem. The minority class is highly misclassified
Exercise 3: Pitfalls in Classification
• Imbalanced data: Classifier algorithms can prefer the majority class and optimize the model to predict this class very accurately. As a consequence, the minority class is then poorly classified. This leaning towards the larger target class is often the optimal choice for the highest prediction accuracy when dealing with highly imbalanced data. In Fig. 8.42, the imbalanced data problem is visualized. The dots are underrepresented compared with the rectangles, and thus the decision boundary is matched to the latter class of data points. This leads to high misclassification in the dots’ class, and hence, 5 out of 7 dots are misclassified, as they lie on the wrong side of the boundary. In Chap. 10, this problem of imbalanced data is discussed extensively and a variety of methods for dealing with this problem are presented. In addition to Chap. 10, we refer the interested reader to Haibo He and Garcia (2009) for details on this issue.
• Outliers: An outlier in the data can vastly change the location of the decision boundary and, thus, highly influence the classification quality. In Fig. 8.43, the calculated decision boundary is heavily shifted, in order to take into account the outlier at the top. The dotted decision boundary might be a better choice for separation of the classes.
• Missing values: There are two pitfalls when it comes to missing values. First, a data record with missing values lacks information. For some models, these data are not usable for training, or they can lead to incorrect models, due to misinterpretation of the distribution or importance of the variables with missing values. Moreover, some models are unable to predict the target for incomplete data. There are several methods to handle missing values and assign a proper value to the missing field. See Chap. 11 and Han et al. (2012) for a list of concepts to deal with missing values.
• A huge number of predictor variables: If the ratio of input variables to the number of observations is extremely high, classifiers tend to overfit. The more
Fig. 8.43 Visualization of the outlier problem. The decision boundary is not chosen optimally, due to the outlier at the top
input variables exist, the more possibilities there are for splitting the data. So, in this case, we can definitely find a way to distinguish between the different samples through at least one variable. One possible technique for dealing with this problem is dimension reduction, via PCA or factor analysis (see Chap. 6). For further information, we refer to James et al. (2013).
Overfitting is a phenomenon that occurs when the model is too well-fitted to the training data, such that it is very accurate on the training data, but inaccurate when predicting unseen data. See Sect. 8.2.4, Fig. 8.9, and Sect. 5.1.2.

Exercise 4: Confusion Matrix and Type of Error
Theory discussed in section: Sect. 8.2.5
1. Figure 8.44 shows the confusion matrix with the correct description of the cells.
2. The types of error are also assigned to the confusion matrix cells shown in Fig. 8.44.
3. Figures 8.45 and 8.46 visualize the cells relevant to the calculation of recall and precision. As can be seen, the true negatives are not taken into account in either case. For the example of a high recall and low precision, let us stay with the purchase use case. Assume that we have a dataset of 100 customers with 10 positive (purchase) and 90 negative (no purchase) cases. Now, a model that identifies all positive cases is the naïve model that always predicts ‘purchase’, i.e., TP = 10. However, it also misclassifies the ‘no purchase’ cases as purchases, hence, FP = 90 and FN = 0. So, we get a high recall and low precision,
Fig. 8.44 Confusion matrix with description of the cells including the types of errors
Fig. 8.45 Visualization of recall
Fig. 8.46 Visualization of precision
Recall = 10/(10 + 0) = 1 and Precision = 10/(10 + 90) = 0.1.
Let us now assume the model is trained to be very “picky”, that is, it only predicts ‘purchase’ if it is really sure. So, for example, such a model predicts 2 of the 10 positive cases correctly, i.e., TP = 2, and all other data points are
Fig. 8.47 Profit chart with highlighted negative profit area and optimal target group size for maximal profit for the marketing campaign
classified as 'no purchase', which results in FP = 0 and FN = 8. So, we get a low recall and high precision,
\[
\text{Recall} = \frac{2}{2+8} = 0.2 \quad \text{and} \quad \text{Precision} = \frac{2}{2+0} = 1.
\]
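The two scenarios can be checked with a few lines of code. The following minimal Python sketch is not part of any Modeler stream; it simply recomputes recall and precision from the confusion-matrix counts used in the example above:

```python
# Recall and precision from confusion-matrix counts
# (purchase example: 10 positive and 90 negative cases).

def recall(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

# Naive model: always predicts 'purchase'
print(recall(tp=10, fn=0), precision(tp=10, fp=90))   # 1.0 and 0.1

# "Picky" model: predicts 'purchase' for only 2 of the 10 positives
print(recall(tp=2, fn=8), precision(tp=2, fp=0))      # 0.2 and 1.0
```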
Exercise 5: Interpretation of Evaluation Charts
1. We choose the cumulative profit chart and see that the campaign with the target group as planned is expected to generate a negative profit. As can be seen in Fig. 8.47, the profit line is under 0 at the far right. This means that the campaign with the planned target group would generate a negative profit, and a reduction of the target group to the best scored clients in the area where the profit line is above 0 would be better.
2. In the profit chart in Fig. 8.47, the area between the top 15% and 25% scored clients is marked, since in this interval the campaign generates the most profit. So, reducing the target group of top scored clients to a size in this area would be the optimal choice.
3. To gain at least 90% of the total response, we have to select approximately the top 32% scored clients for the campaign. This can be determined by the gain chart, which is shown in Fig. 8.48. There, the dotted lines mark the position where at least 90% of the response is captured by the target group. If the final campaign targets just the top 32% of clients, we can expect a response rate of around 34%; see the response chart in Fig. 8.49.
Fig. 8.48 Gain chart of the marketing campaign. The dotted lines mark the position where at least 90% of the response is captured by the target group
Fig. 8.49 Response chart of the marketing campaign with the expected response rate of the top 32% clients of about 34%
8.3 Logistic Regression
Logistic regression (LR) is one of the most famous classification models and is used for many problems in a variety of fields. It is of such relevance and importance, especially in the financial sector, that the main contributors to the theory received a Nobel Prize in Economics in 2000 for their work. LR is quite similar to the linear regression described in Chap. 5. The main difference is the scale of the target value. Linear regression assumes a numeric/continuous target variable and tries to estimate the functional relationship between the predictors and the target variable,
whereas in a classification problem, the target variable is categorical and linear regression models become inadequate for this kind of problem. Hence, a different approach is required. The key idea is to perform a transformation of the regression equation to predict the probabilities of the possible outcomes, instead of predicting the target variable itself. This resulting model is then the “LR model”.
8.3.1 Theory
The setting for a binary LR is the following: consider $n$ data records $x_{i1}, \ldots, x_{ip}$, each consisting of $p$ input variables, and each record having an observation $y_i$. The observations $y_1, \ldots, y_n$ thereby are binary and take values 0 or 1. Instead of predicting the categories (0 and 1) directly, the LR uses a different approach and estimates the probability of the observation being 1, based on the covariables, i.e., $P(y_i = 1 \mid x_{i1}, \ldots, x_{ip})$, with a regression

\[
h(x_{i1}, \ldots, x_{ip}) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}.
\]

However, using the naïve approach and estimating the probability directly with the regression, that is, $P(y_i = 1 \mid x_{i1}, \ldots, x_{ip}) = h(x_{i1}, \ldots, x_{ip})$, is not feasible, since the regression function is not bound between 0 and 1. More precisely, for particular values of the input variables, $h(x_{i1}, \ldots, x_{ip})$ can be greater than 1 or even negative. To solve this problem, the regression function is transformed with a function $F$, so that it can only take values in the interval [0, 1], i.e.,

\[
P(y_i = 1 \mid x_{i1}, \ldots, x_{ip}) = F(\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}),
\]

where $F$ is the logistic (distribution) function
\[
F(t) = \frac{\exp(t)}{1 + \exp(t)}.
\]
See Fig. 8.50 for the graph of the logistic function. Hence,

\[
P(y_i = 1 \mid x_{i1}, \ldots, x_{ip}) = \frac{\exp\bigl(h(x_{i1}, \ldots, x_{ip})\bigr)}{1 + \exp\bigl(h(x_{i1}, \ldots, x_{ip})\bigr)},
\]
and by taking the inverse of the logistic function
Fig. 8.50 Graph of the logistic (distribution) function
\[
\log\left(\frac{P(y_i = 1 \mid x_{i1}, \ldots, x_{ip})}{1 - P(y_i = 1 \mid x_{i1}, \ldots, x_{ip})}\right) = h(x_{i1}, \ldots, x_{ip}),
\]

we get the usual (linear) regression term on the right-hand side. This equation is referred to as the log-odds or logit, and it is usually stated as the defining equation of the logistic regression.

Odds
Recall that in the linear regression models, the coefficients $\beta_0, \ldots, \beta_p$ give the effect of the particular input variable. In LR, we can give an alternative interpretation of the coefficients by looking at the following equation,

\[
\frac{P(y_i = 1 \mid x_{i1}, \ldots, x_{ip})}{P(y_i = 0 \mid x_{i1}, \ldots, x_{ip})} = \exp(\beta_0) \cdot \exp(\beta_1 x_{i1}) \cdot \ldots \cdot \exp(\beta_p x_{ip}),
\]

which is derived from the former equation by taking the exponential. The quotient of probabilities on the left-hand side of this equation is called the odds and gives the weight of the probability of the observation being 1, compared to the probability that the observation is 0. So, if the predictor $x_{ik}$ increases by 1, the odds change by a factor of $\exp(\beta_k)$. This particularly means that a coefficient of $\beta_k > 0$, and thus $\exp(\beta_k) > 1$, increases the odds and therefore the probability of the target variable being 1. On the other hand, if $\beta_k < 0$, the target variable tends to be of the category 0. A coefficient of 0 does not change the odds, and the associated variable has, therefore, no influence on the prediction of the observation. This model is called the Logit model, due to the transformation function $F$, and it is the most common regression model for binary target variables.
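The logistic transformation and the odds interpretation can be illustrated with a short Python sketch. The coefficients below are arbitrary illustration values, not estimates from any dataset discussed in this book:

```python
import numpy as np

def logistic(t):
    # logistic (distribution) function F(t) = exp(t) / (1 + exp(t))
    return np.exp(t) / (1.0 + np.exp(t))

# arbitrary illustration coefficients: intercept beta0 and two slopes
beta = np.array([-1.0, 0.8, -0.5])

def prob_y_equals_1(x):
    # h(x) = beta0 + beta1*x1 + beta2*x2, then P(y=1|x) = F(h(x))
    h = beta[0] + np.dot(beta[1:], x)
    return logistic(h)

x = np.array([1.2, 0.7])
p = prob_y_equals_1(x)
odds = p / (1.0 - p)                          # P(y=1|x) / P(y=0|x)

# increasing x1 by one unit multiplies the odds by exp(beta1)
x_plus = x + np.array([1.0, 0.0])
p2 = prob_y_equals_1(x_plus)
print((p2 / (1 - p2)) / odds, np.exp(beta[1]))  # both approx. 2.2255
```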
Besides the Logit model, there are other models that use the same approach but different transformation functions, and we refer the interested reader to Fahrmeir (2013) for further information.
To perform an LR, some requirements are needed, which should be checked before building an LR model to assure a high quality of the predictions.

Necessary Conditions
1. The observations should be independent. That means that the records in the dataset do not come from repeated measurements or matched data.
2. The predictor variables should have a low collinearity, as otherwise the impact of several variables can be difficult to differentiate.
3. The number of samples should be large enough, so that the coefficients can be calculated reliably.

Multinomial Logistic Regression
If the domain of the target variable has more than two elements, the logistic regression model described above can be extended. Therefore, a base category of the domain is chosen, and for every other value, a binary logistic regression model is built against the base category. More precisely, let us assume that from the k different values of the target variable, the last one, k, is picked as the baseline category. Then, there are k − 1 independent binary LR models fitted to predict the probability of each of the first k − 1 outcomes against the base category. With these k − 1 models and their predicted probabilities for the first k − 1 classes, the probability of the baseline category can be calculated, since the sum of all probabilities of the k elements in the domain equals 1. For new data, the category with the highest probability estimated by the regression equations is then chosen as the prediction. See Azzalini and Scarpa (2012) for more details.

Goodness of Fit Measures
Besides the common performance measures introduced in Sect. 8.2.5, there are a variety of other measures and parameters to quantify the goodness of fit of the logistic regression model to the data. First, we mention the pseudo R-square measures, in particular the Cox and Snell, the Nagelkerke, and the McFadden measures, which compare the fitted model with the naïve model, i.e., the model which includes only the intercept. These measures are similar to the coefficient of determination R2 for the linear regression (see Sect. 5.2.1) in such a way that their values also lie between 0 and 1, and a higher value describes a better fit of the model. As a rule of thumb, a model with a McFadden measure between 0.2 and 0.4 can already be considered a good fit to the data. For the other two pseudo R-square measures, a value over 0.4 should be taken to represent a good fit of a model. We would further like to point out that the Cox and Snell R2 measure, in contrast to the other two measures, is always less than 1, even for a perfectly fitted model. See Tuffery (2011) and Allison (2014) for further information on the pseudo R-square measures.
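To give an impression of how these measures are derived from the log-likelihoods of the fitted and the intercept-only model, here is a small sketch with made-up log-likelihood values. Note that the Modeler's reported values may be based on its own likelihood computation (e.g., aggregated over subpopulations in the multinomial procedure), so applying these textbook formulas naively will not necessarily reproduce its output:

```python
import numpy as np

# hypothetical log-likelihoods of the intercept-only (null) and the fitted model
ll_null, ll_model = -450.0, -300.0
n = 500  # hypothetical number of observations

mcfadden   = 1.0 - ll_model / ll_null                         # approx. 0.33
cox_snell  = 1.0 - np.exp(2.0 * (ll_null - ll_model) / n)     # approx. 0.45
nagelkerke = cox_snell / (1.0 - np.exp(2.0 * ll_null / n))    # approx. 0.54
print(mcfadden, cox_snell, nagelkerke)
```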
Another measure for the goodness of fit used by the Modeler is the likelihood ratio test, which is also a measure that compares two fitted models with each other. Its statistic is the difference of the −2 log-likelihood values of the two models, which is asymptotically chi-square distributed. It is used to evaluate the effects of adding or removing a variable to the regression equation during the training process (see Sects. 8.3.2 and 8.3.4), as well as for testing the final model against the model including only the intercept. A large value of the likelihood ratio test statistic means a significantly improved fit of the "new" model (with added or removed variable), resp. the final model, compared to the "old", resp. naïve, model. See Azzalini and Scarpa (2012) and IBM (2019a) for a precise definition of the mentioned statistics and the likelihood ratio test.
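A minimal sketch of the test computation, with made-up −2 log-likelihood values for two nested models differing by one variable:

```python
from scipy import stats

# hypothetical -2 log-likelihood values of two nested models
neg2ll_old, neg2ll_new = 120.5, 112.3    # "new" model has one additional variable
df = 1                                   # difference in the number of parameters

chi_square = neg2ll_old - neg2ll_new     # likelihood ratio test statistic
p_value = stats.chi2.sf(chi_square, df)  # survival function = 1 - CDF
print(chi_square, p_value)               # small p-value -> significant improvement
```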
8.3.2 Building the Model in SPSS Modeler
A logistic regression model can be built with the Logistic node in the SPSS Modeler. We will present how to set up the stream for a logistic regression, using this node and the Wisconsin breast cancer dataset, which comprises data from breast tissue samples for the diagnosis of breast cancer. A more detailed description of the variables and the medical experiment can be gleaned from the motivating Example 1, Sect. 12.1.40 and Wolberg and Mangasarian (1990).

Description of the model
Stream name: Logistic regression
Based on dataset: WisconsinBreastCancerData.csv (see Sect. 12.1.40)
Stream structure: (stream diagram)
Important additional remarks: The target variable should be categorical. If it is continuous, a (multiple) linear regression might be more appropriate for this problem (see Chap. 5).
Related exercises: All exercises in Sect. 8.3.7
Fig. 8.51 Template stream of the Wisconsin breast cancer data
1. We open the template stream "016 Template-Stream_WisconsinBreastCancer" (see Fig. 8.51) and save it under a different name. The target variable is called "class" and takes the values 2 for "benign" and 4 for "malignant" samples.
2. Next, we open the Type node to specify the variable types. See Sect. 2.1 for the description of this node. Figure 8.52 shows an example of how a Type node should look in this case. The target variable "class" is defined as type "Flag", which means that it can take only two different values, and its role is set to "Target". This ensures that the modeling node automatically identifies the variable "class" as the target variable for a binary classification problem. Furthermore, the role of the "Sample code" is defined as "None", in order for the modeling node to ignore this variable, as it is only the unique identifier for the tissue samples in the dataset.
Note: We make sure that all variables, except for "class", are defined as continuous, although these are discrete and can take only a finite number of values. But, based on the variable description, we can interpret these variables as interval or ratio scaled (see Sect. 12.1.40).
3. We add the Logistic node to the canvas and connect it with the Type node. We open the Logistic node with a double-click, to set the parameters of the logistic regression model.
4. In the Fields tab, we can select the target and input variables. It provides two options for the variable selection: If the "Use predefined roles" option is enabled, the Logistic node automatically sets the target variable and input variables by their role as defined in the Type node (see Fig. 8.52). For better illustration, we choose the "Use custom field assignments" option here, and we have to specify these variables manually. See Fig. 8.53 for the selection of variables for this example. For this model, the target variable can only be categorical, since we intend to build a classification model. Here, the target variable "class" indicates if a sample is tumorous (4 for malignant) or healthy (2 for benign). As input variables, we select all but the "Sample code", as this variable only names the samples, and so is irrelevant for any classification.
Fig. 8.52 Detailed view of the Type node for the Wisconsin breast cancer data
5. In the Model tab, we can select if the target variable is binary or has more than two categories. See the top arrow in Fig. 8.54. We recommend using the Multinomial option, as this is more flexible, and the procedure used to estimate the model doesn't differ much from the Binomial one, while the results are equally valid.
As in the other (linear) regression models (see Chap. 5), we can choose between several variable selection methods. See Sect. 5.3.1 for more information on these methods. Table 8.8 shows the available variable selection methods for Multinomial and Binomial logistic regression. For a description of these methods, see Table 5.2. In the example of the breast cancer data, we choose the Backwards stepwise method, see the middle arrow in Fig. 8.54. That means the selection process starts with a complete model, i.e., all variables are included, and then removes variables with no significant impact on the model fit step by step. Furthermore, already removed variables are reevaluated as to whether they add to the prediction power of the model. These procedures are repeated until the resulting model cannot be improved any further.
We can also specify the base category, which is the category of the target variable that all other categories are compared with. In other words, the base category is interpreted as 0 for the logistic regression, and the probability of
Fig. 8.53 Selection of the criterion variable and input variables for logistic regression, in the case of the Wisconsin breast cancer data
nonoccurrence in the base category is estimated by the model. See the discussion on multinomial logistic regression in Sect. 8.3.1.
By default, each input variable will be considered separately, with no dependencies or interactions between each other. This is the case when we select the model type "Main Effects" in the model options (see Fig. 8.54). On the other hand, if "Full Factorial" is selected as the model type, all possible interactions between predictor variables are considered. This will lead to a more complex model that can describe more complicated data structures. The model may suffer from overfitting in this situation, however, and the computation time may increase substantially due to the number of new coefficients that have to be estimated. If we know the interactions between variables, we can also declare them manually. This is shown in detail in Sect. 8.3.3.
If we select the Binomial logistic model, it is also possible to specify the contrast and base category for each categorical input. This can sometimes be
Fig. 8.54 Options in the Model tab of the Logistic node including the variable selection method
useful if the categories of a variable are in a certain order, or a particular value is the standard (base) category to which all other categories are compared. As this is a feature for experienced analysts, however, we omit giving a detailed description of the options here and refer the interested reader to IBM (2019b). 6. We recommend including the predictor importance calculations in the model, which can be chosen in the Analyze tab. But as a drawback, these calculations may increase the runtime of this node significantly. See Fig. 8.55 and the references given in Sect. 5.3.3, for information on predictor importance measures. Furthermore, the calculation of raw propensity scores can be activated in the Analyze tab. Here, we are not interested in these scores, as they don’t add
Table 8.8 List of the variable selection methods for Multinomial and Binomial logistic regression

Method                  Multinomial   Binomial
Enter (no selection)    X             X
(Forwards) stepwise     X             X
Forwards                X
Backwards               X
Backwards stepwise      X             X
information to the prediction. For further information on propensity scores, we refer to IBM (2019b).
7. We run the stream to build the model. The model nugget, which contains the trained model, appears and is included in the stream. We then add the Table node to the model nugget and run it, in order to visualize the predictions of the trained model. Figure 8.56 shows the output of the Table node. On the right, we see four new columns, beginning with a '$', added to the original dataset. These contain the predictions of the model. For a description and naming convention of these new columns, see the Infobox and Table 8.9 below. In the model nugget, additional configurations can be made to specify which of these prediction columns are returned by the model. See Sect. 8.3.4.
The variables containing the prediction of a classifier have the following naming convention. The names of these variables consist of the name of the target variable that the model is predicting and a prefix indicating the type of prediction. More precisely, the prefix '$X-' characterizes the predicted category, whereas '$XP-' stands for the probability of the predicted category. The X in the prefix, thereby, denotes the classification method, '$L-' standing for a logistic regression, '$S-' for a support vector machine, and so on. In the present example, where a logistic regression is used to predict the target variable class, the new variables are named '$L-class' and '$LP-class'. Furthermore, for each category of the target variable, a new field is added to the dataset containing the predicted probability that the data record belongs to this category. These variables are named based on the category name, prefixed by '$XP-' with X representing the model type as described above. For example, the class variable has values '2' and '4'; therefore, two variables are added, named '$LP-2' and '$LP-4'. In Fig. 8.56, the variables of the example described are shown, and we refer to IBM (2019b) for further information on the naming convention.
8. We recommend adding an Analysis node (see Sect. 8.2.6) to the model nugget to quickly view the hit rate and goodness of fit statistics and thus evaluate the quality of the model. See Fig. 8.57 for the accuracy and Gini/AUC values of the model for the Wisconsin breast cancer data. Here, we have 97% correctly classified samples and a very high Gini value of 0.992. Both indicate a well-fitted model using the training data. (A scikit-learn sketch of this evaluation is given at the end of this section.)
Fig. 8.55 The Analyze tab of the Logistic node, where the calculation of predictor importance can be enabled
Fig. 8.56 View of the Table node, which shows the dataset including the variables containing the prediction of the model
9. As we are faced with a binary classification problem, we add an Evaluation node (see Sect. 8.2.7) to the model nugget, to visualize the ROC curve in a graph and finish the stream. We open the node and select the ROC graph as the chart type (see Fig. 8.58).
Table 8.9 Variables of the predictions performed by the classification model

Variable: $X-
Description: Prediction of the target variable category. This is the category with the maximal predicted probability '$XP-'. Here, '$L-class' can take the values '2' and '4'.
Variable name in this example of a logistic regression: $L-class
Value of row 1 in Fig. 8.56: $L-class = 2. This is the category with the highest probability.

Variable: $XP-
Description: Probability of the prediction '$X-', also called "associated probability". Equals the maximum of the target category predictions '$XP-<category>'.
Variable name in this example of a logistic regression: $LP-class
Value of row 1 in Fig. 8.56: $LP-class = 0.982. This is the maximum probability of $LP-2 and $LP-4.

Variable: $XP-<category>
Description: Probability that the data record belongs to the target category. E.g., '$LP-4' is the probability that the data record belongs to class 4.
Variable names in this example of a logistic regression: $LP-2, $LP-4
Values of row 1 in Fig. 8.56: $LP-2 = 0.982, $LP-4 = 0.018
Fig. 8.57 The Analysis node with the accuracy and AUC of the final logistic regression model
After clicking on the “Run” button at the bottom, the graph output pops up in a new window. This is displayed in Fig. 8.59. The ROC curve is visualized with a line above the diagonal and has nearly the optimal form, whereas the diagonal symbolizes the purely random prediction model (recall Sects. 8.2.5 and 8.2.7).
Fig. 8.58 Settings for the ROC in the Evaluation node
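For readers who want to reproduce the general workflow outside the Modeler, the following scikit-learn sketch fits a logistic regression and reports accuracy, AUC, and the Gini measure (Gini = 2·AUC − 1). It uses scikit-learn's bundled breast cancer dataset as a stand-in for WisconsinBreastCancerData.csv, and there is no stepwise variable selection, so the numbers will not match the Modeler output exactly:

```python
# Rough scikit-learn analogue of the Modeler stream described above.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)

model = LogisticRegression(max_iter=5000)   # no stepwise selection here
model.fit(X, y)

pred  = model.predict(X)                    # corresponds to a '$L-' column
proba = model.predict_proba(X)[:, 1]        # corresponds to a '$LP-' column

accuracy = accuracy_score(y, pred)
auc  = roc_auc_score(y, proba)
gini = 2 * auc - 1                          # Gini as shown by the Analysis node
print(accuracy, auc, gini)
```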
8.3.3 Optional: Model Types and Variable Interactions
There are three possible model types for a logistic regression. Two of these are: “Main Effects”, where the input variables are treated individually and independently with no interactions between them, and “Full Factorial”, where all possible dependencies between the predictors are considered for the modeling. In the latter case, the model describes more complex data dependencies and itself becomes more difficult to interpret. Since there is an increase of terms in the equation, the calculations may take much longer, and the resulting model is likely to suffer from overfitting. With the third model type, i.e., “Custom” (see Fig. 8.60), we are able to manually define variable dependencies that should be considered in the modeling process. We can declare these in the Model tab of the Logistic node, in the bottom field, as framed in Fig. 8.60.
Fig. 8.59 Graph of the ROC curve of the logistic regression model on the Wisconsin breast cancer data
1. To add a new variable combination, which should be included in the model, we select "Custom" as the model type and click on the button to the right. See the arrow in Fig. 8.60.
2. A new window pops up where the terms that should be added can be selected (see Fig. 8.61). There are five different possibilities to subjoin a term: Single interaction, Main effects, All 2-way interactions, All 3-way interactions, and All 4-way interactions. Their properties are described below for the example where the variables A, B, and C are chosen:
• Single interaction: a term is added that is the product of all selected variables. Thus, the term A*B*C is included in the model for the considered example.
• Main effects: each variable is added individually to the model, hence, A, B, and C separately, for the considered example.
• All *-way interactions: all possible products of * variables, where * stands for 2, 3, or 4, are inserted. In the case of "All 2-way" interactions, for example, this means that the terms A*B, A*C, and B*C are added to the logistic regression model (see the sketch after this step list).
3. We choose one of the five term types and mark the relevant variables for the term we want to add, by clicking on them in the field below. In the Preview field, the selected terms appear, and by clicking on the Insert button, these terms are included in the model. See Fig. 8.62 for an example of "All 2-way" interactions. The window closes, and we are back in the options view of the Logistic node.
Fig. 8.60 Model type specification area in the Logistic node
4. The previous steps have to be repeated until every necessary term is added to the model.
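To make the notion of interaction terms concrete, here is a small Python sketch that builds the 2-way interaction products for three toy predictors A, B, and C. The data and names are only for illustration and are not related to any dataset in this book:

```python
import numpy as np
from itertools import combinations

# toy data with three predictors A, B, C (5 records each)
rng = np.random.default_rng(0)
data = {"A": rng.normal(size=5), "B": rng.normal(size=5), "C": rng.normal(size=5)}

# all 2-way interactions: A*B, A*C, B*C
interactions = {f"{u}*{v}": data[u] * data[v] for u, v in combinations(data, 2)}
print(list(interactions))            # ['A*B', 'A*C', 'B*C']

# a single (3-way) interaction term A*B*C
single = data["A"] * data["B"] * data["C"]
```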
8.3.4 Final Model and Its Goodness of Fit
The estimated coefficients and the parameters describing the goodness of fit can be viewed by double-clicking on the model nugget.
Fig. 8.61 Variable interaction selection window
Model Equation and Predictor Importance
The first window that opens shows the regression equations, that is, the log-odds equations with the estimated coefficients of the input variables, on the left-hand side (see Fig. 8.63). In our example, there are two equations, one for each of the two target categories. If the target variable has more possible values, then the Modeler estimates more equations and displays them on the left-hand side, one for each possible outcome. We note that, for the base category, no equation is estimated, since its probability is determined by the probabilities of all other categories. As described in Sect. 8.3.1, logistic regression will therefore only estimate coefficients and a regression equation for the non-base categories. In our example, the regression equation is the following:
\[
\log\left(\frac{P(Y = 4 \mid x)}{P(Y = 2 \mid x)}\right) = 0.5387 \cdot \text{Clump thickness} + 0.3469 \cdot \text{Cell shape} + \ldots + 0.5339 \cdot \text{Mitosis} - 9.691,
\]
with x being a record of input values. In the right-hand field, the predictor importance of the input variables is displayed. The predictor importance gives the same description as it would with
Fig. 8.62 Insert new terms to the model. Here, 2-way interactions of three selected variables
linear regression; that is, the relative effect of the variables on the prediction of the observation in the estimated model. The length of the bars indicates the importance of the variables in the regression model; the importance values add up to 1. With the slider at the bottom, the displayed variables can be restricted. For more information on predictor importance, we refer to the "Multiple Linear Regression" Sect. 5.3.3, and for a complete description of how these values are calculated, see IBM (2019a).

Estimated Coefficients and Their Significance Level
The Advanced tab contains a detailed description of the modeling process, e.g., the variable selection process, and criteria for assessing the goodness of fit of the regression model. Additionally, this tab contains the estimated coefficients of the final model and their significance level, together with further information, such as the confidence interval. These latter values are summarized in a table at the bottom of the Advanced tab, which is shown in Fig. 8.64. The first column on the left shows the category for which the probability is estimated (in this instance, it is 4). The rest of the table is built as follows: the columns are dedicated to the input variables considered in the model, and the rows describe the various statistical parameters of the variables. The estimated coefficients are in row "B", and the significance level is in row "Sig.". See Fig. 8.64. Here, most
Fig. 8.63 Model equations and predictor importance displayed in the Model nugget
Fig. 8.64 Estimated model parameters and significance level
of the coefficients are significant at the 5% level, except the last two variables, 'Normal nucleoli' and 'Mitosis', which have significance levels of about 0.08 resp. 0.09. We would further point to the row "Exp(B)". These values give the factors by which the odds change if the variable increases by 1. For example, the odds increase by a factor of
Fig. 8.65 Case summary statistics of logistic regression in the Wisconsin breast cancer data
Case Processing Summary

                        N     Marginal Percentage
  class    2            458   65,5%
           4            241   34,5%
  Valid                 699   100,0%
  Missing               0
  Total                 699
  Subpopulation         463 (a)

a. The dependent variable has only one value observed in 463 (100,0%) subpopulations.
1.714 if 'Clump thickness' is one unit higher. The equation to estimate the odds change therefore is

\[
\frac{P(Y = 4 \mid x)}{P(Y = 2 \mid x)} = \exp(-9.591) \cdot 1.714^{x_{i1}} \cdot 1.415^{x_{i2}} \cdot \ldots \cdot 1.706^{x_{ip}}.
\]

Summary of Variable Selection Process and Model-Fitting Criteria
At the top of the Advanced tab, we can find a summary of the data processed when building the model. This can be seen in Fig. 8.65, where the categorical variables are displayed with the number of sample records that belong to the diverse categories of these variables. Here, only the target variable 'class' is of the categorical type, and we see that class 2 has 458 samples, and the subset with 4 as the target value has 241 samples. All these records are valid, that means they have no missing data and can be used to build a regression model. Hence, the model considers 699 observations. The "Subpopulation" indicates the number of different combinations of input variables that are seen by the model. Here, 463 different combinations of the predictors are in the training dataset. Note that this implies that only 463 different probability predictions are made for the training data. This is commented on by the footnote, i.e., all values (100%) predicted by the model for the training data can be generated with a subpopulation of 463 data records.
In the table in Fig. 8.66, a summary of the variable selection procedure is shown. For each step of this process, we can see the variables that are removed in the particular step; recall that we use the backwards stepwise method. For each of these variables, the model-fitting criteria value is displayed, together with the test statistics that were determined in order to evaluate if the variable should be contained in the model or removed. Here, the variables 'Cell size' and 'Single epithelial cell size' were removed, based on the −2 log-likelihood model-fitting criterion and the likelihood ratio test (see Sect. 8.3.1). Additional information can be found in Azzalini and Scarpa (2012) and Fahrmeir (2013).
Step Summary

                                                    Model Fitting Criteria   Effect Selection Tests
  Step    Model  Action   Effect(s)                 −2 Log Likelihood        Chi-Square (b)   df   Sig.
  Step 0  0      Entered  (all effects) (a)         112,177                  . (c)            .    .
  Step 1  1      Removed  Cell size                 112,181                  ,004             1    ,950
          2      Removed  Single epithelial
                          cell size                 112,317                  ,136             1    ,712

Stepwise Method: Backward Stepwise
a. This model contains all effects specified or implied in the MODEL subcommand.
b. The chi-square for entry is based on the score test.
c. The chi-square for removal is based on the likelihood ratio test.
Fig. 8.66 Model-finding summary of logistic regression with backwards variable selection method
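The chi-square entries in the step summary are simply the differences of consecutive −2 log-likelihood values. As a small verification sketch (the values are taken from Fig. 8.66, written with decimal points instead of the decimal commas shown in the SPSS output):

```python
from scipy import stats

# -2 log-likelihood values of the models in steps 0, 1, and 2
neg2ll = [112.177, 112.181, 112.317]

for old, new in zip(neg2ll, neg2ll[1:]):
    chi_square = new - old                  # removing a variable increases -2LL slightly
    p_value = stats.chi2.sf(chi_square, 1)  # df = 1, one variable removed per step
    print(round(chi_square, 3), round(p_value, 3))
# prints 0.004 0.95 and 0.136 0.712 -- the values shown in the step summary
```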
A comparison of the final model with the baseline model, i.e., the model with only the intercept, is shown in the model-fitting information table, also contained in the Advanced tab. See the first table in Fig. 8.67. Here, the statistics of the likelihood ratio test are shown, including the significance level. In our case, the final model predicts the target variable significantly better than the baseline model. Beneath this overview, further model-fitting criteria values are listed, the pseudo R2 values. These comprise the Cox and Snell, Nagelkerke, and McFadden R2 values. These evaluation measures are described in the theory section (Sect. 8.3.1), and we refer to the references given there for additional information. Here, all R2 values indicate that the regression model describes the data well. See Sect. 8.3.1.

Output Setting
In the Settings tab, we can specify which additional values, besides the predicted category, should be returned by the model (see Fig. 8.68). These can be "Confidences", which is the probability of the predicted target category ($L-class), or the "Raw propensity score". The latter option is only available for a target variable of type flag, and it adds a column to the dataset that contains the probability of the occurrence of the non-base category, which is labeled '$LPR-class'. For information on propensity scores, we refer to IBM (2019b). Alongside these options, all probabilities can also be appended to the output, which include the predicted probabilities of the individual categories of the target variable, here, '$LP-2' and '$LP-4'. The output of the configuration in Fig. 8.68 can be viewed in Fig. 8.56, and we refer to Sect. 8.3.2 for a description of the added prediction columns and their naming convention.
Model Fitting Information

                     Model Fitting Criteria   Likelihood Ratio Tests
  Model              −2 Log Likelihood        Chi-Square   df   Sig.
  Intercept Only     2292,823
  Final              1379,425                 913,398      4    ,000

Pseudo R-Square

  Cox and Snell   ,419
  Nagelkerke      ,563
  McFadden        ,398
Fig. 8.67 Model-fitting criteria and values
Fig. 8.68 Output definition of the model, when used for prediction
Fig. 8.69 Prediction for new data with a logistic regression model nugget
8.3.5 Classification of Unknown Values
Predicting classes for new data records, i.e., applying a logistic regression model to an unknown dataset, is done just like linear regression modeling (see Sect. 5.2.5). We copy the model nugget and paste it into the modeler canvas. Then, we connect it to the Source node with the imported data that we want to classify. Finally, we add an Output node to the stream, e.g., a Table node, and run it to obtain the predicted classes. The complete prediction stream should look like Fig. 8.69.
8.3.6 Cross-Validation of the Model
Cross-validation is a standard concept for validating the goodness of fit of the model when processing new and unknown data (see Sect. 8.2.1). More precisely, a model might describe the data it is based on very well, but be unable to predict the correct categories for unknown data, which are independent of the model. This is a classic case of overfitting (see Sects. 8.2.4 and 5.1.2), and cross-validation is needed to detect this phenomenon. If the test data are in a separate data file, then cross-validation can be performed by classifying unknown values as described in Sect. 8.3.5, but instead of using a Table node for output, we should use the Analysis node to get the hit counts and evaluation statistics. If our initial dataset is large enough, however, such that it can be divided into a training set and a test set, we can include the cross-validation in the model building process. Therefore, only the Partition node has to be added to the stream in Sect. 8.3.2.

Description of the model
Stream name: Logistic regression cross_valildation
Based on dataset: WisconsinBreastCancerData.csv (see Sect. 12.1.40)
Stream structure: (stream diagram)
Related exercises: 1, 2
1. We consider the stream for the logistic regression as described in Sect. 8.3.2 and add a Partition node to the stream in order to split the dataset into a training set and a test set. This can, for example, be placed before the Type node. See Sect. 2.7.7 for a detailed description of the Partition node. We recommend using 70–80% of the data for the training set and the rest for the test set. This is a typical partition ratio. Additionally, let us point out that the model and the hit rates depend on the randomly selected training data. To get the same model in every run, we fix the seed of the random number generator in the Partition node. See Fig. 8.70 for the settings of the Partition node in this case.
2. In the Logistic node, we have to select the field, here 'Partition', that indicates the affiliation to the training set or test set. This is done in the Fields tab (see Fig. 8.71).
3. Afterwards, we mark the "Use partitioned data" option in the Model tab (see Fig. 8.72). Now, the Modeler builds the model only on the training data and uses the test data for cross-validation. All other options can be chosen as in a typical model building procedure and are described in Sect. 8.3.2.
4. After running the stream, the summary and goodness of fit information can be viewed in the model nugget that now appears. These are the same as in Sect. 8.3.4, despite the fact that fewer data were used in the modeling procedure. See Fig. 8.73 for a summary of the data used. Here, we see that the total number of samples has been reduced to 491, which is 70% of the whole Wisconsin breast cancer dataset. Since fewer data are used to train the model, all parameters, the number of included variables, and the fitting criteria values change. See Fig. 8.74 for the parameters of this model. We note that compared with the model built on the whole dataset (see Sect. 8.3.4), this model, using a subset of 70% for training, contains only 6 rather than 7 predictor variables.
Fig. 8.70 Settings in the Partition node for a training and test set cross-validation scenario
5. In the Analysis node, we mark the “Separate by partition” option to calculate the evaluation measures for each partition separately, see Fig. 8.75. This option is enabled by default. The output of the Analysis node can be viewed in Fig. 8.16, which shows the hit rates and further statistics for both the training set and the test set. In our example of the Wisconsin breast cancer data, the regression model classifies the data records in both sets very well, e.g., with an accuracy of over 97% and a Gini of over 0.98. This confirms that the model is not overfitting, and that it can be deployed to production mode. We refer to Sect. 8.2.6 for details on the Analysis node and the evaluation measures it provides. 6. In the Evaluation node, we also enable the “Separate by partition” option, so that the node draws an ROC curve for the training set, as well as the test set, separately (see Fig. 8.76). These plots can be viewed in Fig. 8.77, and their shape reinforces the conclusion of non-overfitting.
Fig. 8.71 Definition of the partitioning field in the Logistic node
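Outside the Modeler, the same 70/30 partition with a fixed random seed and a separate evaluation per partition could be sketched as follows. Scikit-learn's bundled breast cancer data is again used as a stand-in for the CSV file, so the exact numbers will differ from the figures above:

```python
# Rough analogue of the Partition node (70/30, fixed seed) and the
# "Separate by partition" evaluation of accuracy and Gini.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)   # fixed seed -> reproducible split

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

for name, Xp, yp in [("training", X_train, y_train), ("testing", X_test, y_test)]:
    acc  = accuracy_score(yp, model.predict(Xp))
    gini = 2 * roc_auc_score(yp, model.predict_proba(Xp)[:, 1]) - 1
    print(name, round(acc, 3), round(gini, 3))  # similar values -> no overfitting
```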
8.3.7 Exercises
Exercise 1: Prediction of Credit Ratings
Banks use demographic and historical loan data to decide if they will offer credit to a customer. Use the tree_credit.sav data, which contains such information, to predict a "good" or "bad" credit rating for a customer (see Sect. 12.1.38).
1. Import the data and prepare a training-testing situation.
2. Build a logistic regression model, with a stepwise variable selection mechanism, to predict the credit rating.
3. What are the variables included in the model, and which one of them has the most importance? Determine the logistic regression equation for the log-odds.
4. Calculate the performance evaluation measures accuracy and Gini index for both the training set and the test set. Is the model overfitting?
5. Visualize the Gini measure with the ROC curve.
6. A sales manager wants to offer a credit to 60% of his customers with a "Good" credit rating. How much of the total customer base does he or she have to contact in
Fig. 8.72 Enabling of the use of the partitioned data option in the Logistic node
Case Processing Summary

                        N     Marginal Percentage
  class    2            344   70,1%
           4            147   29,9%
  Valid                 491   100,0%
  Missing               0
  Total                 491
  Subpopulation         310 (a)

a. The dependent variable has only one value observed in 310 (100,0%) subpopulations.
Fig. 8.73 Summary and size of the training data used in the modeling process
Fig. 8.74 Model parameters of the model built on a subset of the original data
Fig. 8.75 Settings in the Analysis node with “Separate by partition” selected
order to achieve that goal? Which evaluation graph do you consult to answer the question of the sales manager? How many customers with a "Bad" credit rating are falsely contacted by the sales manager?
Fig. 8.76 Settings in the Evaluation node. The “Separate by partition” option is enabled, to treat the training data and test data separately
Exercise 2: Prediction of Titanic Survival and Dealing with Missing Data
The titanic.xlsx file contains data on Titanic passengers, including an indicator variable "survived", which indicates if a particular passenger survived the Titanic sinking (see Sect. 12.1.37). Your task in this exercise is to build a model with cross-validation, which decides from the ticket and personal information of each passenger whether they survived the Titanic tragedy.
1. Import and inspect the Titanic data. How many values are missing from the dataset and in which variable fields?
2. Build a logistic regression model to predict if a passenger survived the Titanic tragedy. What is the accuracy of the model and how are the records with missing values handled?
Fig. 8.77 Output of the Evaluation node for the logistic regression of the Wisconsin breast cancer data. The ROC curves are plotted for the training set and test set separately
3. To predict an outcome for the passengers with missing values, the blanks have to be replaced with appropriate values. Use the Auto Data Prep node to fill in the missing data. Which values are inserted in the missing fields?
4. Build a second logistic regression model on the data without missing values. Has the accuracy improved?
5. Compare the models with and without missing values, by calculating the Gini measures and plotting the ROC curves. Use the Merge node to show all measures in one Analysis node output and draw the ROC curves of both models in one graph.

Exercise 3: Multiple Choice Questions
1. Question: Let the coefficient of the input variable x be β = 2. What is the factor with which the odds change when x is increased by 1?
   Possible answers: □ 2  □ 0.5  □ exp(2)  □ exp(0.5)  □ log(2)  □ log(0.5)
2. Question: Which of the following are valid variable selection methods for logistic regression?
   Possible answers: □ Stepwise  □ Information criterion (AICC)  □ Backwards  □ Forwards
3. Question: What are properties of a logistic regression model?
   Possible answers: □ It is a linear classifier  □ Has a probabilistic interpretation  □ No problems with collinearity  □ It is a black box model, which means no interpretation of effects is possible
4. Question: Consider a multinomial logistic regression with 3 target categories A, B, C, and C as base class. Let 0.34 be the odds of class A and 0.22 of class B. Which are the correct probabilities of the target categories (pA, pB, pC)?
   Possible answers: □ pA = 0.22  □ pA = 0.14  □ pB = 0.22  □ pB = 0.64  □ pC = 0.64  □ pC = 0.22

The following questions are yes or no questions. Please mark the correct answer (Yes □ / No □).
5. Logistic regression is a special case of a generalized linear model (GLM).
6. The logistic regression outputs the target class directly.
7. Logistic regression is a nonparametric classifier.
8. If the target variable has K categories, then the multinomial logistic regression model consists of K − 1 single binary models.
8.3.8 Solutions
Exercise 1: Prediction of Credit Ratings
Name of the solution stream: tree_credit_logistic_regression
Theory discussed in section: Sect. 8.2, Sect. 8.3.1, Sect. 8.3.6
The final stream for this exercise is shown in Fig. 8.78. 1. We start by opening the stream “000 Template-Stream tree_credit”, which imports the tree_credit data and already has a Type node attached to it, and save it under a different name (see Fig. 8.79).
Fig. 8.78 Stream of the credit rating via logistic regression exercise
Fig. 8.79 Template stream for the tree_credit data
To set up a cross-validation with training data and testing data, we add a Partition node to the stream and place it between the Source node and Type node. Then, we open the node and define 70% of the data as training data and the remaining as test data. See Sect. 2.7.7 for the description of the Partition node. 2. We start the modeling process by opening the Type node and defining the variable “Credit rating” as the target variable. Furthermore, to calculate the performance measures and in particular the Gini at the end, we set the measurement type of the target as “Flag”. See Fig. 8.80.
Fig. 8.80 Definition of the variable “Credit rating” as the target variable and on type Flag in the Type node
Now, we add a Logistic node to the stream, by connecting it to the Type node. We observe that the name of the node is “Credit rating”, which shows us that the node has identified this variable as its target automatically. We open the Logistic node and choose the “Stepwise” variable selection method in the Model tab. Furthermore, to enable the cross-validation process, we make sure the “Use partitioned data” option is activated (see Fig. 8.81). Since the input variables’ importance is needed for the next task, we check the box “Calculate predictor importance” in the Analysis tab (see Fig. 8.82). Now, we run the stream and the model nugget appears. 3. To inspect the trained model and identify the included variables, we open the model nugget. In the Model view, we see that the three variables, “Income level”, “Number of credit cards”, and “Age” are included in the model. Of these variables, the income level is most important for the prediction. See the righthand side of Fig. 8.83. On the left-hand side of Fig. 8.83, we identify the regression equation in terms of the log-odds. The equation is, in our case,
Fig. 8.81 Definition of the stepwise variable selection method and cross-validation process
\[
\log\left(\frac{P(\mathrm{CR} = \text{Good} \mid x)}{P(\mathrm{CR} = \text{Bad} \mid x)}\right)
= -1.696 + 0.1067 \cdot \text{Age}
+ \begin{cases} 1.787, & \text{IL} = \text{High} \\ 0, & \text{IL} = \text{Medium} \\ -1.777, & \text{IL} = \text{Low} \end{cases}
+ \begin{cases} -2.124, & \text{NCC} = \text{5 or more} \\ 0, & \text{NCC} = \text{Less than 5}, \end{cases}
\]
where CR represents the "Credit rating", IL the "Income level", and NCC the "Number of credit cards". The predictor variables included in the model are significant. This can be viewed in the table at the bottom of the Advanced tab (see Fig. 8.84).
4. To calculate the performance measures and to evaluate our model, we add an Analysis node to the stream and choose the coincidence matrix and AUC/Gini
Fig. 8.82 Enabling of predictor importance calculations
Fig. 8.83 Predictor importance and the regression equation for the credit rating exercise
options. See Sect. 8.2.6 for a description of the Analysis node. We run the stream. See Fig. 8.85 for the output of the Analysis node and the evaluation statistics. We note that the accuracy in both the training set and the testing set is nearly the same at a bit over 80%. Furthermore, the Gini and AUC are close to each other in both cases at about 0.78. All these statistics suggest good prediction accuracy and a non-overfitting model. 5. We visualize the ROC curves by adding an Evaluation node to the stream and connecting it to the model nugget. We open it and choose the ROC chart option,
Fig. 8.84 Significance of the predictor variables in the credit rating model
Fig. 8.85 Evaluation statistics of the logistic regression model for predicting credit rating
Fig. 8.86 ROC curves of the credit card rating classifier
as shown in Fig. 8.58. Figure 8.86 then shows the ROC curves of the training and test dataset. Both curves are similarly shaped which strengthens the ability of the model to perform on unknown data as well as on the training data. 6. To answer the question of the sales manager, we choose the gain chart. So, we add a second Evaluation node to the model nugget, open it, and pick “Gains” as our chart option. Furthermore, we mark the box “Include best line”. See Fig. 8.87. Then, we click the “Run” button at the bottom. The output graphs are displayed in Fig. 8.88. We look at the graph of the test set since this represents the performance of the model more adequately, as it comprises unseen data. We see that the model cumulatively gained 60% of the customers with a “Good” rating in the best 40% scored customers. This is illustrated by the right vertical line, which was added in the right graph. Hence, the sales manager has to (approximately) contact 40% of his or her customer base in order to talk to 60% of the customers having a “Good” rating. To calculate the percentage of customers with a “Bad” credit rating who are falsely contacted, we look at the gain chart drawn to represent the best model, i.e., the line at the top of the chart. We observe that in this graph, the 60% gain is already reached with approximately the top 37% of the customers (see left vertical line added to the right graph). So, we conclude that about 3% of the customers who are contacted by the sales manager have a “Bad” credit rating.
Fig. 8.87 Option to draw a Gains chart in the Evaluation node
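The regression equation found in step 3 can also be evaluated manually. The following sketch computes the probability of a "Good" rating for one hypothetical customer, using the coefficients and dummy coding as given in the reconstructed equation above (the customer values are made up for illustration):

```python
import math

def good_rating_probability(age, income_level, five_or_more_cards):
    # log-odds as estimated in step 3 of this exercise
    log_odds = -1.696 + 0.1067 * age
    log_odds += {"High": 1.787, "Medium": 0.0, "Low": -1.777}[income_level]
    log_odds += -2.124 if five_or_more_cards else 0.0   # "5 or more" credit cards
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

# hypothetical customer: 35 years old, medium income, fewer than 5 credit cards
print(round(good_rating_probability(35, "Medium", False), 3))
```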
Exercise 2: Prediction of Titanic Survival and Dealing with Missing Data
Name of the solution stream: Titanic_missing_data_logistic_regression
Theory discussed in section: Sect. 8.2, Sect. 8.3.1, Sect. 8.3.6, Sect. 3.2.5 (SuperNode)
The final stream for this exercise is shown in Fig. 8.89. Here, the two models that are built in this exercise are compressed into SuperNodes (see Sect. 3.2.5), so the stream is more structured and clear. 1. We open the “017 Template-Stream_Titanic” and save it under a different name. This stream already contains the Partition node, which splits the data into a training set (70%) and a test set (30%). See Fig. 8.90.
Fig. 8.88 Gain charts of the credit card rating classifier
Fig. 8.89 Stream of the Titanic survival prediction exercise
We run the stream, and the data screening output of the Data Audit node appears. In the Quality tab, we see that missing values appear in the fields, “Age”, “Fare”, “Cabin”, and “Embarked”. See ‘%Complete’ column in Fig. 8.91. The missing values in the “Cabin” category are, however, treated as empty strings and not as no entry at all. So these are not really missing values. The variables “Fare” and “Embarked” furthermore have only very few missing values, but the “Age” field is only 80% filled. This latter variable can become problematic in our further analysis. 2. We now build the logistic regression classifier on the data with missing values. Figure 8.92 shows the complete part of this stream. This is the part that is wrapped up in a SuperNode named “with missing data”.
Fig. 8.90 Template stream of the Titanic dataset
Fig. 8.91 Quality output with missing value inspection of the Titanic data in the Data Audit node
Fig. 8.92 Stream of the logistic regression classifier with missing values in the Titanic data
First, we open the Type node and declare the “Survived” variable as the target field and set the measurement of it to “Flag”. This will ensure that the “Survived” variable is automatically selected as the target variable in the Modeling node. See Fig. 8.93. Then, we add a Logistic node to the stream and connect it to the Type
Fig. 8.93 Definition of the “Survived” field as the target and its measurement as “Flag”
node. After opening it, we specify “Stepwise” as our variable selection method and enable the use of partitioned data and the predictor importance calculations. See Figs. 8.81 and 8.82 for the setup of these options in the Logistic node. Now, we run the stream and the model nugget appears, which we then open. In the Model tab (see Fig. 8.94), we note that the variables “Sex”, “Pclass”, “Embarked”, and “Age” are selected as predictors by the stepwise algorithm. The most important of them is the variable “Sex”, and we see by looking at the regression equation on the left that women had a much higher probability of surviving the Titanic sinking, as their coefficient is 2.377 and the men’s coefficient is 0. Furthermore, the variables “Age” and “Embarked”, which contain missing values, are also included in the classifier. To evaluate the model, we add the usual Analysis node to the model nugget, to calculate the standard goodness of fit measures. These are the coincidence matrix, AUC and Gini. See Sect. 8.2.6. After running the stream, these statistics can be obtained in the opening window. See Fig. 8.95. We observe that accuracy in the training set and test set is similar (about 63%). The same holds true for the Gini (0.53). This is a good indicator for non-overfitting. In the coincidence matrix, however, we see a column named ‘$null$’. These are the passenger records with missing values. All these records are non-classifiable, because a person’s age, for
Fig. 8.94 Logistic regression equation and the importance of the included variables for predicting the survival of a passenger on the Titanic
example, is necessary for any prediction. In this case, the model assigns no class ('$null$'), which leads to misclassification for all these records. This obviously reduces the goodness of fit of our model. So, we have to assign a value to the missing fields, in order to predict the survival status properly. To distinguish between predictions of the model with missing values and the one built thereafter without missing values, we change the names of the prediction fields with a Filter node. To do this, we add a Filter node to the model nugget and rename these fields, as seen in Fig. 8.96.
3. Now, we use the Auto Data Prep node to replace the missing values and predict their outcome properly. Figure 8.97 shows this part of the stream, which is wrapped in a SuperNode named "mean replacement". We add an Auto Data Prep node to the Type node and open it. In the Settings tab, the top box in the "Prepare Inputs & Target" options should be marked in order to perform data preparation. Then, we tick the bottom three boxes under "Inputs", so that the missing values in continuous, nominal, and ordinal fields are replaced with appropriate values. Since the variables "Age" and "Fare" are continuous, the missing values are replaced by the mean of the field's values, and the missing "Embarked" values are replaced by the field's mode. See Fig. 8.98. Furthermore, we enable the standardization of continuous fields (see the bottom arrow in Fig. 8.98). This is not obligatory, but recommended, as it typically improves the prediction, or at least does not worsen it.
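As a brief aside, the same mean/mode replacement and standardization can be mimicked outside the Modeler. The following pandas/scikit-learn sketch uses a tiny made-up frame standing in for the Titanic fields "Age" and "Embarked", not the actual titanic.xlsx file:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# tiny made-up frame standing in for the Titanic fields "Age" and "Embarked"
df = pd.DataFrame({"Age": [22.0, np.nan, 40.0, 35.0],
                   "Embarked": ["S", "C", np.nan, "S"]})

# continuous field: replace missing values with the mean, then standardize
age = SimpleImputer(strategy="mean").fit_transform(df[["Age"]])
df["Age"] = StandardScaler().fit_transform(age).ravel()

# nominal field: replace missing values with the mode (most frequent value)
df["Embarked"] = SimpleImputer(strategy="most_frequent").fit_transform(
    df[["Embarked"]]).ravel()
print(df)
```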
Fig. 8.95 Performance measures of the Titanic classifier with missing values
After finishing these settings, we click the Analyze Data button at the top, which starts an analysis process on the data-prepared fields. In the Analysis tab, the report of the analyzed data can be viewed. We see, e.g., for the “Age” variable, that in total 263 missing values were replaced by the mean, which is 29.88. Moreover, the age variable was standardized and now follows a normal distribution, as can be seen in the top right histogram. See Fig. 8.99. 4. To build a classifier on the prepared data, we add another Type node to the stream, to ensure that the Model node recognizes the input data correctly. Afterwards, we add a Logistic node to the stream, connect it to the Type node and choose the same options as in the first Logistic node. See step 2 for details. Afterwards, we run the stream and the Model nugget appears. We note that “Sex”, “Pclass”, and “Age” are still included as input variables in the model, with nearly the same prediction importance. The Embarked variable is, however, now replaced by “sibsp”. This is due to the more powerful “Age”
Fig. 8.96 Renaming the prediction fields for the model with missing values
Fig. 8.97 Stream of the logistic regression classifier and missing value data preparation in the Titanic data
variable, which is now able to influence more passenger records, as it now contains complete data. See Fig. 8.100. To view the Gini and accuracy, we add an Analysis node to the stream and set up the usual options (see step 2 and Sect. 8.2.6). The output of the Analysis node can be viewed in Fig. 8.101. We see that no missing values are left in the data, and thus, all passengers can be classified with higher accuracy. Consequently, both
Fig. 8.98 Auto Data Prep node to replace missing values
the accuracy and Gini have improved, when compared to the model with missing values (see Fig. 8.95). So, the second model has a higher prediction power than the one that ignores the missing values. 5. As in the previous model, we add a Filter node to the stream and connect it to the model nugget; this Filter node allows us to change the name of the prediction outcome variables by adding “mean repl” as a suffix. See Fig. 8.102. Now, we add a Merge node to the stream canvas and connect the two Filter nodes to it. In the Filter tab of the Merge node, we cross out every duplicate field, to eliminate conflicts during the merging process. See Fig. 8.103. We add an Evaluation node and an Analysis node to the stream and connect them to the Merge node. See Fig. 8.104. The settings of these nodes are as usual. See Sect. 8.2.6 for a description of the Analysis node and Sect. 8.2.7 for information on the Evaluation node. In the Evaluation node, we choose ROC as the chart type, see Fig. 8.58. In Fig. 8.105, the ROC curves of the model are shown, with and without missing values considered. As can be seen, the ROC curve of the model without missing values lies noticeably above the ROC curve of the classifier ignoring missing values. Hence, it has a better prediction power.
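For readers working outside the Modeler, drawing the ROC curves of two models in one graph could be sketched roughly as follows. The labels and predicted probabilities below are random placeholder data, to be replaced by the actual values of the two classifiers:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# placeholder data: replace with the actual labels and predicted probabilities
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)
proba_model_1 = np.clip(y_true * 0.4 + rng.random(200) * 0.6, 0, 1)  # better model
proba_model_2 = rng.random(200)                                      # weaker model

for name, proba in [("without missing values", proba_model_1),
                    ("with missing values", proba_model_2)]:
    fpr, tpr, _ = roc_curve(y_true, proba)
    plt.plot(fpr, tpr, label=name)

plt.plot([0, 1], [0, 1], linestyle="--")   # diagonal = random prediction
plt.xlabel("False positive rate"); plt.ylabel("True positive rate")
plt.legend(); plt.show()
```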
Fig. 8.99 Analysis of the “Age” field after data preparation by the Auto Data Prep. node
In the Analysis output, the performance measures of both models are stated in one output. Additionally, the predictions of both models are compared with each other. This part of the Analysis output is displayed in Fig. 8.106, while the other standard measures are hidden. To see these, one has to click on the plus sign of the particular tree node. The first table in Fig. 8.106 shows the percentage of identically predicted classes; in other words, it shows how many passengers are classified as survivor or non-survivor, respectively, by both classifiers. The second table then takes the identically classified records (passengers) and compares their prediction with the actual category. Here, about 80% of the identically classified passengers are predicted correctly.
Exercise 3: Multiple Choice Questions
Theory discussed in section: Sect. 8.3.1
1. The odds change by a factor of exp(2). 2. The variable selection methods are Forward, Stepwise, and Backward. The AICC is a criterion that compares models with each other and is thus involved in the variable selection, but it is not a selection method itself. 3. The correct answers are: • It is a linear classifier. • It has a probabilistic interpretation.
Fig. 8.100 Input variables and their importance for predicting the survival of a passenger on the Titanic after data preparation
If the variables are highly correlated, the logistic regression can lose prediction performance. Furthermore, it is a so-called white-box model, which means the effects of the input variables on the model outcome are easy to interpret. 4. The correct probabilities are pA = 0.22, pB = 0.14, and pC = 0.64. The calculations are as follows. Starting with the odds equations pA/pC = 0.34 and pB/pC = 0.22, it follows that pA = 0.34 · pC and pB = 0.22 · pC. Since pA + pB + pC = 1, we get 1 = 0.34 · pC + 0.22 · pC + pC = 1.56 · pC, and finally
Fig. 8.101 Performance measures of the Titanic classifier without missing values
pC = 1/1.56 ≈ 0.64. Substituting this value of pC into the odds equations, we get pA ≈ 0.22 and pB ≈ 0.14.
5. Yes. The Logistic regression is a GLM. See Sect. 5.4. 6. No. The logistic regression calculates a probability for each target class instead of predicting the class directly. The predicted class is then the category with the highest predicted probability. 7. No. Logistic regression falls in the category of parametric classifiers, as coefficients have to be estimated which then define the model. 8. Yes. For each non-base target category, a binary model is calculated versus the base category.
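To make answer 8 concrete, the baseline-category form of the multinomial logistic model can be written down explicitly; this is the standard textbook formulation, not a quotation of the Modeler's documentation. With base category K,
\[
\ln\frac{P(Y=k \mid x)}{P(Y=K \mid x)} = \beta_{k0} + \beta_k^{T}x, \qquad k = 1,\dots,K-1,
\]
i.e., one binary (logit) comparison against the base category for each non-base category, and consequently
\[
P(Y=k \mid x) = \frac{\exp(\beta_{k0} + \beta_k^{T}x)}{1 + \sum_{j=1}^{K-1}\exp(\beta_{j0} + \beta_j^{T}x)}, \qquad
P(Y=K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1}\exp(\beta_{j0} + \beta_j^{T}x)}.
\]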
Fig. 8.102 Renaming of the prediction fields for the model without missing values
8.4 Linear Discriminant Classification
Linear discriminant analysis (LDA) is one of the oldest classifiers and goes back to Fisher (1936) and Welch (1939) and their work in biology. It is still one of the best-known and most widely used classifiers. For example, the linear discriminant classifier (LDC) is very popular with banks, since it works well in the area of credit scoring. When used correctly, the LDC provides great accuracy and robustness. The LDC follows a linear approach and tries to find linear functions that describe an optimal separation of the target classes. The theory of the LDC is described in more detail in the subsequent section.
8.4.1 Theory
The idea of the LDC goes back to Fisher (1936) and Welch (1939), who developed the method independently of each other and with different approaches. Here, we describe the original method by Fisher, called Fisher's linear discriminant method. The key idea behind the method is to find linear combinations of input variables that separate the classes in an optimal way. See, e.g., Fig. 8.107. The LDC chooses a linear discriminant such that it maximizes the distance between the classes, while at the
Fig. 8.103 Elimination of duplicate fields in the Merge node
Fig. 8.104 Output nodes are connected to the Merge node
same time minimizes the variation within. In other words, a linear function is estimated that best separates the distributions of the classes from each other. This algorithm is outlined in the following binary case. For a more detailed description,
Fig. 8.105 ROC curves of the logistic regression classifiers with and without missing values
we recommend Runkler (2012), Kuhn and Johnson (2013), Duda et al. (2001) and Tuffery (2011). Consider the target variable to be binary. The method finds the optimal linear separator of the two classes in the following way: First, it calculates the mean of each class. The linear separator has to go through the midpoint between these two means. See the top left graph in Fig. 8.108. There, a cross symbolizes the midpoint, and the decision boundary has to go through this point, just like the two possibilities shown in the graph. To find the optimal discriminant function, the data points are projected onto the candidates. On the projected data, the "within classes" variance and "between classes" variance are calculated, and the linear separator with the minimal variance within the projected classes and, simultaneously, a high variance between the projected classes is picked as the decision boundary. This algorithm is visualized in Fig. 8.108. In the top right graph, the data are projected onto the two axes. As can be seen, the class distributions of the projected data overlap, and thus the classes are not clearly distinguishable. So the discriminant functions parallel to the axes are not optimal separators. In the bottom left graph, however, the data points are projected onto the dotted line. This separates the distributions better, as can be seen in the bottom right graph. Hence, the solid black line is chosen as the discriminating function.
A Comment on Necessary Conditions
The approach by Fisher (1936) does not make any assumptions about the underlying distributions. However, the model is more robust if the following conditions are fulfilled:
Fig. 8.106 Comparison of the two models in the Analysis output
Fig. 8.107 Separation of classes with linear decision boundaries
Fig. 8.108 Visualization of the linear discriminant algorithm
• Multivariate normal distribution of the features for each class of the target variable.
• Equal covariance matrices across the classes of the target variable.
In this case, the solution is optimal and can be calculated explicitly (see Duda et al. (2001)). Often, linear discriminant analysis is introduced in a slightly different way via a probabilistic approach, for which the above-mentioned distributional assumptions are postulated. We refer to Azzalini and Scarpa (2012) and James et al. (2013) for details on this probabilistic introduction of the linear discriminant analysis. Even though the solution is optimal only under these strict assumptions, the LDC has proven to be very robust to violations of these conditions, see Li et al. (2006), Sever et al. (2005), Azzalini and Scarpa (2012) and Tuffery (2011).
If the condition of equal covariance matrices is removed, we obtain a quadratic discriminant function, and the corresponding procedure is therefore called Quadratic Discriminant Analysis (QDA). Since the QDA is a nonlinear method, it is more flexible, and more complex decision boundaries can be constructed, which can lead to a better prediction performance. However, many more parameters have to be estimated, which can result in a less stable model. As a rule of thumb, a QDA is a good choice if the training set is large, so that the variance of the classifier is not of great concern, or if the equality of the covariance matrices is untenable. On the other hand, the LDC tends to be a better model than the QDA if there are few training samples or the covariance matrices are equal. We refer to Azzalini and Scarpa (2012) and James et al. (2013) for further information on the QDA. The SPSS Modeler provides the option of calculating with different covariance matrices, resulting in a model similar to a QDA. See Fig. 8.115 and IBM (2019b).
The Box's M test is a multivariate statistical test which is used to check the equality of multiple covariance matrices. This is the standard test used for the LDC in order to validate the second condition. Unfortunately, this test is very sensitive to violations of normality, which leads to a rejection of the hypothesis of equal covariance matrices in most cases. We refer to Box (1949) and Warner (2013) for further information.
Another important requirement is numeric/continuous input variables. The LDC is not able to deal with categorical variables, since the estimation of the discriminant functions involves extensive calculations on numeric data. If the data contain categorical variables, these have to be transformed into numeric dummy variables, e.g., via the Restructure node, which then, of course, violates the Gaussian distribution assumption; in such a case, another classification method can be more suitable. We further like to mention that both the LDC and the QDA require a larger sample size of the training data than the number of predictor variables. If the latter number is near the sample size, the performance of the model will decline. As a simple rule of thumb, LDC and QDA can be used on data with five times or more samples than input variables.
In order to get a more stable model, we recommend trying to meet the above-mentioned conditions. We therefore advise at least standardizing the data as far as possible or transforming the feature variables prior to training, in order to obtain roughly normally distributed variables. See Tuffery (2011) for further measures to approach the constraining assumptions.
Comparison to Logistic Regression
Compared to logistic regression, the LDC makes stricter assumptions on the data, but if these are fulfilled, it typically gives better results, particularly in accuracy and stability. For example, if the classes are well-separated, logistic regression can have surprisingly unstable parameter estimates and thus perform particularly poorly. On the other hand, logistic regression is more flexible, and thus it is still preferred in most cases. As an example, linear discriminant analysis requires numeric/continuous input variables, since the covariance, and so the distance between data points, has to be calculated. This is a drawback of the LDC when compared with logistic regression, which can also process categorical data.
Predictor Selection Methods
Besides including all input variables in the final discriminant model, the Modeler also provides the "Stepwise" selection method, in order to find the most relevant predictors for the model. See Sect. 5.3.1 for a description of the stepwise method.
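To make the difference between the pooled-covariance (linear) and the separate-covariance (quadratic) approach tangible, the following scikit-learn sketch compares LDA and QDA on synthetic two-class data; it is only a hedged illustration of the general idea on made-up data, not of the Modeler's implementation.

```python
# LDA uses one pooled covariance matrix (linear boundary);
# QDA estimates one covariance matrix per class (quadratic boundary).
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)

def make_data(cov_class1):
    X0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], 500)
    X1 = rng.multivariate_normal([2, 2], cov_class1, 500)
    return np.vstack([X0, X1]), np.array([0] * 500 + [1] * 500)

for name, cov1 in [("equal covariances", [[1, 0.3], [0.3, 1]]),
                   ("unequal covariances", [[3, -0.8], [-0.8, 0.5]])]:
    X, y = make_data(cov1)
    lda = LinearDiscriminantAnalysis().fit(X, y)
    qda = QuadraticDiscriminantAnalysis().fit(X, y)
    print(name, "- LDA:", round(lda.score(X, y), 3),
          "QDA:", round(qda.score(X, y), 3))
```

With equal covariance matrices, both models typically perform similarly; when the covariances differ clearly, the quadratic boundary of the QDA can pay off, mirroring the rule of thumb above.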
8.4.2 Building the Model with SPSS Modeler
Description of the model
Stream name: Linear discriminate analysis
Based on dataset: Iris.csv (see Sect. 12.1.22)
Stream structure:
Important additional remarks: The target variable should be categorical. If it is continuous, a (multiple) linear regression might be more appropriate for this problem (see Chap. 5). Descriptive pre-analysis should be done before building the model, to get an intuition of the distribution of the input variables, in order to decide if equal covariance matrices and multivariate normal distribution can be assumed. Cross-validation is included in the model node, and it can be performed as described in the logistic regression section of Sect. 8.3.6. Related exercises: All exercises in Sect. 8.4.4
We introduce the LDC with the same Iris data used in the original paper of Fisher (1936). The Iris dataset contains data on 150 flowers from three different Iris species, together with width and length information on their petals and sepals (see Sect. 12.1.22). We want to build a discriminant classifier that can assign the species of an Iris flower, based on its sepal and petal width and length. The discriminant node can automatically perform a cross-validation when a training set and test set are specified. We will make use of this feature. This is not, however, obligatory; if the test set comes from external data, then a cross-validation process will not be needed. 1. To build the above stream, we first open the stream “012 Template-Stream_Iris” (see Fig. 8.109) and save it under a new name. This stream already contains a Partition node that divides the Iris dataset into two parts, a training and a test set in the ratio 70–30%
Fig. 8.109 Template stream of the Iris data
2. To get a deeper insight into the Iris data, we add several Graphboard nodes to the stream by connecting them to the Type node. We use these nodes to visualize the data in a scatterplot matrix and 3-D density plots. Further descriptive analysis could be considered, however. The scatterplot matrix (Fig. 8.110) of the Iris data indicates the separability of the species through the sepal and petal width and length. Furthermore, the histograms on the diagonal suggest a multivariate normal distribution across the species, which is one of the conditions that ensure a stable model. The 3-D density plots support this assumption, as the shape nearly follows a multivariate Gaussian distribution. See Fig. 8.111 for a visualization of the sepal width and length joint distribution. 3. We add the Discriminant node to the Modeler canvas and connect it to the Partition node. After opening the former, we specify the target, partition, and input variables in the Fields tab. Thus, the 'species' variable is chosen as the target, the 'Partition' field as the partition, and all four remaining variables are chosen as input variables. See Fig. 8.112. 4. In the Model tab, we enable the usual "Use partitioned data" option, in order to ensure that the training set and the test set are treated differently in their purpose for the cross-validation technique. See Fig. 8.113. Furthermore, we choose the "Stepwise" method, to find the optimal input variables for our classification model. See the bottom arrow in Fig. 8.113. 5. In the Analyze tab, we also enable the predictor importance calculation option. See Fig. 8.114. 6. In the Expert tab, knowledge of the data and variable distributions can be defined, which will fine-tune the training process. To access these expert options, click on the 'Expert' mode. If the 'Simple' mode is retained, the default settings are in place. As a first option, we can specify if the target groups are of equal size (the default setting) or if the group sizes should be determined from the data.
Fig. 8.110 Scatterplot matrix of the Iris data
Fig. 8.111 3-D density plot of the sepal width and length joint distribution for each species
Fig. 8.112 Specification of the input, partition, and target variables
Furthermore, we can define which kind of covariance matrix is used during training. If the 'Separate-groups' option is chosen, separate covariance matrices are considered for each group, which leads to a model similar to the QDA with nonlinear discriminating functions. On the other hand, if the default 'Within-groups' option is selected, a common averaged covariance matrix is used during the calculations, which results in a linear model. See Fig. 8.115. In the Iris dataset, all species have the same sample size. Furthermore, the size of the dataset is not very large, and since the descriptive analysis indicated a possible linear separability, we stick with the default settings and a linear approach here (see Fig. 8.115). 7. Furthermore, in the Expert tab additional outputs, which will be displayed in the model nugget, can be specified. These include the parameters and statistics of the feature selection process and the estimated discriminant functions. We here select the following outputs: • Box's M: The statistics of a Box's M test will be added to the model nugget. • Function Coefficients: The estimated coefficients of Fisher's discriminant and classification functions (standardized and unstandardized) are displayed in the
Fig. 8.113 Model options for the discriminant analysis
Fig. 8.114 Predictor importance in the discriminant node
model nugget. These represent the decision boundaries and discriminant functions of the model. • Casewise results: Overview of the first samples containing the actual group, predicted group, posterior probability, and discriminant score.
Fig. 8.115 Expert tab of the Discriminant node where distribution knowledge of the data can be defined and additional output statistics can be chosen
• Territorial map: A map that illustrates the discriminant functions with their partitioning into the prediction areas for the target groups. The map is only displayed if there is more than one discriminant function. • Summary of Steps: Summary and statistics of the stepwise variable selection process are added to the model nugget. See Fig. 8.116 for the complete palette of output options. We omit here the description of every possible additional output and refer the reader to IBM (2019b) for more information. We recommend trying several of these options, in order to inspect the information they add to the description of the model. 8. Now we run the stream, and the model nugget appears on the canvas. 9. We add an Analysis node to the model nugget, to evaluate the goodness of fit of our discriminant classifier. Since the Iris data have a triple-valued target variable, we cannot calculate the AUC and Gini index (see Sect. 8.2.6), so we just evaluate the model on the coincidence matrices (see Fig. 8.117). 10. After pressing the “Run” button in the Analysis node, the output statistics appear and are shown in Fig. 8.118. A look at these model evaluation statistics shows that the model has very high accuracy. That is, an accuracy of 97% correctly classified Iris flowers in the training set and 100% correctly classified in the test data. In total numbers, only three of the 150 flowers are misclassified. Hence, the model is a well fitted classifier for our iris flower categorization problem.
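For readers who want to reproduce the gist of these steps outside the Modeler, the following scikit-learn sketch mirrors the 70–30 partition and the evaluation by accuracy and a coincidence (confusion) matrix; it uses the publicly available Iris data and is not the Modeler's Discriminant node, so the exact numbers will differ from Fig. 8.118.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)          # 70% training, 30% test

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

print("training accuracy:", lda.score(X_train, y_train))
print("test accuracy:", lda.score(X_test, y_test))
print(confusion_matrix(y_test, lda.predict(X_test)))   # coincidence matrix
```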
Fig. 8.116 Advanced output options in the Discriminant node
8.4.3 The Model Nugget and the Estimated Model Parameters
The main information on the built model is provided in the Model tab and the Advanced tab of the model nugget, the contents of which are roughly described hereafter. Model tab When opening the model nugget, first, the Model tab is shown with the predictor importance, provided its calculation was enabled in the discriminant node. We see that the variables “Petal Length” and “Petal Width” are the most important for differentiating between the Iris species. See Fig. 8.119. This is not surprising as the descriptive analysis of the data already indicated this outcome (see Fig. 8.110). Advanced tab The Advanced tab comprises statistics and parameters from the estimated model and the final discriminant functions. Most of the outputs in the Advanced tab are very technical and are only interpretable with extensive background and mathematical knowledge. We, therefore, keep the explanations of the parameters and statistics
Fig. 8.117 The Analysis node and the selected output variables
rather short and refer the reader to IBM (2019b), Duda et al. (2001) and Tuffery (2011) for further information if desired. The outputs chosen in the output options of the Expert tab in the discriminant node are displayed in this tab of the model nugget (see Sect. 8.4.2). Here, we will briefly describe the most important reports and statistics of the Advanced tab. There are several further interesting and valuable statistical tables that can be displayed, however, and we recommend trying some of these additional outputs. The first two tables in this tab give a summary of the groups, i.e., the samples with the same category. These include, e.g., the number of valid sample records in each group and the number of missing values. The statistics of the Box’s M test for equality of covariance matrices are shown in the ‘Test Results’ table (see Fig. 8.120). A significant p-value thus indicates inequality of covariance matrices, whereas a non-significant p-value means there is insufficient evidence for the matrices to differ. Here, the test is significant, and thus, hints to different covariance matrices across the target groups. However, in our case, we still assume homogeneity of covariance since the dataset is small (see Sect. 8.4.1). The table ‘Variables Entered/Removed’ gives an overview of the stepwise variable selection process and the predictors included in the final model. In each step, the variable that minimizes the overall Wilks’ lambda is entered. The table displays the variables added in every step with the corresponding Wilks’ lambda and significance
Fig. 8.118 The output statistics in the Analysis node for the Iris dataset
statistics. See Fig. 8.121. Here, all variables are included in the model, each with a significant impact on the quality of the model. The tables 'Eigenvalues' and 'Wilks' Lambda' show the number of estimated discriminant functions, together with the quality of these functions and their contribution to the classification. Here, two linear functions are needed to separate the three Iris species properly. See Fig. 8.122. The parameter calculation of the linear functions can be traced back to an eigenvalue problem, which then has to be solved. See once again Runkler (2012) and Tuffery (2011). The first table shows the parameters of the eigenvalue estimation. See Fig. 8.122. The eigenvalue thus quantifies the discriminating ability. In the column "% of Variance", the percentage of the discriminating ability is stated, which is basically the proportion of the eigenvalue relative to the sum of all eigenvalues. Here, the first discriminant function explains 99.3% of the discrimination and is therefore by far the more important of the two functions. The last column in this table holds the canonical correlations of the predictor variables and the grouping variable. The second table in Fig. 8.122 shows statistics from the hypothesis test, which tests for each function whether its canonical correlation and all canonical correlations of the successive functions are equal to zero. This is done via Wilks' lambda and a Chi-square test, see Tuffery (2011). The significance levels are displayed in the last
Fig. 8.119 Predictor importance of the linear discriminant analysis for the Iris data

Test Results
Box's M                 107.104
F          Approx.      5.045
           df1          20
           df2          35518.404
           Sig.         .000
Tests null hypothesis of equal population covariance matrices.
Fig. 8.120 Statistics of the Box’s M test for equality of covariance matrices
column. Here, both linear discriminants are significant and, thus, cannot be omitted in the classification model. The discriminating functions can be extracted from the table “Canonical Discriminant Function Coefficients” which contains the estimated coefficients of the linear equation of the discriminants. See Fig. 8.123. In the case of our Iris data, we have the following two equations:
Variables Entered/Removed (a, b, c, d)

                       Wilks' Lambda                            Exact F
Step  Entered          Statistic  df1  df2  df3        Statistic  df1  df2      Sig.
1     Petal Length     .060       1    2    100.000    778.884    2    100.000  .000
2     Sepal Width      .041       2    2    100.000    196.125    4    198.000  .000
3     Petal Width      .028       3    2    100.000    161.053    6    196.000  .000
4     Sepal Length     .026       4    2    100.000    125.348    8    194.000  .000

At each step, the variable that minimizes the overall Wilks' Lambda is entered.
a. Maximum number of steps is 8.
b. Minimum partial F to enter is 3.84.
c. Maximum partial F to remove is 2.71.
d. F level, tolerance, or VIN insufficient for further computation.
Fig. 8.121 Summary of the stepwise variable selection method with significance statistics

Eigenvalues
Function  Eigenvalue  % of Variance  Cumulative %  Canonical Correlation
1         30.152a     99.3           99.3          .984
2         .222a       .7             100.0         .426
a. First 2 canonical discriminant functions were used in the analysis.

Wilks' Lambda
Test of Function(s)  Wilks' Lambda  Chi-square  df  Sig.
1 through 2          .026           358.448     8   .000
2                    .819           19.717      3   .000
Fig. 8.122 Quality measures and parameters of the estimated discriminants, including the eigenvalues and significance test
Score1 = −0.849 · Sepal Length − 1.593 · Sepal Width + 2.146 · Petal Length + 2.570 · Petal Width − 1.155
Score2 = 0.522 · Sepal Length + 1.949 · Sepal Width − 1.036 · Petal Length + 2.662 · Petal Width − 8.275.
With these equations, the discriminant scores for each sample can be calculated via the values of the predictor variables. These scores and their location relative to the decision boundaries, i.e., the discriminants, lead to the prediction of the target class. The 'Territorial map' outlines the discriminant functions and the decision boundaries, dividing the value space into areas that represent the target groups. See Fig. 8.124. Here, the decision boundaries are linear, partitioning the value space into three areas, each representing one Iris species.
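As a quick numerical illustration of how the score equations above are applied, the following snippet evaluates both discriminant functions for a hypothetical flower; the coefficients are the unstandardized ones from Fig. 8.123, while the flower's measurements are invented purely for illustration.

```python
# Unstandardized canonical discriminant function coefficients (Fig. 8.123)
coef1 = {"Sepal Length": -0.849, "Sepal Width": -1.593,
         "Petal Length": 2.146, "Petal Width": 2.570}
coef2 = {"Sepal Length": 0.522, "Sepal Width": 1.949,
         "Petal Length": -1.036, "Petal Width": 2.662}
const1, const2 = -1.155, -8.275

flower = {"Sepal Length": 6.0, "Sepal Width": 3.0,    # hypothetical values
          "Petal Length": 4.5, "Petal Width": 1.5}

score1 = sum(coef1[k] * v for k, v in flower.items()) + const1
score2 = sum(coef2[k] * v for k, v in flower.items()) + const2
print(score1, score2)   # position of this flower in the territorial map
```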
Fig. 8.123 Matrix of the estimated coefficients of the discriminant function
Canonical Discriminant Function Coefficients
               Function 1   Function 2
Sepal Length   -.849        .522
Sepal Width    -1.593       1.949
Petal Length   2.146        -1.036
Petal Width    2.570        2.662
(Constant)     -1.155       -8.275
Unstandardized coefficients
We would like to mention that the coefficients in Fig. 8.123 are unstandardized. To see the magnitude of the effect of the discriminating variables, one has to look at the standardized coefficients, which are also displayed in another table in the model nugget. The standardization is done in such a way that the distribution of the discriminant scores has zero mean and a standard deviation equal to one. In the 'Casewise Statistics' table, the classification statistics of the first samples are displayed, comprising the predicted target group with its predicted probability and the discriminant scores. See Fig. 8.125. For example, the discriminant scores of the first sample are 7.542 and 0.291, and when plotting them in the territorial map, we see that the data point lies in the area of the Iris flower indexed by 1. Hence, the predicted group of this sample is 1. Finally, we point to the table 'Classification Function Coefficients', which contains the coefficients of Fisher's linear discriminant functions. With these functions, a score can be calculated for each target group, here, the Iris species. The sample is then assigned to the group with the highest score. This is an equivalent and direct way of predicting the target variable (see IBM (2019b) and James et al. (2013)) (Fig. 8.126).
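The following sketch illustrates this direct classification rule with the coefficients from Fig. 8.126; the flower's measurements are invented for illustration, and the species whose classification function yields the highest score is the predicted class.

```python
# Fisher's classification function coefficients (Fig. 8.126):
# one linear function per species; order of the inputs is
# sepal length, sepal width, petal length, petal width; constant last.
functions = {
    "setosa":     ([23.412, 29.143, -14.777, -19.272], -96.684),
    "versicolor": ([15.349, 13.200, 5.376, 1.611], -77.072),
    "virginica":  ([12.736, 9.367, 12.296, 14.208], -104.634),
}

x = [6.0, 3.0, 4.5, 1.5]   # hypothetical flower measurements

scores = {species: sum(c * v for c, v in zip(coefs, x)) + const
          for species, (coefs, const) in functions.items()}
print(scores)
print("predicted species:", max(scores, key=scores.get))
```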
8.4.4 Exercises
Exercise 1: Classification of Wine Via Linear and Quadratic Discriminant Models The dataset “wine_data.txt” contains chemical analysis data on three Italian wines (see Sect. 12.1.39). The goal of this exercise is to build a discriminant model that is able to identify the wine based on its chemical characteristics: 1. Import the data with an appropriate Source node and divide the dataset into training data and test data. 2. Add a Type node and specify the scale type as well as the role of each variable.
Fig. 8.124 Territorial map that shows the discriminant functions and decision boundaries
3. Perform some descriptive analysis to see if the different wines are linearly separable and check the multivariate normal distribution assumption. 4. Build a linear discriminant classification model with the Discriminant node. Include the previously defined partitioning and the calculation of the predictor importance in your model.
Fig. 8.125 Casewise statistics comprising the actual group, predicted group, posterior probability, and discriminant score

Classification Function Coefficients
               setosa     versicolor   virginica
Sepal Length   23.412     15.349       12.736
Sepal Width    29.143     13.200       9.367
Petal Length   -14.777    5.376        12.296
Petal Width    -19.272    1.611        14.208
(Constant)     -96.684    -77.072      -104.634
Fisher's linear discriminant functions
Fig. 8.126 Classification function coefficients that describe Fisher’s linear discriminant functions
5. Survey the model nugget. What is the most important predictor variable? Determine the equations for calculating the discriminant scores. What is the result of the Box's M test? Is the consideration of separate covariance matrices suitable? 6. Add an Analysis node to the nugget and run the stream. Is the model able to classify the samples? What is the accuracy on the test set? 7. Add a second Discriminant node to the stream and build a model that uses separate groups for the covariance matrices. What is the accuracy of this nonlinear model? Inspect the territorial map in the model nugget. How have the decision boundaries changed? Exercise 2: Comparison of LDC and Logistic Regression and Building an Ensemble Model In this exercise, for each of the two linear classifier methods, LDC and Logistic Regression, a model is trained on the "diabetes_data_reduced_sav" dataset (see Sect. 12.1.10), to predict if a female patient suffers from diabetes (class_variable = 1). These two models are then compared with each other, as different classification models vary in their interpretation of the data and, for example, favor different target classes. 1. Import the dataset "diabetes_data_reduced_sav" with an appropriate Source node and divide the data into a training set and test set.
2. Build an LDC that separates the diabetes patients from the non-diabetes patients. What are the most important predictor variables? 3. Train a Logistic Regression model on the same training data as the Linear Discriminant model. To do this, use the forwards stepwise variable selection method. Which variables are included in the model? 4. Compare the performance and goodness of fit of the two models with the Analysis and Evaluation nodes. 5. Use the Ensemble node to combine the two models into one model. What are the performance measures of this new ensemble model? Exercise 3: Mining in High-Dimensional Data and Dimensional Reduction The dataset "gene_expression_leukemia_all" contains genomic sample data from several patients suffering from one of four different leukemia types (ALL, AML, CLL, CML) and a healthy control group (see Sect. 12.1.16). The gene expression data are measured at 851 locations on the human genome and correspond to known cancer genes. Genomic data are multidimensional with a huge number of features, i.e., measurement points on the human genome, and often come from only a few samples. This can create problems for many classification algorithms, since they are typically designed for situations with a small number of input variables and plenty of observations. In this exercise, we use PCA for dimension reduction to overcome this obstruction. 1. Import the dataset "gene_expression_leukemia_all.csv" and extract the subset that contains only healthy and ALL patients. How many patients are left in this subset, and what are the sizes of each patient group? What can be a problem when building a classifier on this subset to separate the two patient groups? 2. Build an LDC that separates the "ALL" from the "healthy" patients. Calculate the accuracy and Gini for the training data and the test data. Explain the results. 3. Perform a PCA on the gene expression data of the training set. Plot the first two of the PCA-determined factors against each other. What can you say about the separability of the two patient groups ("ALL", "healthy") based on their location in this plot? 4. Build a second LDC on the first 5 factors. What are the accuracy and Gini for this model? Explain the difference from the first model. What are the advantages of a dimensional reduction in this case?
8.4.5 Solutions
Exercise 1: Classification of Wine Via Linear and Quadratic Discriminant Models
Name of the solution streams: Wine_LDA
Theory discussed in section: Sect. 8.2, Sect. 8.4.1

Figure 8.127 shows the final stream for this exercise.
Fig. 8.127 Stream of the LDC for the wine dataset
1. First, import the dataset with the Var. File node and connect it to a Partition node, to divide the dataset into two parts in a 70:30 ratio. See Sect. 2.7.7 for partitioning datasets. 2. Now, add a Type node to the stream and open it. We assign the role "Target" to the "Wine" variable and set its measurement to 'Nominal'. All other variables are already defined correctly, that is, as continuous input variables or as the partitioning variable, respectively. See Fig. 8.128. Afterwards, click the "Read Values" button to make sure that the stream knows all the scale levels of the variables. Otherwise, the further discriminant analysis might fail and produce an error. 3. Next, we add some Graphboard nodes (see Chap. 4) to the Type node to inspect the data. Here, we just show the descriptive analysis for the variables 'Proline' and 'Flavanoids'. However, more graphs and analyses should be produced in order to verify the separability and the multivariate normal distribution assumption. First, we visualize the 'Proline' and 'Flavanoids' variables in a scatterplot with different colors for each wine. See Fig. 8.129. We observe that the different wines are located in more or less non-overlapping clusters. So, there is a good chance that the wines are separable by a linear function. Furthermore, we plot 3-D density graphs of the two variables with another Graphboard node. See Fig. 8.130. We see that for each wine the density resembles a multivariate Gaussian distribution, and consequently, we can assume that the multivariate normal distribution condition is in place, at least for these two variables.
Fig. 8.128 Definition of the scale level and role of the variables
4. Now, build a linear discriminant classification model by adding a Discriminant node to the stream. Open the node, enable the predictor importance calculation in the Analyze tab, and choose the “Use type node settings” option in the Fields tab. The latter now uses the roles of the variables as declared in the Type node. Furthermore, make sure the “Use partitioned data” option is marked in the Model tab. To ensure that the relevant statistics for the subsequent part of the exercise are included in the model nugget, enable the ‘Expert’ mode in the Expert tab and check the box next to the Box’s M test, the Function coefficients, and the Territorial map in the output options. See Fig. 8.131. Before running the stream to build the model, rename the node ‘LDA’ in the Annotations tab. 5. Open the model nugget. In the Model tab, you can see the predictor importance plot. The ‘Proline’ variable, with 0.26, has the most importance for classifying the Italian wines. See Fig. 8.132.
Fig. 8.129 Scatterplot of the ‘Proline’ and ‘Flavanoids’ variables of the wine data
Fig. 8.130 3-D density plot of the ‘Proline’ and ‘Flavanoids’ variables of the wine data
Fig. 8.131 Advanced output options of the LDC for the wine classification
Fig. 8.132 Predictor importance of the LDC for the wine data (target: Wine; variables in descending order of importance: Proline, Flavanoids, Alcohol, Proanthocyanins, Color_intensity, Alcalinity_of_ash, Magnesium, OD280_OD315_of_diluted_wines, Nonflavanoid_phenols, Total_phenols, Ash, Malic_acid, Hue)
Fig. 8.133 Coefficients of the two discriminant functions of LDC for the wine data
Canonical Discriminant Function Coefficients
                               Function 1   Function 2
Alcohol                        .528         .888
Malic_acid                     -.128        .249
Ash                            1.384        2.653
Alcalinity_of_ash              -.212        -.208
Magnesium                      -.002        .006
Total_phenols                  -.655        .858
Flavanoids                     1.485        -1.112
Nonflavanoid_phenols           1.261        -.865
Proanthocyanins                .082         -.682
Color_intensity                -.326        .257
Hue                            .144         -2.261
OD280_OD315_of_diluted_wines   1.154        -.032
Proline                        .003         .003
(Constant)                     -11.390      -14.473
Unstandardized coefficients
The equations of the functions that calculate the discriminant scores are given in the Advanced tab of the model nugget in the "Canonical Discriminant Function Coefficients" table, which contains the coefficients of the two linear discriminant functions. See Fig. 8.133. Finally, the p-value of the Box's M test is very low (see Fig. 8.134), which indicates that the covariance matrices are likely to be unequal. So, building a model with separate-groups covariance matrices is worth trying. 6. Now, add an Analysis node to the nugget and select the option "Coincidence matrices" in the node. Then run the stream again. A window as in Fig. 8.135 opens with the evaluation statistics and confusion matrix. We see that the LDC performs very well, as the accuracy is greater than 98% in both the training and the test data. More precisely, only one wine is misclassified in each of the training and test sets. So, the built model is very accurate and suitable for classifying the Italian wines. 7. Add a second Discriminant node to the stream and connect it to the Type node. The settings are the same as in the first Discriminant node, except for the Expert tab. There, select the options 'Compute from group size' and 'Separate-groups' to ensure that during the model training different covariance matrices are used for each wine and that the frequency of each wine in the dataset is computed
Test Results
Box's M                 560.912
F          Approx.      2.577
           df1          182
           df2          28973.380
           Sig.         .000
Tests null hypothesis of equal population covariance matrices.
Fig. 8.134 Box’s M test statistics for the LDC of the wine data
Fig. 8.135 Output of the Analysis node for the LDC
individually and no uniform distribution is assumed. See Fig. 8.136. Before running the node, rename it ‘QDA’ in the Annotations tab. In the appearing model nugget, the territorial map now shows nonlinear discriminants. See Fig. 8.137. In the LDC model, the discriminants were linear since a linear separability was assumed in this model.
Fig. 8.136 Expert option in the Discriminant node that ensures a nonlinear discriminant for the wine classifier
Finally, add an Analysis node to the model nugget, select again the "Coincidence matrices" option, and run it. In Fig. 8.138, the output is shown. The accuracy is still high, i.e., over 98% in the training and test set. Hence, a nonlinear discriminant model is equally suitable for classifying the wines. However, since the LDC is more stable and robust, we would recommend picking the linear classifier as the final model for production.
Exercise 2: Comparison of LDC and Logistic Regression and Building an Ensemble Model
Name of the solution streams: Diabetes_logit_lda
Theory discussed in section: Sect. 8.2, Sect. 8.4.1, Sect. 8.3.1, Sect. 5.3.6 (ensembles)
Figure 8.139 shows the final stream for this exercise. 1. We start by opening the template stream “015 Template-Stream_Diabetes” and saving it under a different name. See Fig. 8.140 for the stream. If the data type of the target variable “class_variable” is not defined as “Flag” in the Type node, we change the type to “Flag” so that we are able to calculate the Gini and AUC evaluation measures of the later trained models. See Fig. 8.141.
Fig. 8.137 Territorial map of the nonlinear discriminant model which shows quadratic discriminants
Furthermore, we make sure that the role of the “class_variable” is “Target” and all other roles are “Input”. Now, we add the usual Partition node and split the data into training (70%) and test (30%) sets. See Sect. 2.7.7 for a detailed description of the Partition node.
Fig. 8.138 Output of the Analysis node for the nonlinear discriminant model
Fig. 8.139 Stream of the LDC and Logistic Regression classifiers for the diabetes dataset
2. We add a Discriminant node to the stream, connect it to the Partition node and open it. As the roles of the variables are already defined in the Type node (see Fig. 8.141), the node automatically identifies the roles of the variables, and thus,
Fig. 8.140 Template stream for the diabetes data
Fig. 8.141 Type node with the measurement type of the target variable (class_variable) is set to “Flag”
nothing has to be done in the Fields tab. In the Model tab, however, we select the "Stepwise" variable selection method (see Fig. 8.142), and in the Analyze tab, we enable the predictor importance calculation (see Fig. 8.143).
Fig. 8.142 Model tab of the Discriminant node and definition of the stepwise variable selection method
Fig. 8.143 Analyze tab in the Discriminant node and the enabling of the predictor importance calculation
Fig. 8.144 Predictor importance in the LDC model
Now we run the stream and open the model nugget that now appears. We observe in the Model tab that the variable “glucose_concentration” is by far the most important input variable, followed by “age”, “BMI”, and “times_pregnant”. See Fig. 8.144. 3. We now add a Logistic node to the stream and connect it to the Partition node, to train a logistic regression model. Afterwards, we open the node and select the forwards stepwise variable selection method in the Model tab. See Fig. 8.145. There, we choose the Binomial procedure, since the target is binary. The multinomial procedure with the stepwise option is also possible and results in the same model. Before running the node, we further enable the importance of calculation in the Analyze tab. We then open the model nugget and note that the most important variables included in the logistic regression model are the same as in the LDC, i.e., “glucose_concentration”, “age”, and “BMI”. These three are the only variables included in the model using the forwards stepwise variable selection method (Fig. 8.146).
Fig. 8.145 Model tab in the Logistic node with definition of the variable selection method
4. Before comparing both models, we rearrange the stream by connecting the two model nuggets in series. See Fig. 8.147. This can easily be done by dragging a part of the arrow connecting the Partition node to the logistic regression model nugget onto the LDC model nugget. Now, we add an Analysis and an Evaluation node to the stream and connect them to the logistic regression nugget. Then, we run these two nodes to calculate the accuracy and Gini values and to visualize the ROC curve. The settings of these two nodes are explained in Sects. 8.2.6 and 8.2.7. In Fig. 8.148 the model statistics of the LDC and logistic regression can be viewed. We see that the AUC/Gini values are slightly better for the linear discriminant model; this is also visualized in Fig. 8.149, where the ROC curve of the LDC model is located a bit above the logistic regression ROC curve. Accuracy in the test set is, however, higher in the logistic regression model. This is a good example of how a higher Gini does not have to go along with a higher accuracy and vice versa. The decision on a final model is thus always tied to the choice of the performance measure. Here, the LDC would be preferred when looking at the Gini, but when taking accuracy as the indicator, the logistic regression would be slightly preferred.
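The relationship between the two measures can be made explicit: the Gini reported by the Modeler equals 2 · AUC − 1 and judges the ranking of the predicted probabilities, whereas accuracy only counts correct assignments at a fixed cut-off. The following sketch with invented probabilities shows how one model can win on Gini while the other wins on accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# made-up predicted probabilities of class 1 from two hypothetical models
p_a = np.array([0.05, 0.10, 0.55, 0.60, 0.65, 0.70, 0.80, 0.90])
p_b = np.array([0.10, 0.20, 0.30, 0.45, 0.05, 0.60, 0.70, 0.80])

for name, p in [("model A", p_a), ("model B", p_b)]:
    auc = roc_auc_score(y_true, p)
    acc = accuracy_score(y_true, (p >= 0.5).astype(int))
    print(name, "accuracy:", round(acc, 3), "Gini:", round(2 * auc - 1, 3))

# model A ranks the two classes perfectly (higher Gini) but misclassifies
# more records at the 0.5 cut-off than model B (lower accuracy).
```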
Fig. 8.146 Predictor importance in the Logistic Regression model
Fig. 8.147 Rearranging the stream by connecting the model nuggets in series
To analyze where these differences arise from, we inspect the coincidence matrices. We note that the two models differ in which target class they favor. The logistic regression model has a higher tendency towards the non-diabetes class (class_variable = 0) than the linear discriminant model. The former thus predicts a non-diabetes diagnosis more often than the LDC, for both the training set and the
Fig. 8.148 Performance measures in the LDC and logistic regression model for the diabetes data
test set. Moreover, the LDC predicts diabetes more often than it occurs in the data; that is, in the training set the LDC predicts 98 patients will have diabetes, although there are only 90 diabetes patients in the set. This is similar for the test set. Thus, to compensate for this overprediction of diabetes patients and create a more robust prognosis, one possibility is to combine the two models into an ensemble model. 5. To combine the two models into an ensemble model, we add the Ensemble node to the stream and connect it to the logistic regression model nugget. We open the
Fig. 8.149 ROC curves of the LDC and logistic regression model
Fig. 8.150 Settings of the Ensemble node with averaging over raw propensity scores as ensemble method
Fig. 8.151 Enabling of the propensity calculations in the LDC model nugget
node and choose "class_variable" as the target field for the ensemble model. See Fig. 8.150. Additionally, we set the aggregation method of the ensemble to "Average raw propensity". When running this node, the predictions, i.e., the probabilities for the target classes, of the two models, LDC and logistic regression, are averaged, and the target class with the higher averaged probability wins and is therefore predicted by the ensemble. For the averaging to work properly, we have to ensure that the propensities of the models are calculated. See IBM (2019b) for information on propensity scores. This can be enabled in the Settings tab of the model nuggets. See Fig. 8.151 for the LDC model nugget. This can be done analogously for the logistic regression model nugget. We add an Analysis node to the stream and connect it to the Ensemble node. In Fig. 8.152, the accuracy and Gini of the ensemble model are displayed. We see that the Gini of the test set has increased by 0.004 points compared to the LDC (see Fig. 8.148). Furthermore, the accuracy in the training set and test set has slightly improved. In conclusion, the prediction power of the LDC improves slightly when it is combined with a logistic regression model within an ensemble model.
Exercise 3: Mining in High-Dimensional Data and Dimensional Reduction
Name of the solution streams: leukemia_gene_expression_lda_pca
Theory discussed in section: Sect. 8.2, Sect. 8.4.1, Sect. 6.3
Figure 8.153 shows the final stream for this exercise.
Fig. 8.152 Evaluation measures in the ensemble model, consisting of the LDC and logistic regression models
Fig. 8.153 Solution stream for the exercise of mining gene expression data and dimensional reduction
Fig. 8.154 Template stream of the gene expression of leukemia ALL data
Fig. 8.155 Selection of the subset that contains only ALL or healthy patients
1. We start by opening the template stream "018 Template-Stream gene_expression_leukemia" and saving it under a different name. See Fig. 8.154. The template already comprises a Partition node with a data split, 70% training and 30% test. Furthermore, the roles in the Type node are already defined, that is, the "Leukemia" variable is set as the target and all genomic positions are set as inputs. We then add a Select node and place it in the stream between the Source node and the Partition node. We open the Select node and enter the formula that selects only the "ALL" or "healthy" patients in the dataset. See Fig. 8.155 for the formula inserted in the Select node. Afterwards, we add a Distribution node to the Select node to draw the frequency of the two patient groups. Figure 8.156 shows the distribution of the Leukemia variable in the subset. There are in total 207 patients in the subset, of which 73 are healthy and the others suffer from ALL.
Fig. 8.156 Distribution of the leukemia variable within the selected subset
When building a classifier on this subset, the large number of input variables (851) is a problem compared with the number of observation records (207). In this case, many classifiers suffer from overfitting. 2. We add a Discriminant node to the stream and connect it to the Type node. As the roles of the variables are already defined in the Type node, the Modeler automatically detects the target, input, and partition fields. In the Model tab of the Discriminant node, we further choose the stepwise variable selection method before running the node. After the model nugget appears, we add an Analysis node to the stream and connect it to the nugget. For the properties of the Analysis node, we recall Sect. 8.2.6. The evaluation statistics, including accuracy and Gini, are shown in Fig. 8.157. We see that the statistics are very good for the training set, but clearly worse for the test set. The Gini indicates this especially, since it is only about half as large in the test set (0.536) as in the training set (1.0). This signals overfitting of the model. So, the LDC is unable to handle the gene expression data, in which the number of input variables is disproportionate to the number of patients. The reason for this can be that the huge number of features obscures the basic structure in the data. 3. We add a PCA/Factor node to the stream and connect it to the Type node in order to consolidate variables and identify common structures in the data. We open the PCA/Factor node and select all genomic position variables as inputs in the Fields tab. See Fig. 8.158. In the Model tab, we mark the "Use partitioned data" option in order to use only the training data for the factor calculations. The method we intend to use is the "Principal Components" method, which we also select in this tab. See Fig. 8.159. Now, we run the PCA/Factor node and add a Plot node to the appearing nugget. In the Plot node, we select factor 1 and factor 2 as the X and Y fields. Furthermore, we define the "Leukemia" variable as a coloring and shape indicator, so the two groups can be distinguished in the plot. See Fig. 8.160.
Fig. 8.157 Analysis output for the LDC on the raw gene expression data
In Fig. 8.161, the scatterplot of the first two factors is shown. As can be seen, the "ALL" patients and the "healthy" patients are concentrated in clusters, and the two groups can be separated by a linear boundary. 4. We add another Type node to the stream and connect it to the PCA/Factor model nugget, so the following Discriminant node is able to identify the new variables with their measurement levels. As just mentioned, we add a Discriminant node and connect it to the new Type node. In the Fields tab of the Model node, we select "Leukemia" as the target, "Partition" as the partition variable, and all 5 factors, which were determined by the PCA, as input variables. See Fig. 8.162. In the Model tab, we further choose the stepwise variable selection method before running the stream. We connect the model nugget of the second LDC to a new Analysis node and choose the common evaluation statistics as the output. See Sect. 8.2.6. The output of this Analysis node is displayed in Fig. 8.163. We immediately see that the accuracy, as well as the Gini, has improved for the test set, while the statistics in the training set have not decreased. Both values now indicate extremely good separation and classification power, and the problem of overfitting has faded.
Fig. 8.158 Input variable definition in the PCA/Factor node
Fig. 8.159 Definition of the extraction method and usage of only the training data to calculate the factors
Fig. 8.160 Setup of the scatterplot to visualize factors 1 and 2, determined by the PCA
Fig. 8.161 Scatterplot of the first two factors determined by the PCA
Fig. 8.162 Selection of the factors as inputs for the second LDC
An explanation of why the LDC on the PCA calculated factors performs better than on the raw data is already given in previous steps of this solution. The huge number of input variables has supposedly hidden the basic structure that separates the two patient groups, and the model trained on the raw data is therefore overfitting. The PCA has now uncovered this basic structure and the two groups are now linearly separable, as visualized in Fig. 8.161. Another advantage of this dimensional reduction is an improvement in the time it takes to build the model and predict the classes. Furthermore, less memory is needed to save the data in this reduced form.
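A rough outside-the-Modeler analogue of this two-step workflow (PCA for dimension reduction followed by an LDC) is sketched below with scikit-learn. The file name and the 'Leukemia' column follow the dataset description above, but the exact column layout is an assumption, and the numbers will not match the Modeler output.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# assumed layout: a 'Leukemia' column plus one column per genomic position
df = pd.read_csv("gene_expression_leukemia_all.csv")
df = df[df["Leukemia"].isin(["ALL", "healthy"])]

X = df.drop(columns="Leukemia").values
y = (df["Leukemia"] == "ALL").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7, stratify=y)

# the PCA factors are estimated on the training data only, as in the stream
model = make_pipeline(PCA(n_components=5), LinearDiscriminantAnalysis())
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```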
Fig. 8.163 Analysis output for the LDC with factors from a PCA as input variables
8.5 Support Vector Machine
In the previous two sections, we introduced the linear classifiers, namely, logistic regression and linear discriminant analysis. From now on, we leave linear cases and turn to the nonlinear classification algorithms. The first one in this list is the Support Vector Machine (SVM), which is one of the most powerful and flexible classification algorithms and is discussed below. After a brief description of the theory, we attend to the usage of SVM within the SPSS Modeler.
8.5.1 Theory
The SVM is one of the most effective and flexible classification methods, and it can be seen as a bridge between linear and nonlinear classification. Although the SVM separates the classes via a linear function, it is often categorized as a nonlinear classifier due to the following fact: The SVM comprises a preprocessing step, in which the data are transformed so that data that were previously not linearly separable can now be divided via a linear function. This transformation technique makes the SVM applicable to a variety of problems, by constructing highly complex decision boundaries. We refer to James et al. (2013) and Lantz (2013).
Fig. 8.164 Illustration of the decision boundary discovery and the support vectors
The Support Vectors
The SVM constructs a linear function (a hyperplane in higher dimensions) to separate the different target classes. It chooses the decision boundary with the following approach: Consider a set of data containing two classes, circles and rectangles, which are perfectly separable by a linear function. See the left graph in Fig. 8.164, where two possible decision boundaries are displayed. The SVM now chooses the one with the largest margin. The margin is the distance between the decision boundary and the nearest data points. Hence, the SVM chooses as decision boundary the linear function with the largest distance to the classes. The closest data points characterize the decision boundary uniquely, and they are called support vectors. In the right graph in Fig. 8.164, the classification boundary is shown as a solid line, and the two support vectors, marked with arrows, uniquely define the largest possible margin, which is indicated by the dashed lines.
Mapping of Data and the Kernel Function
In the case of classes that are not linearly separable, as in the left graph in Fig. 8.165 for example, the SVM uses a kernel trick by mapping the data into a higher-dimensional space, in which the data are then linearly classifiable. This shifting of the data into a higher dimension reduces the complexity and thus simplifies the classification problem. This is an unusual approach compared with other classification methods, as it transforms the data in such a way that it suits the method, instead of trying to construct a complex separator for the training data within the modeling process. The process of data transformation is demonstrated in Fig. 8.165. The mapping function is defined by the choice of kernel function, and there are several standard kernel function types that are commonly used. The ones supported by the SPSS Modeler are:
• Linear: K(x_i, x_j) = x_i^T · x_j
• Polynomial: K(x_i, x_j) = (γ · x_i^T · x_j + r)^d
Fig. 8.165 Transformation of the data via a kernel function
• Sigmoid: K(x_i, x_j) = tanh(γ · x_i^T · x_j + r)
• Radial basis function (RBF): K(x_i, x_j) = exp(−γ · ‖x_i − x_j‖²)
In the formulas, x_i and x_j are feature vectors, and the parameters (γ, r, d) of the various kernels can be tuned in the SVM node (see Sect. 8.5.2). In conclusion, we do not have to know the complex mapping function itself, as long as we know or define the simple kernel function. This is the kernel trick. We refer to Lantz (2013) and IBM (2019a) for a more detailed description of the kernels. Among these, the RBF kernel is the most popular for SVM, as it performs well on most data types; this kernel is therefore always a good starting point. The right choice of kernel ensures robustness and a high classification accuracy. The kernel function and its parameters must, however, be chosen carefully, as an unsuitable kernel can cause overfitting of the model. For this reason, a cross-validation with the training set and test set is always strongly recommended. Furthermore, an additional validation set can be used to find the optimal kernel parameters (see Sect. 8.2.1). See Schölkopf and Smola (2002) and Lantz (2013) for additional information on the kernel trick.
Data Preparation and Feature Rescaling
Since distances have to be calculated to determine the support vectors and estimate the decision boundary, a recommended data preprocessing step is rescaling of the features prior to training the SVM. Feature rescaling means performing a z-transformation on all continuous variables, that is, subtracting the mean and dividing each value by the standard deviation. Then, every feature is standardized and has zero mean and a standard deviation of 1; consequently, the features have a common scale. This is favorable when building an SVM for the following reason: If
the features are on different scales, then a feature with a large scale is likely to dominate the distance calculations. This can result in support vectors and a decision boundary that are fitted to this dominating feature but underrepresent the other features, which can lead to a suboptimal model. See Sect. 2.7.6 for a detailed explanation of the z-transformation and the Auto Data Prep node, which is the standard way to rescale features in the SPSS Modeler.
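To make the kernel formulas and the z-score rescaling concrete, the following is a minimal R sketch (it is not part of the book's streams; the data and the values of γ, r, and d are purely illustrative):

```r
# z-score rescaling: subtract the mean and divide by the standard deviation
x1 <- c(2.0, 150, 0.3)           # two illustrative feature vectors on
x2 <- c(1.5, 300, 0.1)           # very different scales
X  <- rbind(x1, x2)
X_scaled <- scale(X)             # column-wise z-transformation

# the four kernel functions supported by the SVM node (see the formulas above)
k_linear  <- function(xi, xj)                          sum(xi * xj)
k_poly    <- function(xi, xj, gamma = 1, r = 0, d = 3) (gamma * sum(xi * xj) + r)^d
k_sigmoid <- function(xi, xj, gamma = 1, r = 0)        tanh(gamma * sum(xi * xj) + r)
k_rbf     <- function(xi, xj, gamma = 1)               exp(-gamma * sum((xi - xj)^2))

k_rbf(X_scaled[1, ], X_scaled[2, ], gamma = 0.5)   # kernel value of the rescaled vectors
```

Without the rescaling, the second coordinate (scale of hundreds) would dominate the distance inside the RBF kernel, which is exactly the effect described above.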
8.5.2
Building the Model with SPSS Modeler
In this section, we introduce the SVM node of the Modeler and explain how it is used in a classification stream. The model we are building here refers to the sleep detection problem described as a motivating example in Sect. 8.1. We briefly mention that the SPSS Modeler provides a second node for building a linear SVM, the LSVM node. This node is particularly suited for datasets with a large number of predictor variables. We will not go into detail here and refer to IBM (2019b) for further information on this node. The dataset EEG_Sleep_signal.csv contains EEG signal data from a single person in a "drowsiness and awake" state (see Sect. 12.1.12). The electrical impulses of the brain are measured every 10 milliseconds (ms), and the data is split into segments of 30 seconds (s). The task is then to classify a 30 s EEG signal as either a drowsiness or an awake state. The problem is that the states cannot be classified based on the raw signals, since EEG signals have a natural volatility and can fluctuate between different levels (Niedermeyer et al. (2011)). So, the structure of a signal is more important than its actual measured value. See Fig. 8.166 for an excerpt of the EEG signals. Every row is a 30 s signal segment. In summary, before building a classifier on the EEG signals, we generate a new data matrix that contains features calculated from the EEG data. See Fig. 8.167 for the complete stream, named "EEG_Sleepdetection_svm_COMPLETE". The first part is dedicated to the feature calculation, and the model is built in the second part of the stream.
Fig. 8.166 Excerpt from the EEG signals
The feature calculation is performed in R (R Core Team (2014)) via an R node. Therefore, R has to be installed on the computer and included within the SPSS Modeler. This process and the usage of the R node are explained in detail in Chap. 9. We split the stream into two separate streams, a feature calculation stream and a model building stream. If one is not interested in the feature calculation, the first part can be skipped, as the model building process is described on the already generated feature matrix. For the interested reader, we refer to Niedermeyer et al. (2011) for detailed information on EEG signals, their properties, and analysis. In this situation, the raw data have to be preprocessed and transformed into more appropriate features on which the model is able to separate the data. This transformation of data into a more suitable form and the generation of new variables out of given ones is common in data analytics. Finding new variables that will improve model performance is one of the major tasks of data science. We will experience this in Exercise 3 in Sect. 8.5.4, where new variables are obtained from the given ones, which will increase the prediction power of a classifier.
Feature Generation
Description of the model
Stream name: EEG_Sleep_calculate_features
Based on dataset: EEG_Sleep_Signals.csv (see Sect. 12.1.12)
Stream structure: feature calculation part of the stream in Fig. 8.167
Important additional remarks: The calculation of features is done in the R node, since this is the quickest and easiest way. How to include R in the SPSS Modeler and how to use R nodes is explained in more detail in Chap. 9. This stream generates the new features, which are exported into a csv file. This file is later imported by the second stream, which builds the SVM model.
Related exercises: 3 and all exercises in Chap. 9
1. We start the feature calculations by importing the EEG signals with a Var. File node. The imported data then look like Fig. 8.166. 2. Now, we add an R Transform node, in which the feature calculations are then performed. Table 8.10 lists the features we extract from the signals.
Table 8.10 Features calculated from the EEG signals
• Activity: Variation in the signal
• Mobility: Represents the mean frequency
• Complexity: Describes the change in frequency
• Range: Difference between the maximum and minimum values in the signal
• Crossings: Number of x-axis crossings in the standardized signal
Fig. 8.167 EEG_Sleepdetection_svm_COMPLETE stream, which builds a classifier to detect sleepiness in EEG signals
The first three features are called Hjorth parameters. They are classic statistical measures in signal processing and are often used for analytical purposes; see Niedermeyer et al. (2011) and Oh et al. (2014). We open the R Transform node and include in the R Transform syntax field the R syntax that calculates the features. See Fig. 8.168 for the node and Fig. 8.169 for the syntax inserted into the R node. In the figure, the syntax is displayed in the R programming environment RStudio (RStudio Team (2015)). The syntax is provided under the name "feature_calculation_syntax.R". Now we explain the syntax in detail. The data inserted into the R Transform node for manipulation is always named "modelerData", so that the SPSS Modeler can identify the input and output data of the R nodes when the calculations are complete.
Row 1: In the first row, the modelerData are assigned to a new variable named "old_variable".
Rows 3 + 4: Here, the signal data (the first 3000 columns) and the sleep statuses of the signal segments are assigned to the variables "signals" and "sleepiness".
Row 6: Definition of a function that calculates the mobility of a signal segment x.
Rows 8–15: Calculation of the features. These, together with the sleepiness states, are then consolidated in a data.frame, which is an R-matrix format. This data.frame is then assigned to the variable "modelerData". Now the Modeler can further process the feature matrix, as the variable "modelerData" is normally passed onto the next node in the stream.
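For orientation, the following is a minimal, self-contained R sketch of this kind of feature calculation. It is not the book's feature_calculation_syntax.R file; the function names and the simulated data are illustrative, and the exact Hjorth formulas used in the original syntax may differ slightly. The remaining rows of the original syntax, which define the modelerDataModel, are described below.

```r
# Sketch of Hjorth-style features for EEG segments; each row of `signals`
# is one 30 s segment sampled every 10 ms (3000 values).

mobility <- function(x) sqrt(var(diff(x)) / var(x))

calc_features <- function(x) {
  z <- as.numeric(scale(x))                        # standardized signal
  c(activity   = var(x),                           # variation in the signal
    mobility   = mobility(x),                      # proxy for the mean frequency
    complexity = mobility(diff(x)) / mobility(x),  # change in frequency
    range      = max(x) - min(x),                  # max minus min value
    crossings  = sum(diff(sign(z)) != 0))          # x-axis crossings
}

signals  <- matrix(rnorm(5 * 3000), nrow = 5)      # 5 simulated segments
features <- as.data.frame(t(apply(signals, 1, calc_features)))
# In an R Transform node, the resulting data.frame (together with the
# sleepiness states) would be assigned back to the variable modelerData.
```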
Fig. 8.168 The R Transform node in which the feature calculations are declared
Fig. 8.169 R syntax that calculates the features and converts them back to SPSS Modeler format, displayed in RStudio
Rows 17–24: In order for the data to be processed correctly, the SPSS Modeler must know the fields and their measurement types, and so the fields in the "modelerData" data.frame have to be specified in the data.frame variable "modelerDataModel". The storage type is defined for each field, which is "real" for the features and "string" for the sleepiness variable.
3. We add a Data Audit node to the R Transform node, to inspect the newly calculated features. See Fig. 8.170 for the distributions and statistics of the feature variables.
4. To save the calculated features, we add the output node Flat File to the stream and define a filename and path, as well as a column delimiter and the further structure of the file.
Building the Support Vector Machine on the New Feature Data
Description of the model
Stream name: EEG_Sleepdetection_svm
Based on dataset: Features_eeg_signals.csv (see Sect. 12.1.15)
Stream structure: model building part of the stream in Fig. 8.167
Important additional remarks: This is a sub-stream of the EEG_Sleepdetection_svm_COMPLETE stream in which the SVM model is trained. The stream starts with the import of the feature matrix generated by the previous stream. Cross-validation is included in the model node, and it can be performed as described in the logistic regression section, Sect. 8.3.6.
Related exercises: All exercises in Sect. 8.5.4
By preprocessing the EEG signals, we are now able to build an SVM classifier that separates sleepiness states from awake states. 1. We start by importing the data in the features_eeg_signal.csv file with a Var. File node. The statistics obtained with the Data Audit node can be viewed in Fig. 8.170. Afterwards, we add a Partition node to the stream and allot 70% of the data as a training set and 30% as a test set. See Sect. 2.7.7 for how to use the Partition node.
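The 70/30 split produced by the Partition node corresponds to the following hedged R sketch (the file and data frame names are illustrative):

```r
# 70/30 split of the feature matrix into a training set and a test set
set.seed(123)                                        # reproducible split
features <- read.csv("features_eeg_signals.csv")     # illustrative file name
idx   <- sample(nrow(features), size = round(0.7 * nrow(features)))
train <- features[idx, ]                             # 70% training set
test  <- features[-idx, ]                            # 30% test set
```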
Fig. 8.170 Statistics on the new features derived from the EEG signals
2. Now, we add the usual Type node to the stream and open it. We have to make sure that the measurement type of the "sleepiness" variable, i.e., the target variable, is "Nominal" or "Flag" (see Fig. 8.171).
3. To perform the recommended feature rescaling, we add an Auto Data Prep node to the stream and connect it with the Type node. After opening it, we go to "Settings > Prepare Inputs & Target" and activate the z-score transformation rescaling method at the bottom. See Fig. 8.172. Afterwards, we click on the "Analyze Data" button at the top, which starts the data preprocessing step and rescales all continuous variables. The old variables are removed and replaced by the new rescaled variables, which are labeled with the suffix "_transformed", e.g., the rescaled version of the variable "activity" is named "activity_transformed".
4. We add an SVM node to the stream and connect it to the Type node. After opening it, we declare the variable "sleepiness" as our target in the Fields tab, "Partition" as the partitioning variable, and the remaining variables, i.e., the calculated and transformed features listed in Table 8.10, as input variables (see Fig. 8.173).
5. In the Model tab, we enable the "Use partitioned data" option so that cross-validation is performed. In other words, the node will only use the training data to build the model and the test set to validate it (see Fig. 8.174).
Fig. 8.171 Setting of the target variable measurement as “Flag”
6. In the Expert tab, the kernel function and parameters used in the training process can be defined. By default, the simple mode is selected (see Fig. 8.175). This mode utilizes the RBF kernel function with its standard parameters for building the model. We recommend using this option if the reader is unfamiliar with the precise kernels, their tuning parameters, and their properties. We proceed with the simple mode in this description of the SVM node. If one has knowledge of and experience with the different kernels, then the kernel function and the related parameters can be specified in the expert mode (see Fig. 8.176). Here, we focus on the most important options and kernel parameters for a classification model and refer the interested reader to IBM (2019b), Lantz (2013), and Schölkopf and Smola (2002) for a more precise description. First, and most importantly, the kernel type can be chosen. Possible function types are: RBF, Polynomial, Sigmoid, and Linear. This defines the type of kernel used to transform the data space (see Sect. 8.5.1). Depending on the selected kernel type, additional options are available to fine-tune the parameters of the kernel function. These are listed in Table 8.11; see Sect. 8.5.1 for the formulas of the kernel functions.
Fig. 8.172 Enabling of feature rescaling with the z-score transformation in the Auto Data prep node
A problem that often occurs while fitting an SVM is that the classes are not perfectly separable and misclassification is unavoidable. See Fig. 8.177 for an illustration of this problem. In this case, there is a trade-off between a wide margin and a small number of misclassified points, for which the optimal balance has to be found. The "Regularization parameter (C)", which can be set in the Expert tab of the SVM node (see Fig. 8.176), controls this trade-off. A typical value lies between 1 and 10, with 10 being the default. Increasing the regularization parameter improves the accuracy of the SVM classifier on the training data, but this may lead to overfitting. We recommend that the reader experiment with all the mentioned parameters, i.e., build models with different values of the kernel parameters, in order to find the best model. 7. Now, we run the stream and the model nugget appears.
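For readers who want to experiment with the kernel parameters and the regularization parameter C outside the Modeler, here is a hedged R sketch using the e1071 package. It is not part of the book's stream; the data frame and column names refer to the illustrative train/test split sketched earlier.

```r
# Cross-validated search over the RBF kernel parameter gamma and the
# regularization parameter C (called "cost" in e1071).
library(e1071)

set.seed(42)
tuned <- tune.svm(sleepiness ~ ., data = train, kernel = "radial",
                  gamma = c(0.1, 0.5, 1), cost = c(1, 5, 10))
summary(tuned)                    # cross-validated error per parameter combination

best_model <- tuned$best.model    # SVM refitted with the best parameters
pred <- predict(best_model, test)
mean(pred == test$sleepiness)     # accuracy on the held-out test set
```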
Fig. 8.173 Definition of the input and target variables in the SVM node
Fig. 8.174 Enabling of the cross-validation procedure in the SVM node
Fig. 8.175 Standard kernel setting in the SVM node
Fig. 8.176 Expert tab. Definition of kernel type and parameters
Table 8.11 List of options for fine-tuning the kernel function parameters in the SVM node
• RBF gamma (kernel type: RBF): Typically, the value should be between 3/k and 6/k, where k is the number of input fields. Increasing the value improves the classification for the training data; however, this can also lead to overfitting.
• Gamma (kernel types: Polynomial, Sigmoid): Increasing this value typically improves the classification accuracy for the training data; however, this can also lead to overfitting.
• Bias (kernel types: Polynomial, Sigmoid): Defines the constant value r in the kernel function. The default value is 0, which is suitable in most cases.
• Degree (kernel type: Polynomial): Defines the degree d of the polynomial kernel. This parameter controls the dimension of the mapping space and is normally less than or equal to 10.
Fig. 8.177 Example where the classes are not perfectly separable
8. We connect the nugget to an Analysis node and an Evaluation node. See Sects. 8.2.6 and 8.2.7 for the settings of these nodes. Figures 8.178 and 8.179 display the evaluation statistics and the ROC curves of the training set and test set for the SVM. We note that the accuracy in both the training and test sets is very high (over 90%), and the Gini value is also high in both cases. This is visualized by the ROC curves. In conclusion, the model is able to separate sleepiness from an awake state.
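As a side note, the Gini measure reported here is directly related to the area under the ROC curve (Gini = 2·AUC − 1). A hedged R sketch with the pROC package, assuming a vector of true labels and predicted probabilities for the test set (variable names are illustrative):

```r
# ROC curve, AUC and Gini for a binary classifier
library(pROC)

roc_obj <- roc(actual, prob_drowsy)   # true labels vs. predicted probabilities
plot(roc_obj)                         # ROC curve, as drawn by the Evaluation node
auc(roc_obj)                          # area under the curve
2 * auc(roc_obj) - 1                  # Gini coefficient
```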
Fig. 8.178 Evaluation statistics for the sleep detection classifier
Fig. 8.179 ROC curves of the sleep detection SVM classifier
8.5.3
The Model Nugget
Statistics and goodness of fit measures within the SVM model nugget are sparse compared to other model nuggets. The only statistic the SPSS Modeler provides in the SVM model nugget is the predictor importance view (see Fig. 8.180). For the sleep detection model, the "crossing0_transformed" feature is the most important variable, followed by "mobility_transformed", "range_transformed", "complexity_transformed", and "activity_transformed". The importance of the x-axis crossings suggests that the fluctuation around the mean of the signal is an indicator of being asleep or awake.
8.5.4
Exercises
Exercise 1: Detection of Leukemia in Gene Expression Data The dataset “gene_expression_leukemia_short.csv” contains gene expression measurements from 39 human genome positions of various leukemia patients (see Sect. 12.1.17). This genomic data is the basis on which doctors obtain their diagnosis of whether a patient has leukemia. Your task is to build an SVM classifier that decides for each patient whether or not they have blood cancer. 1. Import the data and familiarize yourself with the gene expression data. How many different types of leukemia are there, and how often do they occur in the dataset? 2. Unite all leukemia types into a new variable value that just indicates the patient has cancer. How many patients have cancer and how many are healthy?
Fig. 8.180 Predictor importance view in the SVM model nugget
3. Preprocess the data by performing a z-transformation on all continuous variables. 4. Build an SVM classifier that can differentiate between a leukemia patient and a non-leukemia patient. What are the accuracy and Gini values for the training set and the test set? Draw a ROC curve to visualize the Gini. Exercise 2: Classification of Leukemia Types: The Choice of Kernel Problem Again, consider the dataset "gene_expression_leukemia_short.csv" from Exercise 1 that contains gene expression measurements for 39 human genome positions of various leukemia patients (see Sect. 12.1.17). 1. Import the data, set up a cross-validation stream, and standardize the input variables with a z-transformation. 2. Build a classification model with the SVM node. To do this, use the sigmoid kernel to predict the type of leukemia, based on the gene expression data. 3. What is the accuracy of the model after the previous step? Is the model suitable for distinguishing between the different leukemia types? 4. Build a second SVM model with RBF as the kernel function. Compare this model with the sigmoid kernel model. Which one has more prediction power? Exercise 3: Titanic Survival Prediction and Feature Engineering Deriving new information from given variables is a major part of data science. The titanic.xlsx file contains data on Titanic passengers, including an indicator variable "survived", which indicates whether a particular passenger survived the Titanic sinking (see Sect. 12.1.37). Your task in this exercise is to generate new variables from the Titanic data that will improve the prediction of passenger survival with an SVM classifier. Pay attention to the missing data within the Titanic dataset. See also Sect. 8.3.7, Exercise 2, where a logistic regression classifier has to be built for this problem, and missing values handling must be performed with the Auto Data Prep node. 1. Import the Titanic data and inspect each variable. What additional information can be derived from the data? 2. Create three new variables that describe the deck of the passenger's cabin, his/her (academic) title, and the family size. The deck can be extracted from the cabin variable, as it is the first symbol in the cabin code. The passenger's title can be derived from the name variable; it is located between the symbols "," and "." in the name variable entries. The family size is just the sum of the variables "sibsp" and "parch". Use the Derive node to generate these new variables. What are the values and frequencies of these variables? Merge the values that occur only once with other similar values. 3. Use the Auto Data Prep node to normalize the continuous variables and replace the missing data. 4. Build four classifiers with the SVM node to track the prediction accuracy and performance when adding new variables. So, the first model is based on the original variables, the second model includes the deck variable, and the third also comprises the title of the passengers. Finally, the fourth model also takes the family size of the passenger as an input variable.
5. Determine the most important input variables for each of the models. Are the new variables (deck, title, family size) relevant for the prediction? Compare the accuracy and Gini of the models. Have the new variables improved the prediction power? Visualize the change in model fitness by plotting the ROC curves of all four models with the Evaluation node.
8.5.5
Solutions
Exercise 1: Detection of Leukemia in Gene Expression Data
Name of the solution streams: gene_expression_leukemia_short_svm
Theory discussed in section: Sect. 8.2, Sect. 8.5.1
The stream displayed in Fig. 8.181 is the complete solution of this exercise. 1. We open the template stream “019 Template-Stream gene_expression_ leukemia_short”, which is shown in Fig. 8.182. This template already contains a Partition node that splits the data into a training set and a test set in the usual ratio. Now we run the stream, to view the statistics in the Data Audit node. The last variable is “Leukemia”, which describes the cancer type of the patients. By
Fig. 8.181 Complete stream of the leukemia detection classifier
Fig. 8.182 Template stream of the gene expression data
Fig. 8.183 Data Audit node for the gene expression data
Fig. 8.184 Distribution of the leukemia variable
double-clicking on the graph symbol, the distribution of this variable is shown in a new window. See Fig. 8.183 for the Data Audit node and Fig. 8.184 for the distribution plot of the “Leukemia” variable. We learn from the graph that there are patients with 4 types of leukemia in the dataset, and about 5.73% of the whole dataset are healthy people with no leukemia. The leukemia types are ALL (10.53% of the patients), AML (42.58%), CLL (35.19%), and CML (5.97%). 2. To assign all four leukemia types (AML, ALL, CLL, CML) to a joint “cancer” class, we add a Reclassify node to the stream. In the node options, select the variable “Leukemia” as the reclassify field and click on the “Get” button, which will load the variable categories as original values. In the New value fields of the four leukemia types, we put “Cancer” and “Healthy” for the non-leukemia entry (see Fig. 8.185). At the top, we then choose the option “reclassify into existing field”, to overwrite the original values of the “Leukemia” variable with the new values. Now, we add a Distribution node to the stream and connect it to the Reclassify node. In the graph options, we choose “Leukemia” as the field and run the stream. Figure 8.186 shows the distribution of the new assigned values of the “Leukemia”
Fig. 8.185 Reclassify node that assigns the common variable label “Cancer” to all leukemia types
variable. 94.27% of the patients in the dataset have leukemia, whereas 5.73% are healthy. 3. For rescaling the input variables, we add an Auto Data Prep node to the stream and connect it to the Reclassify node. In “Settings > Prepare Inputs & Target”, we activate the z-score transformation as described in Sect. 8.5.2 with the settings displayed in Fig. 8.172. 4. Before building the SVM, we add another Type node to the stream and define in it the target variable, i.e., the “Leukemia” field (see Fig. 8.187). Furthermore, we set the role of the “Patient_ID” field to “None”, as this field is just the patients identifier and irrelevant for predicting leukemia. Now, we add the SVM node to the stream. As with the previous definitions of the variable roles, the node automatically identifies the target, input, and partitioning variables. Here, we choose the default settings for training the
Fig. 8.186 Distribution of the reclassified leukemia variable classes
Fig. 8.187 Definition of the target variable for the leukemia classifier via SVM
SVM. Hence, nothing has to be changed in the SVM node. We run the stream and the model nugget appears. To evaluate the model performance, we add an Analysis node and Evaluation node to the stream and connect it to the nugget. The options in these nodes are described in Sects. 8.2.6 and 8.2.7. After running these two nodes, the goodness of fit statistics and the ROC curves pop up in a new window. These are displayed in Figs. 8.188 and 8.189. We see that SVM model accuracy is pretty high in both
Fig. 8.188 Output of the Analysis node for the SVM leukemia classifier
the training set and the test set, and the Gini also indicates good prediction ability. The latter is visualized by the ROC curves in Fig. 8.189. Although these statistics look pretty good, we wish to point out one fact that could be seen as a little drawback of this model. When just looking at the prediction performance of the "Healthy" patients, we note that only 11 out of 18 predictions are correct in the test set. This is just 61% correctness in this class, compared to 96% overall accuracy. This is caused by the high imbalance of the target classes, and the SVM slightly favors the majority class, i.e., "Cancer". One should keep that in mind when working with the model.
Exercise 2: Classification of Leukemia Types: The Choice of Kernel Problem
Name of the solution streams: gene_expression_leukemia_short_KernelFinding_svm
Theory discussed in section: Sect. 8.2, Sect. 8.5.1
The stream displayed in Fig. 8.190 is the complete solution of this exercise.
Fig. 8.189 ROC curves of the training set and test set of the SVM leukemia classifier
1. We follow the first part of the solution to the previous exercise, and open the template stream "019 Template-Stream gene_expression_leukemia_short", as seen in Fig. 8.182, and save it under a different name. This template already contains a Partition node, which splits the data into a training set and a test set in the usual ratio. We add a Distribution node to the Type node, to display the frequencies of the leukemia types. This can be viewed in Fig. 8.184. We then add an Auto Data Prep node to the stream and place it between the Partition and Type node. In this node, we enable the standardization via a z-transformation as described in Sect. 8.5.2. In Fig. 8.172, the settings for this data preprocessing step are shown. In the Type node, we set the roles of the variables as described in step 4 of the previous solution. See also Fig. 8.187. Now, the SVM nodes can automatically identify the target and input variables. 2. We add an SVM node to the stream and connect it to the Type node. As we intend to use a sigmoid kernel, we open the SVM node and go to the Expert tab. There, we enable the expert mode and choose "Sigmoid" as the kernel type. See Fig. 8.191. Here, we work with the default parameter settings, i.e., Gamma is 1 and Bias equals 0. See Sect. 8.5.2 and Table 8.11 for the meaning and influence of the parameters. Now, we run the stream and the model nugget appears.
Fig. 8.190 Complete stream of the leukemia detection classifier with sigmoid and RBF kernel
Fig. 8.191 Selection of the sigmoid kernel type in the SVM node
3. We add an Analysis node to the model nugget and set the usual evaluation statistics calculations. Note that the target variable is multinomial here, and no Gini or AUC can be calculated. See Sect. 8.2.6. We run the Analysis node and inspect the output statistics in Fig. 8.192. We observe that the training set and the test set are about 71–73% accurate, which is still a good value. When looking at the coincidence matrix, however, we see that the model only predicts the majority classes AML and CLL. The minority classes ALL, CML, and Non-leukemia are neglected and cannot be predicted by the SVM model with a sigmoid kernel.
Fig. 8.192 Accuracy and coincidence matrix in the SVM with sigmoid kernel for the leukemia data
Although the accuracy is quite good, the sigmoid kernel mapping of the data is therefore inadequate for the purpose of distinguishing between all five target classes. 4. We now train another SVM model that uses an RBF kernel. For that purpose, we add another SVM node to the stream and connect it to the Type node. As we want to apply the SVM with the RBF kernel and default parameters, no options have to be changed in the SVM node. We can just use the default settings provided by the SPSS Modeler. We run the SVM node so that the model nugget appears. After adding the usual Analysis node to the model nugget, we run the stream again and compare the accuracy statistics with those of the sigmoid model. See Fig. 8.193 for the coincidence matrix and the accuracy of the SVM model with an RBF kernel. As can be seen, the overall accuracy of the RBF model has increased (over 90%) compared with the sigmoid model. Furthermore, the RBF model takes
Fig. 8.193 Accuracy and coincidence matrix of the SVM with RBF kernel type
all target categories into account and can predict all of these classes quite accurately. In conclusion, the transformation specified by the sigmoid kernel is inappropriate because the SVM cannot identify the minority classes. The SVM with RBF kernel, on the other hand, is able to predict the majority as well as the minority classes. Hence, the second model describes the data better and is therefore preferred to the sigmoid one. This is an example of how the wrong choice of kernel function can lead to corrupted and inadequate models.
Exercise 3: Titanic Survival Prediction and Feature Engineering
Name of the solution streams: titanic_data_feature_generation_SVM
Theory discussed in section: Sect. 8.2, Sect. 8.5.1, Sect. 8.5.2
The stream displayed in Fig. 8.194 is the complete solution to this exercise. The stream contains commentaries that point to its main steps. 1. We start by opening the template stream “017 Template-Stream_Titanic” and saving it under a different name. See Fig. 8.195. The template already comprises a Partition node with a data split of 70% to training and 30 % to test. When thinking of a sinking ship, passengers in the upper decks have a better chance of getting to the lifeboats in time. Therefore, the deck of the passenger’s cabin can be a relevant variable. See Fig. 8.196 for an insight into the cabin
Fig. 8.194 Complete stream of the Titanic survival prediction stream with new feature generation
Fig. 8.195 Template stream of the Titanic data
variable. When a cabin number is present, it occurs just a few times (once or twice). Hence, the exact cabin number of a passenger is almost unique, and a consolidation of cabins on the same deck (first letter of the cabin number) can increase the prediction power, as it describes a more general structure. The sex of the passenger is already one of the variables that should have good prediction power, as a woman is more likely to survive a sinking ship than a man. There are more differences in survival indicators, e.g., masters are normally rescued before their servants. Furthermore, the probability of survival can differ for married or unmarried passengers, as, for example, a married woman may refuse to leave her husband on the ship. This information is hidden in the name variable. There, the civil status of the person, academic status, or aristocratic title is located after the "," (Fig. 8.197). Furthermore, when thinking of the chaotic situation on a sinking ship, families are separated, some get lost and fall behind, and passengers look for their relatives. Thus, it is reasonable to assume that the family size can have an influence on the survival probability. The two variables "sibsp" and "parch" describe the number of siblings/spouses and parents/children. So, the sum of these two variables gives the number of relatives who were traveling with the passenger.
Fig. 8.196 Distribution plot of the cabin variable of the Titanic data
Fig. 8.197 Insight into the name variable of the Titanic data
2. Figure 8.198 shows the stream that derives the three new variables "deck", "title", and "family size". In Fig. 8.194, this complete part of the stream is joined together into a SuperNode.
Fig. 8.198 SuperNode containing the part of the stream where the new features are derived
Fig. 8.199 Derive node that creates the variable “deck” from the variable “cabin”
First, we add a Derive node to the Type node, to extract the deck from the cabin number. In this node, we set the field type to nominal, since the values are letters. The formula to extract the deck can be seen in Fig. 8.199. If the cabin number is present, then the first character is taken as the deck, and otherwise, the deck is named with the dummy value “Z”. In Fig. 8.200, the distribution of the new deck variable is displayed. Next, we extract the title from the name variable. For that purpose, we add another Derive node to the stream, open it, and choose “Nominal” as the field type
Fig. 8.200 Frequencies of the deck values
Fig. 8.201 Derive node that extracts the title from the name
once again. The formula that dissects the title from the name can be seen in Fig. 8.201. The title is located between the characters “,” and “.”. Therefore, the locations of these two characters are established with the “locchar” statement, and
Fig. 8.202 Frequencies of the title values
then the sub-string between these two positions is extracted. Figure 8.202 visualizes the occurrence of the titles in the names. When reviewing Figs. 8.200 and 8.202, we note that there are some values of the new variables, "deck" and "title", that occur only once, in particular in the "title" variable. We assign these single values a similar, more frequently occurring value. To do that, we first add a Type node to the stream and click on the "read values" button, so the values of the two new variables are known in the succeeding nodes. Now, we add a Reclassify node to the stream and open it. We choose the "deck" variable as the reclassify field and enable the "reclassify into existing field" option. The latter ensures that the original values are overwritten and no new field is created. Then, we click on the "Get" button to get all the values of the "deck" variable. Lastly, we put all the existing values as new values, except for the value "T", which is assigned to the "Z" category. See Fig. 8.203. We proceed similarly with the title variable. We add a Reclassify node to the stream, select "Title" as the reclassify field, and make sure that the values are overwritten by the new values and no additional variable is created. We click on "Get" and assign the new categories to the original values. The following values were reclassified:
Fig. 8.203 Reclassification of the “deck” values
Old value → New value:
• Capt, Don, Major → Sir
• Dona, Jonkheer, the Countess → Lady
• Mme. → Mlle.
All other values remain the same. See Fig. 8.204. Finally, we add another Derive node to the stream that calculates the family size by simply adding the variables "sibsp" and "parch". See Fig. 8.205. In Fig. 8.206, the distribution of the "familySize" variable is displayed separately for the surviving and non-surviving passengers. We see that passengers with
Fig. 8.204 Reclassification of the “Title” values
smaller travelling families have a better chance of survival than passengers with a large family (Fig. 8.206). 3. We add an Auto Data Prep node to the stream and select the standard options for the data preparation, i.e., replacement of the missing values with the mean and mode, and a z-transformation for the continuous variables. See Fig. 8.207, and Sect. 2.7.6 for additional information on the Auto Data Prep node. After running the Auto Data Prep node, we add another Type node to the stream, in order to determine all the variable values, and we define the "survived" variable as our target variable and set the measurement type to "Flag".
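For reference, the three derivations from step 2 (deck, title, and family size) can be sketched in R as follows. This is a hedged illustration, not the Modeler's Derive node formulas; the column names match the Titanic dataset, while the extraction logic is an assumption.

```r
# Derive deck, title and familySize from the raw Titanic columns.
# `titanic` is assumed to be a data frame with columns cabin, name, sibsp, parch.
cabin <- as.character(titanic$cabin)
titanic$deck <- ifelse(is.na(cabin) | cabin == "",
                       "Z",                   # dummy value for missing cabins
                       substr(cabin, 1, 1))   # first letter of the cabin = deck

# the title sits between "," and "." in the name, e.g. "Braund, Mr. Owen Harris"
titanic$title <- sub("^.*,\\s*([^.]+)\\..*$", "\\1", as.character(titanic$name))

titanic$familySize <- titanic$sibsp + titanic$parch   # relatives travelling along

table(titanic$deck)    # frequencies of the deck values
table(titanic$title)   # frequencies of the title values
```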
Fig. 8.205 Calculation of the “familySize” variable from the “sibsp” and “parch” variables
Fig. 8.206 Histogram of the “familySize” variable
Fig. 8.207 Auto Data Prep node that does the data preparation steps in the stream to build a SVM classifier on the Titanic data
4. Now, we add four SVM nodes to the stream and connect them all to the last Type node. We open the first one, and in the Fields tab we define the variable "survived" as the target and the following variables as input: "sibsp_transformed", "parch_transformed", "age_transformed", "fare_transformed", "sex_transformed", "embarked_transformed", "pclass_transformed". Furthermore, we put "Partition" as the partition field. See Fig. 8.208. In the Analyze tab, we then enable the predictor importance calculations. We proceed with the other three SVM nodes in the same manner, but successively add the newly established variables "deck", "Title", and "familySize". We then run all four SVM nodes and rearrange the appearing model nuggets by connecting them into a series. See Fig. 8.209, where the alignment of the model nuggets is displayed. 5. We open the model nuggets one after another to determine the predictor importance. These are displayed in Figs. 8.210, 8.211, 8.212, and 8.213. We observe that in the model with only the original variables as input, the "sex" variable is the most important for survival prediction, followed by "pclass" and "embarked."
Fig. 8.208 Selection of the variable roles in the SVM node of the model without new established features
Fig. 8.209 Sub-stream with alignment of the model nuggets in a series
Fig. 8.210 Variable importance in the SVM detecting Titanic survival with no new features included
If the “deck” variable is considered as an additional input variable, the importance of the “sex” is reduced, but this variable is still the most important one. The second most important variable for the prediction is the new variable “deck”, however. This means that this new variable describes a new aspect in the data. When the “Title” variable is also included in the SVM model, it becomes the second most important one and further reduces the importance of the “sex” variable. Finally, “familySize” is the variable with the least predictor importance in the model that includes all variables, which means that it contributes some, but not very much new information to the classification problem. See Fig. 8.213. We now add a Filter node to the stream and connect it to the last model nugget. This is only done to rename the predictor fields. See Fig. 8.214.
Fig. 8.211 Variable importance in the SVM detecting Titanic survival with the “deck” variable included
We then add the Analysis and Evaluation nodes to the stream and connect them to the Filter node. See Sects. 8.2.6 and 8.2.7 for the options in these nodes. When inspecting the evaluation statistics from the Analysis node (Fig. 8.215), we observe that accuracy as well as the Gini increase successively in the training set and test set. There is just one exception in the test set statistics. When adding the “deck” variable, the accuracy and Gini are both a bit lower than when we exclude the “deck”. All in all, however, the newly generated features improve the prediction performance, more precisely, from 0.723 Gini points to 0.737 points in the test set and 0.68–0.735 in the training data. This improvement is visualized by the ROC curves of the four classifiers in Fig. 8.216. There, the model including all new generated variables lies above the other ones (at least most of the time).
Fig. 8.212 Variable importance in the SVM detecting Titanic survival with “deck” and “Title” included
Fig. 8.213 Variable importance in the SVM detecting Titanic survival, with “deck”, “Title”, and “familySize” included
Fig. 8.214 Renaming of the prediction fields in the Filter node for the SVM classifiers of the Titanic survivors
8.6
Neuronal Networks
Neural networks (NN) are inspired by the functionality of the brain. Like the brain, they consist of many connected units that receive multiple inputs from other units, process them, and pass new information on to yet other units. This network of units simulates the processes of the brain in a very basic way. Due to this relationship to the brain, the units are also called neurons, hence, neural network. A NN is a black box algorithm, just like the SVM, since the structure and mechanism of data transformation and transfer between neurons are complex and unintuitive. The results of a NN are difficult to retrace and therefore hard to interpret. On the other hand, its complexity and flexibility make the NN one of the most powerful and universal classifiers, which can be applied to a variety of problems where other methods, such as rule-based ones, would fail. The training of a NN requires the estimation of a huge number of parameters, which is why NNs are most powerful when dealing with large datasets; on the other hand, training a NN is therefore more time-consuming. Since computational power has increased massively
Fig. 8.215 Evaluation statistics calculated by the Analysis node for the four Titanic survival SVM classifiers
in recent years, making the runtime problem less relevant, the NN has become one of the most popular methods. The NN family shows its enormous strength in fields like speech and image recognition, where highly complex data structures are involved and other methods reach their limits. In the first section, we briefly describe the theoretical background of a NN, following Lantz (2013), before proceeding to look at its utilization in the SPSS Modeler.
Fig. 8.216 ROC curves of the four Titanic survival SVM classifiers
8.6.1
Theory
The concept of a NN is motivated by the human brain and its functionality. A NN is intended to simulate basic brain processes, and like its original, a NN consists of multiple neurons or units that process and pass information between each other.
Functionality of One Neuron and the Activation Function
During data processing, a neuron receives weighted signals from some other neurons and transforms the sum of these weighted signals into new information via an activation function φ. For example, in the illustration of this mechanism in Fig. 8.217, the input signals +1, x_1, x_2 are multiplied by the weights ω_0, ω_1, ω_2 and then added up. This sum is then transformed via the activation function φ and passed to the next neuron. Hence, the output of the neuron in the middle is
$$y = \varphi\left(\sum_{i=0}^{2} \omega_i x_i\right), \quad \text{where } x_0 = 1.$$
The input x_0 is added as a constant in the sum and is often called bias. The purpose of the weights is to regularize the contribution of each input signal to the
Fig. 8.217 Function of a neuron
sum. Since every neuron has multiple inputs with different weights, this gives huge flexibility in tuning the inputs individually for each neuron. This is one of the biggest strengths of the NN, qualifying their application to a variety of complex classification problems. The weights are not interpretable in their contribution to the results however, due to the complexity of the network. The activation function φ is typically the sigmoid function or the hyperbolic tangent function, i.e.,
$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}.$$
A linear function is also plausible, but a function that is linear in a neighborhood of 0 and nonlinear at the limits, as the two above-mentioned functions are, is a more suitable choice, since it adds a nonlinear component to the model. The SPSS Modeler uses only the tanh activation function, which unfortunately limits the possible NN architectures, see IBM (2019a).
The Topology and Layers of a NN
Besides the activation function, the topology, that is, the number of neurons and their connections to each other, is also important for the definition and functionality of the NN. In more detail, the neurons of a NN are structured in layers, between which the information is passed. There are three types of layers in a NN: an Input layer, one or multiple Hidden layer(s), and an Output layer. See Fig. 8.218 for a sketch of a typical NN with three layers. The input layer comprises the initial neurons, which receive the unprocessed raw data. Each neuron in the input layer is therefore responsible for handling one input
Fig. 8.218 Sketch of a typical neural network
variable, which is just passed to each neuron in the next layer untreated. The neurons in the output layer, on the other hand, receive the data, which were processed by multiple neurons in the network, and calculate a final score, e.g., a probability, and prediction for each target class. Each neuron in the output layer represents one target category and outputs the score for this category. To calculate the probabilities, a special activation function is typically assigned to the output layer, namely, the softmax activation function
$$y_i = \frac{\exp(x_i)}{\sum_{j=1}^{k} \exp(x_j)},$$
where y_i is the probability of class i (returned by neuron i in the output layer), k denotes the number of neurons in the output layer, and x_i describes the weighted sum of the signals coming from the neurons of the previous layer. Between the input and output layers, there can be one or multiple hidden layers. The neurons of these layers get the data from the neurons of the previous layer and process them as described above. The manipulated data are then passed to the neurons of the next hidden layer or the output layer. The SPSS Modeler provides a maximum of only two hidden layers, which is probably intended to reduce computational resources and time. However, this limits the flexibility when designing a NN architecture.
Necessary Conditions and Further Remarks
The NN as described until now is the most common and well-known Multilayer Perceptron (MLP) model. Other NN models exist, however, for example the Radial
Basis Function (RBF) network, which is also provided by the SPSS Modeler. In this model, the network consists of only one hidden layer and uses a distance measure, instead of the weighted sum, as the input for the activation function, which follows a Gaussian structure. For more information on the RBF network and the differences between the two network types, we refer to Tuffery (2011). Recalling the complex structure of the network and the large number of weights, and thus tuning parameters, the NN is one of the most flexible, powerful, and accurate data mining methods. These many parameters do cause drawbacks, however, since the NN is prone to overfitting. One has to be aware of this phenomenon and always use a test set to verify the generalization ability of the model. Due to this problem, many software applications, such as the SPSS Modeler, have implemented overfitting prevention in the NN, where a small part of the data is used to validate the model during training and thus warn if overfitting occurs. One of the greatest dangers with a NN is the possibility of a suboptimal solution. This results from the mechanism of parameter/weight estimation. The parameters are calculated with an approximation algorithm, called gradient descent, which can lead to a suboptimal solution. See Nielsen (2018) for information on training a NN with gradient descent. Usually, the inputs have to be continuous for the neurons to process the data, and therefore, categorical variables have to be translated into numeric ones in a preprocessing step. The SPSS Modeler, however, can also deal with categorical and discrete variables by estimating separate weights for each entity of such a variable. So, the user doesn't have to worry about this source of error. The NN can also be used for regression problems. We won't describe those situations here, but refer interested readers to Runkler (2012) and Cheng and Titterington (1994). For additional remarks and assumptions on the NN, in classification or regression cases, see Lantz (2013), Tuffery (2011), Cheng and Titterington (1994) and Nielsen (2018).
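To make the weighted sums, the tanh activation, and the softmax output described above concrete, here is a minimal R sketch of a single forward pass through a tiny MLP. The weights are random placeholders; in a trained network they are estimated from the data, and the layer sizes are purely illustrative.

```r
# Forward pass through an MLP with one hidden layer (tanh) and a softmax output
set.seed(1)
x  <- c(0.3, -1.2, 0.8)                 # one input record with 3 features

W1 <- matrix(rnorm(4 * 3), nrow = 4)    # weights: input layer -> 4 hidden neurons
b1 <- rnorm(4)                          # bias terms of the hidden layer
W2 <- matrix(rnorm(2 * 4), nrow = 2)    # weights: hidden layer -> 2 output neurons
b2 <- rnorm(2)                          # bias terms of the output layer

h <- tanh(W1 %*% x + b1)                # hidden layer: weighted sums + tanh
z <- W2 %*% h + b2                      # weighted sums of the output layer

softmax <- function(z) exp(z) / sum(exp(z))
y <- softmax(z)                         # class probabilities, summing to 1
y
```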
8.6.2
Building a Network with SPSS Modeler
A neural network (NN) can be trained with the Neural Network node in the SPSS Modeler. We now present how to set up the stream for a NN with this node, based on the digits recognition data, which comprises image data on handwritten digits from different people. The goal now is to build a classifier that is able to identify the correct digit from an arbitrary person’s handwriting. These classifiers are already in use in many areas, as described in Sect. 8.1.
Description of the model
Stream name: Neural_network_digits_recognition
Based on dataset: Optdigits_training.txt, optdigits_test.txt (see Sect. 12.1.29)
Stream structure: training part in Fig. 8.219, validation part in Fig. 8.231
Important additional remarks: For a classification model, the target variable should be categorical. Otherwise, a regression model should be trained by the Neural Network node. See Runkler (2012) and Cheng and Titterington (1994) for regression with a NN.
Related exercises: All exercises in Sect. 8.6.4
The stream consists of two parts, the training and the validation of the model. We therefore split the description into two parts also. Training of a NN Here, we describe how to build the training part of the stream. This is displayed in Fig. 8.219. 1. We start by opening the template stream “020 Template-Stream_digits” and saving it under a different name. The template stream consists of two parts, in which the training (“optdigits_training.txt”) and test sets (“optdigits_test.txt”) are imported. See Fig. 8.220. A Filter node is then attached to each Source node, to rename the field with the digit labels, i.e., Field65 becomes “Digit”. This field is then assigned in the Type node as the target variable. See Fig. 8.221 for definition of the target variable and its values.
Fig. 8.219 Training part of the stream, which builds a digit identification classifier
Fig. 8.220 Template stream of handwritten digits data
2. We now concentrate on the training stream and add a Distribution node to the Type node, to display how often each digit occurs within the training set. For the description of the Distribution node, see Sect. 3.2.2. The frequencies of the handwritten digits can be viewed in Fig. 8.222. We note that the digits 0–9 appear almost equally in the training data. 3. Now we add a Neural Network node to the stream and connect it to the Type node. In the Fields tab of the node options, the target and input variables for the NN can be defined. Here, the “Digit” variable is the target variable that contains
Fig. 8.221 Type node of the digits data and assignment of the target field and values
Fig. 8.222 Frequency of the digits within the training set
Fig. 8.223 Definition of the target and input variables in the Neural Network node
the digit label for each handwritten data record. All other variables, i.e., “Field1” to “Field64”, are treated as inputs in the network. See Fig. 8.223. In the Build Option tab, the parameters for the model training process are defined. Firstly, in the Objectives options, we can choose between building a new model and continuing to train an existing one. The latter is useful if new data are available and a model has to be updated, to avoid building the new model from scratch. Furthermore, we can choose to build a standard or an ensemble model. For a description of ensemble models, boosting and bagging, we refer to Sect. 5. 3.6. Here, we intend to train a new, standard model. See Fig. 8.224. In the Basic options, the type of the network, with its activation function and topology, has to be specified. The Neural Network node provides the two network models, MLP and RBF; see Sect. 8.6.1 for a description of these two model types. We choose the MLP, which is the default setting and the most common one. See Fig. 8.225. The number of hidden layers and the unit size for each layer can be specified here too. Only networks with a maximum of 2 hidden layers can be built
Fig. 8.224 Selection of the model objective in the Neural Network node
with the Neural Network node, however. Furthermore, the SPSS Modeler provides an algorithm that automatically determines the number of layers and units. This option is enabled by default and we accept. See bottom arrow in Fig. 8.225. We should point out that automatic determination of the network topology is not always optimal, but a good choice to go with in the beginning. With the next options, the stopping rules of the network training can be defined. Building a neural network can be time- and resource consuming. Therefore, the SPSS Modeler provides a couple of possibilities for terminating the training process at a specific time. These include a maximum training time, a maximum number of iterations of the coefficient estimation algorithm, and a minimum accuracy. The latter can be set if a particular accuracy has adequate prediction power. We choose the default setting and fix a maximum processing time of 15 min. See Fig. 8.226.
Fig. 8.225 Definition of the network type and determination of the layer and unit number
In the Ensemble options, the aggregation function and number of models in the ensemble can be specified for bagging and boosting ensembles. See Fig. 8.227. These options are only relevant if an ensemble model is trained. The available aggregation options for a categorical target, as in classifiers, are listed in Table 8.12. For additional information on ensemble models, as well as boosting and bagging, we refer to Sect. 5.3.6. In the Advanced option view, the size of the overfitting prevention set can be specified; 30% is the default setting. Furthermore, a NN is unable to handle missing values. Therefore, a missing values handling method should be specified. The options here are the deletion of data records with missing values or the replacement of missing values. For continuous variables, the average of the minimum and maximum values is imputed, while for categorical fields, the most frequent category is imputed. See Fig. 8.228 for the Advanced option view of the Neural Network node. In the Model Options tab, the usual calculation of predictor importance can be enabled, which we do in this example. See Fig. 8.229. 4. The option setting for the training process is now completed and we run the stream, thus producing the model nugget. The model nugget, with its graphics and statistics, is explained in the subsequent Sect. 8.6.3.
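As a rough counterpart outside the Modeler, the following hedged R sketch trains a single-hidden-layer MLP on the digits data with the nnet package. This is not the Modeler's algorithm or its default settings; the data frame name and the parameter values are illustrative.

```r
# Single-hidden-layer MLP on the digits data with nnet.
# `train` is assumed to be a data frame with the factor target `Digit`
# and the 64 pixel fields as numeric inputs.
library(nnet)

set.seed(42)
fit <- nnet(Digit ~ ., data = train,
            size = 18,          # 18 hidden neurons, as in the example nugget
            decay = 5e-4,       # small weight decay against overfitting
            maxit = 200,        # iteration cap (a simple stopping rule)
            MaxNWts = 2000)     # allow enough weights for 64 inputs

pred <- predict(fit, train, type = "class")
mean(pred == train$Digit)       # training accuracy
```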
Fig. 8.226 Setting of the stopping rules for the training process in the Neural Network node
5. We now add an Analysis node to the stream, connect it to the model nugget, and enable the calculation of the coincidence matrix. See Sect. 8.2.6 for a description of the Analysis node. The output of the Analysis node can be viewed in Fig. 8.230. We see that accuracy is extremely high, with a recognition rate of over 97% on handwritten digits. On the coincidence matrix, we can also see that the prediction is very precise for all digits. In other words, there is no digit that falls behind significantly in the accuracy of the prediction by the NN. Validation of the NN Now, we validate the trained model from part one with a test set. Figure 8.231 shows the stream. Since the validation of a classifier is part of the modeling process, we continue the enumeration of the model training here.
Fig. 8.227 Definition of ensemble model parameters in the Neural Network node
Table 8.12 Aggregation mechanisms for the ensemble models of a classifier in the Neural Network node
• Voting: The category that is predicted most often by the individual models wins.
• Highest probability wins: The category with the highest probability over all models is predicted.
• Highest mean probability: The probabilities for each category are averaged over all models, and the category with the highest average wins.
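To illustrate the three aggregation rules of Table 8.12, here is a small R sketch for a single record scored by three ensemble members (the probabilities are made up for illustration):

```r
# Class probabilities of three models for one record and two classes A and B
probs <- rbind(model1 = c(A = 0.6, B = 0.4),
               model2 = c(A = 0.3, B = 0.7),
               model3 = c(A = 0.8, B = 0.2))

votes <- apply(probs, 1, function(p) names(which.max(p)))  # each model's prediction
names(which.max(table(votes)))          # voting: most frequently predicted category
names(which.max(apply(probs, 2, max)))  # highest probability wins
names(which.max(colMeans(probs)))       # highest mean probability
```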
6. First, we copy the model nugget and paste it into the stream canvas. Afterwards, we connect the new nugget to the Type node of the stream segment that imports the test data (“optdigits_test.txt”). 7. We then add another Analysis node to the stream and connect it to the new nugget. After setting the option in the Analysis node, see Sect. 8.2.6, we run it and
Fig. 8.228 Prevention of overfitting and the setting of missing values handling in the Neural Network node
the validation statistics open in another window. See Fig. 8.232 for the accuracy on the test set. We observe that the NN still predicts the digits very precisely, with over 94% accuracy, without neglecting any digit. Hence, we see that the digits recognition model is applicable to independent data and can identify digits from unfamiliar handwriting.
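The coincidence matrix and accuracy reported by the Analysis node can be reproduced in a few lines of R, assuming vectors `actual` and `predicted` that hold the true and predicted digit labels of the test set (names are illustrative):

```r
# Coincidence (confusion) matrix and overall accuracy for the test set
cm <- table(actual = actual, predicted = predicted)
cm                          # rows: true digits, columns: predicted digits
sum(diag(cm)) / sum(cm)     # overall accuracy
```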
8.6.3
The Model Nugget
In this section, we introduce the contents of the Neural Network model nugget. All graphs and statistics from the model are located in the Model tab, which is described in detail below.
Fig. 8.229 The predictor importance calculation is enabled in the Model Option tab of the Neural Network node
Model Summary The first view in the Model tab of the nugget displays a summary of the trained network. See Fig. 8.233. There, the target variable and model type, here “Digit” and MLP, are listed as well as the number of neurons in every hidden layer that was included in the network structure. In our example, the NN contains one hidden layer with 18 neurons. This information on the number of hidden layers and neurons is particularly useful when the SPSS Modeler automatically determined them. Furthermore, the reason for stopping is displayed. This is important to know, as a termination due to time issues or overfitting reasons, instead of an “Error cannot be further decreased” stopping, means that the model is not optimal and can be improved by adjusting parameters or having a larger run-time. Basic information on the accuracy of the NN on the training data is displayed below. Here, the handwriting digits NN classifier has a 97.9% accuracy. See Fig. 8.233.
Fig. 8.230 Analysis node statistics for the NN classifier on the digits training dataset
Fig. 8.231 Validation part of the stream, which builds a digit classifier with a NN
Predictor Importance
The next view displays the importance of the input variables in the NN. See Fig. 8.234. This view is similar to the one in the Logistic node, and we refer to Sect. 8.3.4 for a description of this graph. At the bottom of the graph there is a slider with which the number of displayed input fields can be selected. This is convenient when the model includes many variables, as in our case with the digits data. We see that nearly all fields are equally important for digit identification.
Coincidence Matrix
In the classification view, the predicted values are displayed against the observed values in a heat map. See Fig. 8.235. The background color intensity of a cell thus correlates with its proportion of cross-classified data records. The entries of the
Fig. 8.232 Analysis node statistics for the NN classifier on the digits test set
Model Summary: Target: Digit; Model: Multilayer Perceptron; Stopping Rule Used: Error cannot be further decreased; Hidden Layer 1 Neurons: 18; Accuracy: 97.9%
Fig. 8.233 Summary of the NN in the model nugget
Fig. 8.234 Predictor importance in the NN model nugget
Fig. 8.235 Heat map of the classification of the NN in the model nugget
matrix can be changed at the bottom, see arrow in Fig. 8.235. Depending on the selected option, the matrix displays the percentage of correctly identified values for each target category, the absolute counts, or just a heat map without entries.
Network Structure
The Network view visualizes the constructed neural network. See Fig. 8.236. This network can be very complex, with a lot of neurons in each layer, especially in the input layer. Therefore, only a portion of the input variables, e.g., the most important ones, can be selected for display with the slider at the bottom. Furthermore, the alignment of the drawn network can be changed at the bottom, from horizontal to vertical or bottom-to-top orientation. See the left arrow in Fig. 8.236. Besides the structure of the network, the estimated weights or coefficients within the NN can be displayed in network form. See Fig. 8.237. One can switch between
Fig. 8.236 Visualization of the NN architecture in the model nugget
Fig. 8.237 Visualization of the coefficients of the NN in the model nugget
these two views with the select list. See bottom right arrow in Fig. 8.236. Each connecting line of the coefficients network represents a weight, which is displayed when the mouse cursor is moved over it. Each line is also colored; darker tones indicate a positive weight, and lighter tones indicate a negative weight.
8.6.4
Exercises
Exercise 1: Prediction of Chess Endgame Outcomes and Comparison with Other Classifiers
The dataset “chess_endgame_data.txt” contains 28,056 chess endgame positions, in which only the white king, a white rook, and the black king are left on the board (see Sect. 12.1.7). The goal of this exercise is to train a NN to predict the outcome of such endgames, i.e., whether white wins or black achieves a draw. The variable “Result for White” describes the number of moves white needs to win, or indicates that the game ends in a draw.
1. Import the chess endgame data with a proper Source node and reclassify the “Result for White” variable into a binary field that indicates whether white wins the game or not. What is the proportion of “draws” in the dataset?
2. Train a neural network with 70% of the chess data and use the other 30% as a test set. What are the accuracy and Gini values for the training set and test set? Is the classifier overfitting the training set?
3. Build a SVM and a Logistic Regression model on the same training data. What are the accuracy and Gini values for these models? Compare all three models with each other and plot the ROC curve for each of them.
Exercise 2: Credit Rating with a Neural Network and Finding the Best Network Topology
The “tree_credit” dataset (see Sect. 12.1.38) comprises demographic and historical loan data from bank customers, as well as a prognosis on creditworthiness (“good” or “bad”). In this exercise, a neural network has to be trained that decides whether the bank should grant a certain customer a loan or not.
1. Import the credit data and set up a cross-validation scenario with training, test, and validation sets, in order to compare two networks with different topologies.
2. Build a neural network to predict the credit rating of the customers. To do this, use the default settings provided by the SPSS Modeler and, in particular, the automatic hidden layer and unit determination method. How many hidden layers and units are included in the model and what is its accuracy?
3. Build a second neural network with custom-defined hidden layers and units. Try to improve the performance of the automatically determined network. Is there a setup with a higher accuracy on the training data, and what is its topology?
4. Compare the two models, with automatic and custom determination of the network topology, by applying the trained models to the validation and test sets. Identify the Gini values and accuracy for both models. Draw the ROC curve for both models.
Exercise 3: Construction of Neural Networks Simulating Logical Functions
Neural networks have their origin in calculating logical operations, i.e., AND, OR, and NOT. We explain this in the example of the AND operator. Consider the simple network shown in Fig. 8.217, with only a single neuron in the hidden layer and two input variables x1, x2. Let us assume that these variables can take only the values 0 or 1. The activation function φ is the sigmoid function, i.e.,

φ(x) = 1 / (1 + e^(−x)).

The task now is to assign values to the weights ω0, ω1, ω2, such that φ(x) ≈ 1 if x1 = 1 and x2 = 1, and φ(x) ≈ 0 otherwise.
A proper solution for the AND operator is shown in Fig. 8.238. When looking at the four possible input values and calculating the output of this NN, we get

x1   x2   φ(x)
0    0    φ(−200) ≈ 0
1    0    φ(−50) ≈ 0
0    1    φ(−50) ≈ 0
1    1    φ(100) ≈ 1

Fig. 8.238 Neural network for the logical AND
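The table can be verified in a few lines of Python. The weights below (ω0 = −200, ω1 = 150, ω2 = 150) are one possible choice that reproduces the arguments in the table; they are an assumption and need not be the exact values shown in Fig. 8.238.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed weights consistent with the table: bias w0 and one weight per input.
w0, w1, w2 = -200, 150, 150

for x1, x2 in itertools.product([0, 1], repeat=2):
    z = w0 + w1 * x1 + w2 * x2
    print(x1, x2, round(sigmoid(z), 4))   # close to 1 only for x1 = x2 = 1
```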
Construct a simple neural network for the logical OR and NOT operators by proceeding in the same manner as just described for the logical AND. Hint: the logical OR here is the inclusive OR (not the exclusive XOR), which means the output of the network has to be nearly 0 if and only if both input variables are 0.
8.6.5
Solutions
Exercise 1: Prediction of Chess Endgame Outcomes and Comparison with Other Classifiers
Name of the solution streams: chess_endgame_prediction_nn_svm_logit
Theory discussed in section: Sect. 8.2, Sect. 8.6.1, Sect. 8.3.1, Sect. 8.5.1
Figure 8.239 shows the final stream for this exercise. 1. First, we import the dataset with the File Var. File node and connect it to a Type node. Then, we open the latter one and click the “Read Values” button, to make sure the following nodes know the values and types of variables in the data. Afterwards, we add a Reclassify node and connect it to the Type node. In the Reclassify node, we select the “Result for White” field, since we intend to change its values. By clicking on the “Get” button, the original values appear and can be assigned to another value. In the New value column, we write “win” next to each original value that represents a win for white, i.e., those which are not a “draw”. The “draw” value, however, remains the same in the newly assigned variable. See Fig. 8.240. We now connect a Distribution node to the Reclassify node, to inspect how often a “win” or “draw” occurs. See Sect. 3.2.2 for how to plot a bar plot with the Distribution node. In Fig. 8.241, we observe that a draw occurs about 10% of the time in the present chess dataset.
Fig. 8.239 Stream of the chess endgame prediction exercise
Fig. 8.240 Reclassification of the “Results for White” variable to a binary field
2. We add another Type node to the stream after the Reclassify node, and set the “Result for White” variable as the target field and its measurement type to “Flag”, which ensures a calculation of the Gini values for the classifiers hereinafter. See Fig. 8.242. We now add a Partition node to the stream to split the data into a training set (70%) and test set (30%). See Sect. 2.7.7 for how to perform this step in the Partition node. Afterwards, we add a Neural Network node to the stream and connect it to the Partition node. Since the target and input variables are defined in the preceding Type node, the roles of the variables should be automatically identified by the Neural Network node and, hence, appear in the right role field. See Fig. 8.243.
Fig. 8.241 Distribution of the reclassified “Result for White” variable
Fig. 8.242 Definition of the target variable and setting its measurement type to “Flag” for the chess endgame dataset
We furthermore use the default settings of the SPSS Modeler, which in particular include the MLP network type, as well as an automatic determination of the neurons and hidden layers. See Fig. 8.244. We also make sure that the predictor importance calculation option is enabled in the Model Options tab. See how this is done in Fig. 8.229. Now, we run the stream and the model nugget appears. In the following, the model nugget is inspected and the results and statistics interpreted.
Fig. 8.243 Target and input variable definition in the Neural Network node for the chess endgame classifier
Fig. 8.244 Network type and topology settings for the chess outcome prediction model
Model Summary: Target: Result for White; Model: Multilayer Perceptron; Stopping Rule Used: Error cannot be further decreased; Hidden Layer 1 Neurons: 7; Accuracy: 99.5%
Fig. 8.245 Summary of the neural network classifier that predicts the outcome of a chess game
What strikes our attention first is the extremely high accuracy of 99.5% correct chess outcome predictions on the training data. See Fig. 8.245. To rule out overfitting, we have to inspect the statistics for the test set later in the Analysis node. These are, however, almost as good as the accuracy here (see Fig. 8.248), so we can assume that the model is not overfitting the training data. Moreover, we see in the model summary overview that one hidden layer with 7 neurons was included in the network. See Fig. 8.245. When inspecting the predictor importance, we find that the positions of the white rook and the black king are important for the prediction, while the white king’s position on the board plays only a minor role in the outcome of the game. See Fig. 8.246. In the classification heat map of absolute counts, we see that only 100 out of 19,632 game outcomes are misclassified. See Fig. 8.247. To also calculate the accuracy on the test set, and the Gini values for both datasets, we add an Analysis node to the neural network model nugget. See Sect. 8.2.6 for the setting options in the Analysis node. In Fig. 8.248, the output of the Analysis node is displayed, and we verify that the model performs as well on the test set as on the training set. More precisely, the accuracies are 99.49% and 99.48%, and the Gini values 0.999 and 0.998, for the training and test set, respectively. Hence, the model can be applied to unknown data without a loss of prediction power.
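The Gini values reported by the Analysis node are related to the AUC (in the usual definition, Gini = 2·AUC − 1). As an illustration outside the Modeler, the snippet below computes both quantities with scikit-learn; y_true and p_win are placeholder arrays standing in for the observed outcomes and the predicted probability of a win.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder data: 1 = win for white, 0 = draw, with invented predicted probabilities.
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0])
p_win  = np.array([0.95, 0.85, 0.30, 0.75, 0.40, 0.90, 0.65, 0.20])

auc = roc_auc_score(y_true, p_win)   # area under the ROC curve
gini = 2 * auc - 1                   # Gini coefficient corresponding to this AUC
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
```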
Fig. 8.246 Importance of the pieces’ positions on the chessboard for the chess outcome prediction
Classification for Result for White: Overall Percent Correct = 99.5%. Of the observed outcomes, 1878 draws and 17,654 wins are classified correctly; the remaining 100 records are misclassified.
Fig. 8.247 Heat map of counts of predicted versus observed outcomes of the chess endgame classifier
3. To build a SVM and Logistic Regression model on the same training set, we add a SVM node and a Logistic node to the stream and connect each of them to the Partition node. In the SVM node, no options have to be modified, while in the Logistic node, we choose the “Stepwise” variable selection method. See Fig. 8.249. Afterwards, we run the stream.
Fig. 8.248 Analysis node statistics of the chess outcome neural network classifier
Before comparing the prediction performances of the three models, NN, SVM, and logistic regression, we rearrange the model nuggets and connect them in a series, so that the models are executed successively on the data. Compare the rearrangement of the model nuggets with Fig. 8.147. We now add an Analysis node and an Evaluation node to the last model nugget in this series and run these two nodes. See Sects. 8.2.6 and 8.2.7 for a description of these two nodes. The outputs are displayed in Figs. 8.250, 8.251, and 8.252. When looking at the accuracy of the three models, we notice that the NN performs best here; it has over 99% accuracy on the training set and test set, followed by the SVM with about 95–96% accuracy, and lastly the logistic regression, which still achieves a very good accuracy of about 90% in both datasets. The coincidence matrix, however, gives a more detailed insight into the prediction performance and reveals the actual weakness of the last model. While the NN and SVM are able to detect both “win” and “draw” outcomes, the logistic regression model classifies every game as a win for white. See Fig. 8.250. This still yields a good accuracy, since only about 10% of the games end in a draw (recall Fig. 8.241),
Fig. 8.249 Variable selection method in the Logistic node for the model predicting chess outcomes
but this is merely a consequence of the imbalanced class distribution, not of real discriminative power. The reason for this behavior could be that the chess problem is nonlinear, so a linear classifier such as logistic regression is unable to separate the two classes from each other, an effect intensified by the imbalance in the data. For this chess problem, a nonlinear classifier performs better, and this is also confirmed by the Gini values of the three models in Fig. 8.251. The Gini values of the NN and SVM are very high and close to 1, while the Gini of the logistic regression, at about 0.2, is noticeably smaller. This also indicates an abnormality in the model. The Gini, or equivalently the AUC, is visualized by the ROC curves in Fig. 8.252. The ROC curves of the NN and SVM are almost perfect, while the ROC curve of the logistic regression runs clearly beneath the other two. In conclusion, the problem of predicting the outcome of a chess endgame is very complex, and linear methods reach their limits here. A NN, on the other hand, is well suited to such problems and outperforms the other methods, especially the logistic regression, as shown in this exercise.
Fig. 8.250 Accuracy of the neural network, SVM, and logistic regression chess outcome classifier
Fig. 8.251 AUC and Gini of the neural network, SVM, and logistic regression chess outcome classifier
Fig. 8.252 ROC curves of the neural network, SVM, and logistic regression chess outcome classifier
Exercise 2: Credit Rating with a Neural Network and Finding the Best Network Topology
Name of the solution streams: tree_credit_nn
Theory discussed in section: Sect. 8.2, Sect. 8.6.1
Figure 8.253 shows the final stream in this exercise. 1. We start by opening the stream “000 Template-Stream tree_credit”, which imports the tree_credit data and already has a Type node attached to it, and save it under a different name. See Fig. 8.254.
Fig. 8.253 Stream of credit rating prediction with a NN exercise
Fig. 8.254 Template stream for the tree_credit data
Fig. 8.255 Definition of target and input fields in the Neural Network node
To set up a cross-validation with training, validation, and testing data, we add a Partition node to the stream and place it between the Source node and Type node. Then, we open the node and define 60% of the data as training data, 20% as validation data, and the remaining 20% as test data. See Sect. 2.7.7 for a description of the Partition node. 2. We open the Type node and define the measurement type of the variable’s credit rating as “Flag” and its role as “Target”. This is done as in the previous solution, see Fig. 8.242. Now, we add a Neural Network node to the stream and connect it to the Type node. The variable roles are automatically identified, see Fig. 8.255, and since we use the default settings, nothing has to be modified in the network
Fig. 8.256 Default network type and topology setting in the Neural Network node
Model Summary: Target: Credit rating; Model: Multilayer Perceptron; Stopping Rule Used: Error cannot be further decreased; Hidden Layer 1 Neurons: 6; Accuracy: 76.4%
Fig. 8.257 Summary of the NN, with automatic topology determination that predicts credit ratings
settings. In particular, we use the MLP network type and automatic topology determination. See Fig. 8.256. Now we run the stream. In the Model summary view of the model nugget, we see that one hidden layer with six neurons was included during training. See Fig. 8.257. We also notice that the model has an accuracy of 76.4% on the training data. The network is visualized in Fig. 8.258.
Fig. 8.258 Visualization of the NN, with automatic topology determination that predicts credit ratings
3. We add a second Neural Network node to the stream and connect it to the Type node. Unlike with the first Neural Network node, here we manually define the number of hidden layers and units. We choose two hidden layers with ten and five neurons, respectively, as our network structure. See Fig. 8.259. The network type remains MLP. We run this node and open the model nugget that appears. The accuracy of this model with two hidden layers (ten units in the first one and five in the second) has increased to 77.9%. See Fig. 8.260. Hence, the model performs better on the training data than the automatically established network. Figure 8.261 visualizes the structure of this network with two hidden layers. 4. To compare the two models with each other, we add a Filter node to each model nugget and rename the prediction fields with meaningful names. See Fig. 8.262 for the Filter node after the model nugget with automatic network determination. The setting of the Filter node for the second model is analogous, except for the inclusion of the “Credit rating” variable, since this is needed to calculate the evaluation statistics. With a Merge node, we combine the predictions of both models. We refer to Sect. 2.7.9 for a description of the Merge node. We add the usual Analysis and Evaluation nodes to the Merge node and set the usual options for a binary target
Fig. 8.259 Manually defined network topology in the Neural Network node for the credit rating classifier
Model Summary: Target: Credit rating; Model: Multilayer Perceptron; Stopping Rule Used: Error cannot be further decreased; Hidden Layer 1 Neurons: 10; Hidden Layer 2 Neurons: 5; Accuracy: 77.9%
Fig. 8.260 Summary of the NN, with manually defined topology that predicts credit ratings
Fig. 8.261 Visualization of the NN, with manually defined topology that predicts credit ratings
Fig. 8.262 Filter node to rename the prediction fields of the NN for the credit rating classifiers
Fig. 8.263 Evaluation statistics in the Analysis node from the two neural networks that predict credit ratings
Fig. 8.264 ROC curves of the two neural networks that predict credit ratings
variable, by recalling Sects. 8.2.6 and 8.2.7. In Fig. 8.263, the accuracy and Gini measures can be viewed for both models and all three subsets. We notice that for the network with a manually defined structure, the accuracy is higher in all three sets (training, validation, and test), as are the Gini values. This is also visualized by the ROC curves in Fig. 8.264, where the curves of the manually defined
topology network lie above those of the automatically determined network. Hence, a network with two hidden layers of ten and five units describes the data slightly better than a network with one hidden layer containing six neurons.
Exercise 3: Construction of Neural Networks Simulating Logical Functions
The logical “OR” network
A logical “OR” network is displayed in Fig. 8.265. When looking at the four possible input values and calculating the output of the neuron, we get

x1   x2   φ(x)
0    0    φ(−100) ≈ 0
1    0    φ(100) ≈ 1
0    1    φ(100) ≈ 1
1    1    φ(300) ≈ 1

Fig. 8.265 Neural network for the logical “OR”

The logical “NOT” network
A logical “NOT” network has only one input variable and should output 1 if the input is 0, and vice versa. A solution is displayed in Fig. 8.266. When looking at the two possible input values and calculating the output of the neuron, we get

x1   φ(x)
0    φ(100) ≈ 1
1    φ(−100) ≈ 0
Fig. 8.266 Neural network for the logical “NOT”
8.7
K-Nearest Neighbor
The k-nearest neighbor (kNN) algorithm is nonparametric and one of the simplest among the classification methods in machine learning. It is based on the assumption that data points in the same area are similar to each other, and thus, of the same class. So, the classification of an object is simply done by majority voting within the data points in a neighborhood. The theory and concept of kNN are described in the next section. Afterwards, we turn to the application of kNN on real data with the SPSS Modeler.
8.7.1
Theory
The kNN algorithm is nonparametric, which means that no model parameters have to be calculated. A kNN classifier is trained by simply storing the training data together with the values of the involved features. This learning technique is also called lazy learning. Training a model is therefore very fast. In return, however, the prediction of new data points can be very resource- and time-consuming, as a large number of distances have to be calculated. We refer to Lantz (2013) and Peterson (2009) for information that goes beyond our short introduction here.
Description of the Algorithm and Selection of k
The kNN is one of the simplest methods among machine learning algorithms. Classification of a data point is done by identifying the k nearest data points and counting the frequency of each class among these k nearest neighbors. The class occurring most often wins, and the data point is assigned to this class. In the left graph of Fig. 8.267, this procedure is demonstrated for a 3-nearest neighbor algorithm. The data point marked with a star has to be assigned to either the circle or the rectangle class. The three nearest neighbors of this point are identified as one rectangle and two circles. Hence, the star point is classified as a circle in the case of k = 3. The above example already points to a difficulty, however, namely that the choice of k, i.e., the size of the neighborhood, massively affects the classification. This is
Fig. 8.267 Visualization of the kNN method for different k: k = 3 in the left and k = 1 in the right graph
caused by the kNN algorithm’s sensitivity to the local structure of the data. For example, in the right graph in Fig. 8.267, the same data point as before has to be classified (the star), but this time, a 1-nearest neighbor method is used. Since the nearest data point is a rectangle, the star point is classified as a rectangle. So, the same data point has two different classifications, a circle or a rectangle, depending on the choice of k. Choosing the right k is therefore crucial, but unfortunately not straightforward. A small k will take into account only points within a small radius and thus give each neighbor a very high importance when classifying. In so doing, however, it makes the model prone to noisy data and outliers. If, for example, the rectangle nearest to the star is an outlier in the right graph of Fig. 8.267, the star will probably be misclassified, since it would rather belong to the circle group. If k is large, then the local area around a data point becomes less important and the model becomes more stable and less affected by noise. On the other hand, more attention will then be given to the majority class, as a huge number of data points are engaged in the decision making. This can be a big problem for skewed data. In this case, the majority class most often wins, suppressing the minority class, which is thus ignored in the prediction. Of course, the choice of k depends upon the structure of the data and the number of observations and features. In practice, k is typically set somewhere between three and ten. A common choice for k is the square root of the number of observations. This is just a rule of thumb that has turned out to be a good choice of k for many problems, but it does not have to be the optimal value and can even result in poor predictions. The usual way to identify the best k is via cross-validation. That is, various models are trained for different values of k and validated on an independent set. The model with the lowest error rate is then selected as the final model.
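The core of the algorithm, compute distances, take the k closest training points, and let them vote, can be written in a few lines of Python. This is a minimal sketch with made-up two-dimensional data, not the implementation used by the KNN node; it merely reproduces the effect shown in Fig. 8.267, where k = 1 and k = 3 give different answers.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances to all points
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: a cloud of circles, one rectangle "outlier" close to it, and two distant rectangles.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.3, 1.05], [3.0, 3.0], [3.2, 2.9]])
y_train = np.array(["circle", "circle", "circle", "rectangle", "rectangle", "rectangle"])

star = np.array([1.25, 1.0])
print(knn_predict(X_train, y_train, star, k=1))   # rectangle (nearest point is the outlier)
print(knn_predict(X_train, y_train, star, k=3))   # circle (majority of the 3 neighbors)
```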
Fig. 8.268 Distance between two data points in a 2-dimensional space, and visualization of the Euclidian and City-block distance
Table 8.13 Overview of the distance measures
Object x and object y are described by (variable1, variable2, . . ., variablen) = (x1, x2, . . ., xn) and (y1, y2, . . ., yn). Using the vector components xi and yi, the metrics are defined as follows:
Euclidean distance: d(x, y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)^2 )
City-block metric (Manhattan metric): d(x, y) = Σ_{i=1}^{n} |x_i − y_i|
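Both metrics are easy to implement; the following sketch mirrors the formulas in Table 8.13 with plain NumPy and, as a check, reproduces the John–Frank example used in the paragraphs below.

```python
import numpy as np

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)"""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def city_block(x, y):
    """d(x, y) = sum_i |x_i - y_i| (Manhattan metric)"""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

# Age, income, and number of credit cards of two customers (values from the example below).
print(euclidean([21, 2000, 5], [34, 3500, 6]))   # approximately 1500.06
print(city_block([21, 2000, 5], [34, 3500, 6]))  # 1514.0
```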
Distance Metrics
A fundamental aspect of the kNN algorithm is the metric with which the distance between data points is calculated. The most common distance metrics are the Euclidean distance and the City-block distance. Both distances are visualized in Fig. 8.268. The Euclidean metric describes the usual distance between data points and is the length of the black solid line between the two points in Fig. 8.268. The City-block distance is the sum of the distances between the points in every dimension. This is indicated by the dotted line in Fig. 8.268. The City-block distance can also be thought of as the way a person has to walk in Manhattan to get from one street corner to another. This is why this metric is also called the Manhattan metric. Both distance formulas are shown in Table 8.13, and we also refer to the Clustering Chap. 7 and IBM (2019a) for further information.
Feature Normalization
A problem that occurs when calculating distances from different features is the scaling of those features. The values of different variables are located on differing
Fig. 8.269 Examples of bank customers
scales. For example, consider the data in Fig. 8.269 of bank customers who could be rated for creditworthiness. When calculating the distance between two customers, it is obvious that “Income” dominates the distance, as the variation between customers in this variable is on a much larger scale than the variation in the “Number of credit cards” or the “Age” variable. The Euclidean distance between John and Frank, for example, is

d(John, Frank) = sqrt((21 − 34)^2 + (2000 − 3500)^2 + (5 − 6)^2) ≈ 1500.06.

So, the contribution of the “Income” difference to the squared distance is

(2000 − 3500)^2 / d(John, Frank)^2 = 1500^2 / 1500.06^2 ≈ 0.99.

The consequence of this is that a change in the number of credit cards that John or Frank owns has nearly no effect on the distance between these two customers, although this might have a huge influence on their credit scoring. To prevent this problem, all features are transformed before being entered into the kNN algorithm, so they all lie on the same scale and thus contribute equally to the distance. This process is called normalization. One of the most common normalization methods is the min–max normalization, that is

x_norm = (x − min(X)) / (max(X) − min(X))

for a value x of the feature X. The KNN node of the SPSS Modeler provides an adjusted min–max normalization, namely

x_norm = 2 · (x − min(X)) / (max(X) − min(X)) − 1.

While the min–max normalization maps the data to values between 0 and 1, the adjusted min–max normalized data take values between −1 and 1. This normalization is automatically performed for all continuous variables if enabled in the KNN node. Additionally, to calculate distances, all variables have to be numeric. To ensure this, the categorical variables are automatically transformed into numerical ones by the KNN node, by dummy coding the categories of these variables with integers. If a variable has c classes, then c new dummy variables are added to the dataset, each representing one of the c classes. The values of these dummies are 0 except the one
representing the actual class of the considered data point. To give an easy example, let us consider a variable “gender” with two classes “M” and “F”. Then two dummy (or indicator) variables are added, one for the class “M” and one for “F”. Assume a data point has the gender entry “M”. Then the dummy variable representing the class “M” has the value 1, whereas the value of the dummy variable representing “F” is 0. Another way to address the feature-scaling problem is the use of weights for the features. More important features should have a higher influence on the distance, and thus, on the prediction. So, the features are weighted, e.g., by their importance. With this technique, an input variable that has high prediction power receives a larger weight. This method is provided by the SPSS Modeler; it weights the features with the predictor importance, see IBM (2019b).
Dimension Reduction and the Curse of Dimensionality
The curse of dimensionality describes the phenomenon whereby in high-dimensional space the Euclidean distance of a data point to all other points is nearly the same. In high dimensions, multiple variables contribute to the distance and thus balance each other out, so all data points can be thought of as lying on a surface around the query point. See Beyer et al. (1999). So when dealing with high-dimensional data (data with more than ten dimensions), a data reduction process is usually performed prior to applying the kNN algorithm. Often a dimensional reduction technique, such as PCA (see Sect. 6.3), is pre-applied to the data, in order to reduce the feature dimensions and consolidate the variables into fewer, but more meaningful, features. See Sect. 8.7.4 for an example of this approach. Another option is to exclude “unimportant” input variables and only consider the most relevant variables in the model. This process is also provided by the SPSS Modeler and can be performed during the training of the model, whereas a PCA, for example, has to be done before executing the kNN method. Besides the curse of dimensionality, another reason to reduce dimensions is the huge amount of resources, time, and memory that the calculations would otherwise consume during the prediction process. There are a couple of algorithms that are more efficient, but when predicting the class of a new data point, the distance to all other data points has to be calculated, which can result in a huge number of computer operations and can then lead to memory problems or simply take very long. Fewer dimensions reduce the computational time and resources needed and keep the prediction efficient. We will later see that this is a real issue for the SPSS Modeler. For further information on the kNN method, we refer the reader to Lantz (2013) and Peterson (2009).
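To make the normalization and dummy-coding steps concrete, here is a small Python sketch using pandas and scikit-learn. The customer values are invented in the spirit of Fig. 8.269; MinMaxScaler with feature_range=(-1, 1) corresponds to the adjusted min–max formula above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented bank-customer data in the spirit of Fig. 8.269.
df = pd.DataFrame({"Age": [21, 34, 45],
                   "Income": [2000, 3500, 5000],
                   "Credit_cards": [5, 6, 2],
                   "Gender": ["M", "F", "M"]})

numeric = ["Age", "Income", "Credit_cards"]
minmax   = MinMaxScaler(feature_range=(0, 1)).fit_transform(df[numeric])   # values in [0, 1]
adjusted = MinMaxScaler(feature_range=(-1, 1)).fit_transform(df[numeric])  # values in [-1, 1]

# Dummy (indicator) coding of the categorical variable "Gender":
# one column per class, 1 for the observed class and 0 otherwise.
dummies = pd.get_dummies(df["Gender"], prefix="Gender")

print(adjusted)
print(dummies)
```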
8.7.2
Building the Model with SPSS Modeler
A k-nearest neighbor classifier (kNN) can be trained with the KNN node in the SPSS Modeler. Here, we show how this node is utilized for the classification of wine data. This dataset contains chemical analysis data on three Italian wines, and the goal is to identify the wine based on its chemical characteristics.
Description of the model
Stream name: k-nearest neighbor—wine
Based on dataset: Wine_data.txt (see Sect. 12.1.39)
Important additional remarks: The KNN node is sensitive to variable names. In order for all the graphs to work properly, we recommend avoiding special characters and blanks in the variable names.
Related exercises: All exercises in Sect. 8.7.5
1. First, we open the template stream “021 Template-Stream wine” (see Fig. 8.270) and save it under a different name. The target variable is called “Wine” and can take values 1, 2, or 3, indicating the wine type. This template stream already contains a Type node, in which the variable “Wine” is set as the target variable and its measurement type is set to nominal. See Fig. 8.271. Additionally, a Partition node is already included that splits the wine data into a training set (70%) and a test set (30%), so a proper validation of the kNN is provided. See Sect. 2.7.7 for a description of the Partition node.
Fig. 8.270 Template stream of the wine data
Fig. 8.271 Type node in the wine template stream
2. Next, we add a KNN node to the stream and connect it to the Partition node. After opening the KNN node, the model options can be set. 3. In the Objectives tab, we define the purpose of the KNN analysis. Besides the usual prediction of a target field, it is also possible to identify just the nearest neighbors, to get an insight into the data and maybe find the best data representatives to use as training data. Here, we are interested in predicting the type of wine and thus select the “Predict a target field” option. See Fig. 8.272. Furthermore, predefined settings can be chosen in this tab, which are related to the performance properties, speed, and accuracy. Switching between these options changes the settings of the KNN, in order to provide the desired performance. When changing the settings manually, the option changes automatically to “Custom analysis”. 4. In the Fields tab, the target and input variables can be selected. See Fig. 8.273 for the variable role definition with the wine data. If the roles of the fields are already defined in a Type node, the KNN node will automatically identify them. 5. The model parameters are set in the Settings tab. If a predefined setting is chosen in the Objectives tab, these are preset according to the choice of objective.
Fig. 8.272 Objectives tab in the KNN node
Model
In the Model options, the cross-validation process can be initialized by marking the usual “Use partitioned data” option. See top arrow in Fig. 8.274. Furthermore, feature normalization can be enabled. See bottom arrow in Fig. 8.274. The SPSS Modeler uses the adjusted min–max normalization method; see Sect. 8.7.1 for details and the formula. We also recommend always normalizing the data, since this improves the prediction in almost all cases. If the target and input variables are defined manually in the “Fields” tab, the partition field has to be specified in this tab as well, in order to train the model on the training data only. Here, we select the “Partition” variable to identify the data of the training and test set. See middle arrow in Fig. 8.274.
Neighbors
In the Neighbors view, the value of k, i.e., the number of neighbors, is specified along with the metric used to calculate the distances. See Fig. 8.275. The KNN node provides an automatic selection of the “best” k. For this, a range for k has to be defined, in which the node identifies the “optimal” k, based on the classification error. The value with the lowest error rate is then chosen as k. Detection of the best
Fig. 8.273 Variable selection in the KNN node
k is done either by feature selection or cross-validation. This depends on whether or not the feature selection option is requested in the Feature Selection panel. Both options are discussed later.
The method used to detect the best k depends upon whether or not the feature selection option is requested in the Feature Selection panel.
1. If the feature selection method is enabled, then the model will be built by identification and inclusion of the best features for each k. 2. If feature selection is not in effect, a V-fold cross-validation (see subsequent section) will be performed for each k, in order to identify the optimal number of neighbors.
In both cases, the k with the lowest error rate will be chosen as the size of the neighborhood. A combination of both options cannot be selected, due to performance issues. These options are described in the introduction within their panels.
Fig. 8.274 Model setting and normalization are enabled in the KNN node
The number of neighbors can also be fixed to a specific value. See the framed field in Fig. 8.275. Here, we choose the automatic selection of k, with a range between 3 and 5. As an option, the features, i.e., the input variables, can be weighted by their importance, so more relevant features have a greater influence on the prediction, in order to improve accuracy. See Sect. 8.7.1 for details. When this option is selected, see bottom arrow in Fig. 8.275, the predictor importance is calculated and shown in the model nugget. We therefore select this option in our example with the wine data.
Feature Selection
In the Feature Selection panel, feature selection can be enabled in the model training process. See arrow in Fig. 8.276. In this process, features are added one by one to the model, until one of the stopping criteria is reached. More precisely, the feature that reduces the error rate most is included next. The stopping criterion is either a predefined maximum number of features or a minimum error rate improvement.
Fig. 8.275 The neighborhood and metric are defined in the KNN node
See Fig. 8.276. Feature selection has the advantage of reducing dimensionality and focusing on a subset of the most relevant input variables. See Sect. 8.7.1 for further information on the “curse of dimensionality”. We exclude feature selection from the model training process, as we intend to use the cross-validation process to find the optimal k.
Cross-Validation
This panel defines the setting of the cross-validation process for finding the best k in a range. The method used here is V-fold cross-validation, which randomly separates the dataset into V segments of equal size. Then V models are trained, each on a different combination of V − 1 of these subsets, and the remaining subset is not included in the training but is used as a test set. The V error rates on the test sets are then aggregated into one final error rate. V-fold cross-validation gives very reliable information on model performance.
Fig. 8.276 Feature selection options in the KNN node
"
The process of V-fold cross-validation entails the following:
1. The dataset is randomly separated into V equally sized subsets, called folds.
2. These subsets are rearranged into V training and test set combinations. Thus, each of the subsets is treated as a test set once, while the other V − 1 subsets form the training set in one of these new combinations. This is demonstrated in Fig. 8.277, where the gray segments indicate the test data. The remaining white subsets are the training data.
3. A model is trained and tested on each of these new combinations.
4. The error rates on the test sets are then aggregated (averaged) into one final error rate.
Since each data record is used as test data exactly once, V-fold cross-validation eliminates a “good luck” error rate that can occur when splitting the data into training and test sets just once. For this reason, it gives very reliable information on model performance.
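The same idea, trying several values of k and keeping the one with the lowest cross-validated error, can be sketched outside the Modeler with scikit-learn. The built-in wine dataset is used here only as a convenient stand-in for "Wine_data.txt"; node-specific details of the Modeler are not reproduced.

```python
import numpy as np
from sklearn.datasets import load_wine                     # stand-in for the book's wine data
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)
folds = KFold(n_splits=10, shuffle=True, random_state=1)    # 10-fold cross-validation

best_k, best_acc = None, -np.inf
for k in range(3, 6):                                       # candidate neighborhood sizes 3, 4, 5
    model = make_pipeline(MinMaxScaler(feature_range=(-1, 1)),
                          KNeighborsClassifier(n_neighbors=k))
    acc = cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, round(best_acc, 3))   # the k with the lowest cross-validated error rate
```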
6. The splitting method is defined in the Cross-Validation panel. Unless there is a field that splits the data into groups, the typical way to go here is random partitioning into equally sized parts. See Fig. 8.278. If this option is chosen, as in
Fig. 8.277 Visualization of the V-fold cross-validation process. The gray boxes are the test sets and the rest are subsets of the training data in each case
Fig. 8.278 Cross-validation options in the KNN node
Fig. 8.279 Analysis output and evaluation statistics for the KNN wine classifier
our case, the number of folds has to be specified. A ten-fold cross-validation is very common, and we select this procedure in our case too. 7. Now, we run the KNN node and the model nugget appears. In this nugget, the final selection of k can be reviewed. The graphs provided by the nugget, including the final selection of k, are described in the next Sect. 8.7.3, along with more information. 8. To evaluate the model performance, we add the usual Analysis node to the stream and connect it to the model nugget. See Sect. 8.2.6 for information on the Analysis node. The output of this node, with its evaluation statistics, is shown in Fig. 8.279. We see that the KNN model has very high prediction accuracy, since it has an error rate in the training set and test set of only about 2%. The coincidence tables further show that all three wines are equally well-identified from their chemical characteristics.
8.7.3
The Model Nugget
The main tab in the model nugget of the KNN node comprises all the graphs and visualizations from the model finding process. The tab is split into two frames. See Fig. 8.281. In the left frame, a 3-dimensional scatterplot of the data is shown, colored by the target classes. The three axes describe the most important variables in the model. Here, these are “Alcohol”, “Proline”, and “color_intensity”. The scatterplot is interactive and can be spun around with the mouse cursor. Furthermore, the axes variables can be changed, and a click on a data point marks it and its nearest neighbors. A description of how to change the axes would exceed the scope of this book, and we point to IBM (2019b) for details.
The KNN node and the model nugget are pretty sensitive to variable names with special characters or blanks, and they have problems dealing with them. Therefore, in order for all graphs in the model nugget to work properly and for predictions to work properly, we recommend avoiding special characters and blanks in the variable names. Otherwise, the graphic will not be displayed or parts will be missing. Furthermore, predictions with the model nugget might fail and produce errors, as shown in Fig. 8.280.
The second graph in Fig. 8.281 shows error rates for each k considered in the neighborhood selection process. To do this, the “K Selection” option is selected in the drop-down menu at the bottom of the panel. See arrow in Fig. 8.281. In our example of the wine data, the model with k = 4 has the lowest error rate, indicated by the minimum of the curve in the second graphic, and is thus picked as the final model.
Fig. 8.280 Error message when a variable with a special character, such as “(”, is present in the model
Fig. 8.281 Main graphics view with a 3-dimensional scatterplot (left side) and the error rates of the different variants of k (right side)
When selecting the “Predictor Importance” option in the bottom right drop-down menu, the predictor importance graph is shown, as we already know it from the other models. See Fig. 8.282 and Sect. 8.3.4 for more information on this chart. In the 4-nearest neighbor model on the wine data, all the variables are almost equally relevant, with “Alcohol”, “Proline”, and “Color_intensity” being the top three most important in the model. There are a couple of views in the drop-down menu on the right side of the main tab, which only display graphs and tables if a data point is marked in the scatterplot on the left side, as mentioned above. These views are:
• Peer: Visualization of the marked point and its k nearest neighbors on every predictor variable and the target. The shown predictor variables can be selected via the “Select Predictors” button located at the bottom of the view. See Fig. 8.283, where a data point is marked in the left scatterplot and the Peer Chart is shown on the right side.
• Neighbor and Distance Table: This table shows the distances of the marked point to each of the k nearest neighbors.
• Quadrant Map: Like the Peer view, but it displays the marked point and its k nearest neighbors for a predictor variable on the x-axis against the target variable on the y-axis. Furthermore, a reference line is drawn that symbolizes the mean of the variable over the training partition. The “Select Predictors” button located at the bottom of the view enables the selection of the predictor variables that are used in the graph. See Fig. 8.284 for an example of this view.
Fig. 8.282 Predictor importance graph in the KNN model nugget
Fig. 8.283 Peer chart on the right side for the marked data point with its k nearest neighbors
Fig. 8.284 Quadrant Map which displays the target by predictor variable for the marked data point and k nearest neighbors
The last two views, which can be chosen through the drop-down menu, are the Classification Table and the Error summary, which display the confusion matrix and the error rates of every partition, respectively. For more detailed information on these graphs and on features beyond the description here, we refer interested readers to IBM (2019b).
8.7.4
Dimensional Reduction with PCA for Data Preprocessing
As mentioned in Sect. 8.7.1, dimensional reduction is a common preprocessing step prior to applying kNN to data. Here, we present this procedure on the multidimensional leukemia gene expression data, which comprises 851 different variables. Our goal is to build a nearest neighbor classifier that can differentiate between acute
(AML, ALL) and chronic (CML, CLL) leukemia patients, based on their gene expression. Here, we use a PCA to reduce dimensionality, which is the standard way to go in these situations.
Description of the model
Stream name: knn_pca_gene_expression_acute_chronic_leukemia
Based on dataset: gene_expression_leukemia_all.csv (see Sect. 12.1.14)
Important additional remarks: Dimensional reduction is a common way to reduce multidimensional data, saving memory space, increasing computational speed, and counteracting the “curse of dimensionality” problem.
The stream is split into two parts, the data import and target setting part and the dimensional reduction and model building part. As the focus of this section is dimension reduction with the PCA, we keep the target setting part short, as it is unimportant for the purpose of this section. Data Importing and Target Setting 1. We open the template stream “018 Template-Stream gene_expression_leukemia” and save it under a different name. See Fig. 8.154 in a prior exercise solution. The template already consists of the usual Type and Partition nodes, which define the target variable and split the data into 70% training data and 30% test data. 2. We then add a second Type node, a Reclassify node, and a Select node to the stream and insert them between the Source node and the Partition node. In the Type node, we click the “Read Values” button, so that the nodes that follow know the variable measurements and value types. In the Reclassify node, we select the “Leukemia” variable as our reclassification field, enable value replacement at the top, and run the “Get” button. Then, we merge AML and ALL into one group named “acute” and proceed in the same manner with the chronic leukemia types, CML and CLL, by relabeling both “chronic”. See Fig. 8.285. 3. As we are only interested in the treatment of leukemia patients, we add a Select node to exclude the healthy people. See Fig. 8.286 for the formula in the Select node.
Fig. 8.285 Reclassification node, which joins the acute and chronic leukemia types into groups
Fig. 8.286 Select node where the healthy people are excluded from the dataset to get the intended training data
Dimensional Reduction with the PCA/Factor Node Here, we present in brief how to set up a PCA as a preprocessing step. For details on PCA, and a complete description of the node, we refer to Sect. 6.3. 4. To perform dimension reduction with PCA, we add a PCA/Factor node to the stream and connect it to the last Type node. In the PCA/Factor node, we then select all the genomic positions as inputs and the partition indicator variable as the partition field. See Fig. 8.287. In the Model tab, we mark the box that enables the use of partitioned data, so the PCA is only performed on the training data. Furthermore, we select the principal component method, so that the node uses PCA. See Fig. 8.288. In the Expert tab, we choose the predefined “Simple” setup (Fig. 8.289). This ensures a calculation of the first five factors. If more factors are needed, these can be customized under the “Expert” options. Please see Sect. 6.3 for more information on these options. 5. After running the PCA/Factor node, it calculates the first five factors, and we observe in the PCA/Factor model nugget that these five factors explain 41.6% of the data variation. See Fig. 8.290.
Fig. 8.287 Definition of the input and partitioning variables for the leukemia dataset in the PCA/Factor node
Fig. 8.288 Selection of the principal component method in the PCA/Factor node for the leukemia dataset
Fig. 8.289 Expert tab and setup of the PCA factors that are determined by PCA
6. An advantage of dimensional reduction with PCA is the consolidation of multiple variables into fewer, but more meaningful, new variables. With these, we can get an impression of the position of the groups (acute, chronic) and the potential for classification. For that purpose, we add a Plot node to draw a scatterplot of the first two factors and a Graphboard node to draw a 3D scatterplot of the first three factors. In both cases, the data points are colored and shaped according to their group (acute, chronic). These plots are shown in Figs. 8.291 and 8.292, respectively, and we immediately see that the two groups are largely separated, especially in the 3D graph, which indicates that a classification might be feasible.
Total Variance Explained (the extraction sums of squared loadings equal the initial eigenvalues)
Component   Total     % of Variance   Cumulative %
1           135.110   15.877          15.877
2           102.696   12.068          27.944
3            42.395    4.982          32.926
4            40.424    4.750          37.676
5            33.391    3.924          41.600
Fig. 8.290 Variance explained by the first five factors calculated by PCA, on the gene expression dataset
Fig. 8.291 Scatterplot of the first two factors determined via PCA for the leukemia gene expression dataset
Fig. 8.292 3-D scatterplot of the first three factors determined by PCA for the leukemia gene expression data
kNN on the Reduced Data
7. Now, we are ready to build a kNN model on the five factors established with PCA. Before we do so, we have to add another Type node to the stream, so that the KNN node is able to read the measurement and value types of the factor variables. 8. We finally add a KNN node to the stream and connect it to the last Type node. In the KNN node, we select the 5 factors calculated by the PCA as input variables and the “Leukemia” field as the target variable. See Fig. 8.293. In the Settings tab, we select the Euclidean metric and set k to 3. See Fig. 8.294. This ensures a 3-nearest neighbor model. 9. Now we run the stream and the model nugget appears. We connect an Analysis node to it and run it to get the evaluation statistics of the model. See Sect. 8.2.6 for a description of the options in the Analysis node. See Fig. 8.295 for the output from the Analysis node. As can be seen, the model has high prediction accuracy for both the training data and the test data. Furthermore, the Gini values are extremely high. So, we conclude that the model fits the reduced data well and is able to distinguish between acute and chronic leukemia, with a minimal error rate, from the factor variables alone, which explain just 41.6% of the variance. The error rate can probably be reduced further by adding more factors, calculated by PCA, as input variables to the model.
Fig. 8.293 The factors that were generated by the PCA are selected as input variables in the KNN node to build a model on the dimensionally reduced data
10. To sum up, the PCA reduced the multidimensional data (851 features) into a smaller dataset with only five variables. This smaller data, with 170 times fewer data entries, contains almost the same information as the multidimensional dataset, and the kNN model has very high prediction power on the reduced dataset. Furthermore, the development and prediction speed have massively increased for the model trained on the smaller dataset. Training and evaluation of a kNN model on the original and multidimensional data takes minutes, whereas the same process requires only seconds on the dimensionally reduced data (five factors), without suffering in prediction power. The smaller dataset obviously needs less memory too, which is another argument for dimension reduction.
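The preprocessing-plus-model chain of this section, PCA down to five factors followed by a 3-nearest-neighbor classifier, can be expressed compactly in code. The arrays below are random placeholders standing in for the 851 gene-expression features and the acute/chronic label; the sketch shows the structure of the pipeline, not the book's results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Random placeholder data: 300 patients, 851 gene-expression features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 851))
y = rng.integers(0, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Reduce to five principal components, then classify with k = 3 neighbors.
model = make_pipeline(PCA(n_components=5),
                      KNeighborsClassifier(n_neighbors=3, metric="euclidean"))
model.fit(X_train, y_train)
print("Test accuracy:", round(model.score(X_test, y_test), 3))
```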
8.7.5
Exercises
Exercise 1: Identification of Nearest Neighbors and Prediction of Credit Rating Consider the following credit rating dataset in Fig. 8.296, consisting of nine bank customers, with information on their age, income, and number of credit cards held. Figure 8.297 comprises a list of four bank customers who have received no credit rating yet.
Fig. 8.294 Metric and neighborhood size selection in the KNN node for the model on the dimensionally reduced data
In this exercise, the credit rating for these four new customers should be established with the k-nearest neighbor method. 1. Normalize the features of the training set and the new data, with the min–max normalization method. 2. Use the normalized data to calculate the Euclidian distance between the customers John, Frank, Penny, and Howard, and each of the customers in the training set. 3. Determine the 3-nearest neighbors of the four new customers and assign a credit rating to each of them. 4. Repeat steps two and three, with the City-block metric as your distance measure. Has the prediction changed? Exercise 2: Feature Selection Within the KNN Node In Sect. 8.7.2, a kNN classifier was trained on the wine data with the KNN node. There, a cross-validation approach was used to find the best value for k, which turned out to be 4. In this exercise, the 4-nearest neighbor classifier is revisited, which is able to identify the wine based on its chemical characteristics. This time, however, the feature selection method should be used. More precisely,
Fig. 8.295 Evaluation statistics of the kNN on the dimensionally reduced leukemia gene expression data
Fig. 8.296 The training dataset of bank customers who already have a credit rating
Fig. 8.297 List of four new bank customers that need to be rated
1. Build a 4-nearest neighbor model as in Sect. 8.7.2 on the wine data, but enable the feature selection method, to include only a subset of the most relevant variables in the model. 2. Inspect the model nugget. Which variables are included in the model? 3. What is the accuracy of the model for the training data and test data? Exercise 3: Dimensional Reduction and the Influence of k for Imbalanced Data Consider the dataset “gene_expression_leukemia_all”, which contains genomic sample data from several patients suffering from one of four different types of leukemia (ALL, AML, CLL, CML) and data from a small healthy control group (see Sect. 12.1.16). The gene expression data are measured at 851 locations in the human genome and correspond to known cancer genes. The goal is to build a kNN classifier that is able to separate the healthy patients from the ill patients. Perform a PCA on the data in order to reduce dimensionality. 1. Import the dataset “gene_expression_leukemia_all.csv” and merge the data from all leukemia patients (AML, ALL, CML, CLL) into a single “Cancer” group. What is the frequency of both the leukemia and the healthy data records in the dataset? Is the dataset imbalanced? 2. Perform a PCA on the gene expression data of the training set. 3. Build a kNN model on the factors calculated in the above step. Use the automatic k selection method with a range of 3–5. What is the performance of this model? Interpret the results. 4. Build three more kNN models for k equals to 10, 3, and 1, respectively. Compare these three models and the one from the previous step with each other. What are the evaluation statistics? Draw the ROC curves. Which is the best performing model from these four and why? Hint: Use the stream of Sect. 8.7.4 as a starting point.
8.7.6
Solutions
Exercise 1: Identification of Nearest Neighbors and Prediction of Credit Rating 1. Figure 8.298 shows the min–max normalized input data, i.e., age, income, number of credit cards. The values 0 and 1 indicate the minimum and maximum
Fig. 8.298 Normalized values of the input data
Fig. 8.299 Euclidean distance between John, Frank, Penny, and Howard and all the other bank customers
values. As can be seen, all variables are now located in the same range, and thus the effect of the large numbers and differences in Income is reduced.
2. The calculated Euclidean distance between customers John, Frank, Penny, and Howard and each of the other customers is shown in Fig. 8.299. As an example, we show here how the distance between John and Sandy is calculated:
$$d(\text{John}, \text{Sandy}) = \sqrt{(0 - 0.29)^2 + (0.17 - 0.44)^2 + (0.67 - 0.67)^2} \approx 0.4.$$
3. The 3-nearest neighbors for each new customer (John, Frank, Penny, Howard) are highlighted in Fig. 8.299. For example, John's 3-nearest neighbors are Andrea, Ted, and Walter, with distances 0.3, 0.31, and 0.27.
Fig. 8.300 Final credit rating for John, Frank, Penny, and Howard
Fig. 8.301 City-block distance between John, Frank, Penny, and Howard and all the other bank customers
Counting the credit ratings of these three neighbors for each of the four new customers, we get, by majority voting, the credit ratings listed in Fig. 8.300.
4. Figure 8.301 displays the City-block distances between John, Frank, Penny, and Howard and each of the training samples. The colored values symbolize the 3-nearest neighbors in this case. As an example, here we calculate the distance between John and Sandy:
$$d(\text{John}, \text{Sandy}) = |0 - 0.29| + |0.17 - 0.44| + |0.67 - 0.67| = 0.56.$$
We see that the nearest neighbors have not changed compared with the Euclidean distance. Hence, the credit ratings are the same as before and can be viewed in Fig. 8.300.
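The normalization and the two distance computations from steps 1–4 can be reproduced outside the Modeler with a few lines of Python. This is only a sketch for checking the numbers: the normalized vectors for John and Sandy are taken from Fig. 8.298, and the raw values would first be passed through the normalization function together with the training data.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to the range [0, 1] (min-max normalization)."""
    X = np.asarray(X, dtype=float)
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Normalized feature vectors (age, income, number of credit cards), cf. Fig. 8.298.
john = np.array([0.00, 0.17, 0.67])
sandy = np.array([0.29, 0.44, 0.67])

euclidean = np.sqrt(np.sum((john - sandy) ** 2))  # approx. 0.4, cf. step 2
city_block = np.sum(np.abs(john - sandy))         # 0.56, cf. step 4
print(round(euclidean, 2), round(city_block, 2))
```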
Fig. 8.302 Stream of the wine identification exercise with feature selection in the KNN node
Exercise 2: Feature Selection Within the KNN Node
Name of the solution stream: wine_knn_feature_selection
Theory discussed in: Sect. 8.2, Sect. 8.7.1, Sect. 8.7.2
Figure 8.302 shows the final stream for this exercise.
1. Since this is the same stream construct as in Sect. 8.7.2, we open the stream "k-nearest neighbor—wine" and save it under a different name. We recall that this stream imports the wine data, defines the target variable and measurements, splits the data into 70% training data and 30% test data, and builds a kNN model that automatically selects the best k; in this case k = 4. This is the reason why we perform a 4-nearest neighbor search in this exercise. To enable a feature selection process for a 4-nearest neighbor model, we open the KNN node and go to the Neighbors panel in the Settings tab. There, we fix k to 4, by changing the automatic selection option to "specify fixed k" and writing "4" in the corresponding field. See Fig. 8.303. In the Feature Selection panel, we now activate the feature selection process by checking the box at the top. See top arrow in Fig. 8.304. This enables all the options in this panel, and we can choose the stopping criterion: either stop when a maximum number of features has been included in the model, or stop when the error rate cannot be lowered by more than a minimum value. We choose the second stopping criterion and set the minimum improvement required for each inserted feature to 0.01. See Fig. 8.304. Activation of the feature selection process disables the cross-validation procedure, and all options in the Cross-Validation panel are shown as grayed out. This finishes the option setting, and we run the stream.
2. We open the model nugget and observe first that the scatterplot on the right has changed, compared with Fig. 8.281, due to the different axes. See Fig. 8.305. In the right drop-down menu, we select the "Predictor Selection" option, and the visualization of the feature selection process is shown. The curve in the graph shows the error rate for each further feature added to the model. The corresponding name of the added variable is printed next to the point in the graph. See Fig. 8.305. In this case, seven features are included in the model. These are, in order of their inclusion in the model, "Flavanoids", "Color_intensity", "Alcohol", "Magnesium", "Proline", "Proanthocyanins", and "Ash".
Fig. 8.303 k is defined as 4 in the KNN node for the wine selection model
Since not all variables are considered in the model, the predictor importance has changed, which we already recognized by the axes variables in the 3-D scatterplot. The importance of the variables in this case is shown in Fig. 8.306.
3. The output of the Analysis node is displayed in Fig. 8.307. As can be seen, the accuracy has not changed much, since the performance of the model in Sect. 8.7.2 was already extremely precise. The accuracy has thus not suffered from the feature selection, while the dimensionality of the data has been halved, which reduces memory requirements and speeds up prediction.
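The effect of a forward feature selection wrapped around a 4-nearest neighbor classifier can also be sketched with scikit-learn on the wine data shipped with that library. Note that the KNN node drives its selection by the error rate with a minimum-improvement stopping rule, whereas the sketch below selects a fixed number of seven features by cross-validated accuracy, so the chosen subset need not match the one above exactly.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True, as_frame=True)

# Forward selection of 7 features for a 4-nearest neighbor classifier,
# scored by 5-fold cross-validated accuracy.
selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=4),
    n_features_to_select=7, direction="forward", cv=5)
selector.fit(MinMaxScaler().fit_transform(X), y)

print(list(X.columns[selector.get_support()]))  # names of the selected features
```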
Exercise 3: Dimensional Reduction and the Influence of k on Imbalanced Data
Name of the solution stream: gene_expression_leukemia_knn_imbalanced_data
Theory discussed in: Sect. 8.2, Sect. 8.7.1, Sect. 8.7.4, Sect. 6.3.1
Fig. 8.304 Activation of the feature selection procedure in the KNN node
Fig. 8.305 Model view in the model nugget and visualization of the error rate curve in the feature selection process
Fig. 8.306 Predictor importance in the case of a feature selection KNN model on the wine data
Fig. 8.307 Evaluation statistics in the 4-nearest neighbor model with feature selection on the wine data
Figure 8.308 shows the final stream for this exercise. 1. Since we are dealing here with the same data as in Sect. 8.7.4, where a kNN model was also built on PCA processed data, we use this stream as a starting point. We therefore open the stream “knn_pca_gene_expression_acute_chronic_leukemia” and save it under a different name. See Fig. 8.309 for the stream in Sect. 8.7.4. This stream has already imported the proper dataset, partitioned it into a training set and a test set, and reduced the data with PCA. See Sect. 8.7.4 for details. Starting from here, we first need to change the reclassification of target values. For that purpose, we delete the Select node and connect the Distribution node to the Reclassify node. The latter is then opened and the value of the “Leukemia” variable representing all leukemia types (AML, ALL, CML, CLL) is set to “Cancer”. See Fig. 8.310. Next, we run the Distribution node, to inspect the frequency of the new values of the “Leukemia” variable. As can be seen in Fig. 8.311, the dataset is highly imbalanced, with the healthy data being the minority class, comprising only 5.73% of the whole data. 2. The PCA/Factor node is already included in the stream, but since the healthy patients are now added to the data, unlike in the stream from Fig. 8.309, we have
Fig. 8.308 Stream of the kNN leukemia and healthy patient identification exercise
Fig. 8.309 The stream “knn_pca_gene_expression_acute_chronic_leukemia” from Sect. 8.7.4
Fig. 8.310 Reclassification of leukemia types into a general cancer group
Fig. 8.311 Distribution plot of the reclassified “Leukemia” variable for the gene expression data
Fig. 8.312 Explained variation in the first five components established by PCA. The extracted values (the initial eigenvalues and the extraction sums of squared loadings coincide) are:

Component   Total      % of Variance   Cumulative %
1           131.662    15.471          15.471
2           103.540    12.167          27.638
3            44.943     5.281          32.919
4            42.210     4.960          37.879
5            32.676     3.840          41.719
Fig. 8.313 Scatterplot matrix for all five factors calculated by PCA on the gene expression data
to run the PCA/Factor node again to renew the component computations. In the PCA/Factor model nugget, we can then view the variance in the data explained by the first five components. This is 41.719%, as shown in Fig. 8.312. To get a feeling for the PCA components, we add a Graphboard node to the PCA/Factor model nugget and plot a scatterplot matrix for all five factors. See Fig. 8.313. We notice that in all dimensions, the healthy patient data seems to be
Fig. 8.314 Automatic selection of k is activated in the KNN node for the leukemia classification model
located in a cluster. This is a slight indicator that classification will be successful; a clearer statement is not possible, however. 3. Now, we open the KNN node and switch the fixed definition of k to an automatic calculation, with a range between 3 and 5. See Fig. 8.314. We now run the KNN node to update the model nugget, in which the selection process of k can be viewed. Figure 8.315 shows the graph in the nugget related to this process, and we observe that four neighbors result in the lowest error in the training data. Afterwards, we run the connected Analysis node to view the accuracy and Gini values. These statistics are presented in Fig. 8.316, and we observe that accuracy is extremely high, as is the Gini, and at first glance this suggests a good model. When inspecting the coincidence matrix, however, we see that only 20% of the healthy patients are classified correctly. See arrow in Fig. 8.316. This issue results from the imbalance of the data. The minority class "Healthy" is ignored in favor of the majority class "Cancer".
Fig. 8.315 Visualization of the k selection process for the KNN leukemia classifier
Fig. 8.316 Evaluation statistics in the KNN leukemia classifier, with automatic neighborhood selection
Fig. 8.317 A 10-nearest neighbor model is defined in the KNN node
4. To build a 10-nearest neighbor model, we copy the existing KNN node and paste it onto the Modeler canvas. Then, we connect it to the last Type node. In the node itself, we fix the number of neighbors to 10, as shown in Fig. 8.317. To build the 3-nearest and 1-nearest neighbor models, we proceed in the same manner. Afterwards, we run all three models (10-, 3-, and 1-nearest neighbor). To provide a clear overview and make comparison of the models easier, we connect all KNN model nuggets into a series. See Fig. 8.147 as an example of how this is performed. Next, we add a Filter node to the end of the stream, to rename the predictor fields of the models with proper names. We then add an Analysis node and Evaluation node to the stream and connect them to the Filter node. See Sects. 8.2.6 and 8.2.7 for options within these two nodes. Figure 8.318 shows the accuracy and Gini, as calculated by the Analysis node. We notice that, according to the Gini and accuracy measures, the models improve as the value of k is reduced. The smaller the neighborhood, the better the model prediction. Thus, the statistics indicate that a 1-nearest neighbor
Fig. 8.318 Accuracy and Gini for the four kNN models that predict leukemia based on gene expression data
classifier is the best one for this data. This is evidenced by the perfect classification of this model, i.e., an accuracy of 100%. The analysis further shows that the automatic k selection method output is not always the best solution. k = 3 may be more appropriate than k = 4 in this situation. Improvement of the models is also visualized by the ROC curves in Fig. 8.319. The coincidence matrices can also confirm improvement in the prediction performance of models with a smaller k. A smaller k brings more attention to the absolute nearest data points, and hence to the minority class. Hence, misclassification of "Healthy" patients is reduced when k is lowered. See Fig. 8.320.
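The influence of k on the minority class can be reproduced on synthetic data with scikit-learn. The sketch below uses an artificial two-class dataset with roughly the same imbalance as above (not the leukemia data itself) and reports the recall of the minority class for several neighborhood sizes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score

# Artificial, highly imbalanced two-class data (roughly 6% minority class).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.94],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=1)

for k in (10, 4, 3, 1):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    recall = recall_score(y_te, knn.predict(X_te))  # recall of minority label 1
    print(f"k = {k:2d}: minority class recall = {recall:.2f}")
```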
Fig. 8.319 ROC curves of the four kNN models that predict leukemia based on gene expression data
8.8
Decision Trees
We now turn to the rule-based classification methods, of which decision trees (DT) are the most famous group of algorithms. In comparison with the classification methods discussed so far, rule-based classifiers have a completely different approach; they inspect the raw data for rules and structures that are common in a target class. These identified rules then become the basis for decision making. This approach is closer to real-life decision making. Consider a situation where we plan to play tennis. The decision on whether to play tennis or not depends, for example, on the weather forecast. We will go to the tennis court "if it is not raining and the temperature is over 15 °C". If "rainy weather or a temperature of less than 15 °C" is forecast, we decide to stay at home. These rules are shown in Fig. 8.321, and we recall Sect. 8.2.1 for a similar example. A decision tree algorithm now constructs and represents this logical structure in the form of a tree, see Fig. 8.322. The concept of finding appropriate rules within the data is discussed in the next section. Then the building of a DT model with the SPSS Modeler is presented. Rule-based classifiers have the advantage of being easily understandable and interpretable without statistical knowledge. For this reason, DTs, and rule-based classifiers in general, are widely used in fields where the decision has to be transparent, e.g., in credit scoring, where the scoring result has to be explained to the customer.
Fig. 8.320 Coincidence matrices on the four kNN models that predict leukemia based on gene expression data
Fig. 8.321 Rules for making the decision to play tennis
8.8.1
Theory
Decision trees (DT) belong to the group of rule-based classifiers. The main characteristic of a DT lies in how it orders the rules in a tree structure. Looking again at the tennis example, where a decision is made based upon the weather, the logical rule on which the decision will be based is shown in Fig. 8.321. In a DT, these rules are represented in a tree, as can be seen in Fig. 8.322.
Fig. 8.322 Decision tree for the playing tennis example
A DT is like a flow chart. A data record is classified by going through the tree, starting at the root node, deciding in each tree node which of the conditions the particular variable satisfies, and then following the branch of this condition. This procedure is repeated until a leaf is reached that brings the combination of decisions made thus far to a final classification. For example, let us consider an outlook "No rain", with a temperature of 13 °C. We want to make our decision to play tennis using the DT in Fig. 8.322. We start at the root node, see that our outlook is "No rain", and follow the left branch to the next node. This node looks at the temperature variable: is it larger or smaller than 15 °C? As the temperature in our case is 13 °C, we follow the branch to the right and arrive at a "No" leaf. Hence, in our example, we decide to stay home as the temperature is too cold. As can be seen in this short example, our decision to cancel our plans to play tennis was easily traced back to the cold weather. This shows the great advantage of DTs, and rule-based classifiers in general: the cause of a certain decision can be reconstructed and is interpretable without statistical knowledge. This is the reason for the popularity of DTs in a variety of fields and problems where the classification has to be interpretable and justifiable to other people.
The Tree Building Process
Decision trees are built with a recursive partitioning mechanism. At each node, the data is split into two distinct groups by the values of a feature, resulting in subsets, which are then split again into smaller subsets, and so on. This process of splitting a problem into smaller and smaller subproblems of the same type is also known as the divide and conquer approach. See Cormen (2009) for a detailed description of the divide and conquer paradigm.
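As a small illustration of such a traversal (a Python sketch, not part of the Modeler), the tree of Fig. 8.322 can be written as nested conditions; classifying a day is simply a walk from the root to a leaf:

```python
def play_tennis(outlook: str, temperature_celsius: float) -> str:
    """Traverse the decision tree of Fig. 8.322: the root node checks the
    outlook, the next node checks the temperature split at 15 degrees C."""
    if outlook == "Rain":
        return "No"                    # rainy weather -> stay at home
    if temperature_celsius > 15:
        return "Yes"                   # no rain and warm enough -> play tennis
    return "No"                        # no rain but too cold -> stay at home

print(play_tennis("No rain", 13))      # "No", as in the walk-through above
```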
The recursive process of building a DT is described below:
1. The DT consists of a single root node.
2. In the root node, a variable and a partitioning of its values are determined with the "best" partitioning ability, that is, partitioning into groups of similar target classes. There are several methods to select the "best" variable, and they are described in the section hereafter.
3. The data is split into distinct subgroups, based on the previously chosen splitting criteria.
4. These new data subsets are the basis for the nodes in the first level of the DT.
5. Each of these new nodes again identifies the "best" variable for partitioning the subset and splits it according to the splitting criteria. The new subsets are then the basis for the splitting of the nodes in the second level of the DT.
6. This node partitioning process is repeated until a stopping criterion is fulfilled for that particular node. Stopping criteria are:
• All (or almost all) training data in the node are of the same target category.
• The data can no longer be partitioned by the variables.
• The tree has reached its predefined maximum level.
• The node has reached the minimum occupation size.
7. A node that fulfills one of the stopping criteria is called a leaf and indicates the final classification. The target category that occurs most often in the subset of the leaf is taken as the predictor class. This means each data sample that ends up in this leaf when passed through the DT is classified as the majority target category of this leaf.
In Fig. 8.323, node partitioning is demonstrated with the "play tennis" DT example, see Fig. 8.322, where the circles represent days we decide to play tennis, and the rectangles indicate days we stay at home. In the first node, the variable "Outlook" is chosen and the data are divided into the subsets "No rain" and "Rain". See the first graph in Fig. 8.323. If the outlook is "Rain", we choose not to play tennis, while in the case of "No rain", we can make another split on the values of the temperature. See right graph in Fig. 8.323. Recalling Fig. 8.322, this split is done at 15 °C, whereby on a hotter day, we decide to play a tennis match. As can be seen in the above example and Fig. 8.323, a DT is only able to perform axis-parallel splits in each node. This divides the data space into several rectangles, where each of them is assigned to a target category, i.e., the majority category in that rectangle.
Pruning
If the stopping criteria are set too narrowly, the finished DT is very small and underfitting is likely. In other words, the tree is unable to describe particular
Fig. 8.323 Illustration of the data space partitioning by a DT for the “play tennis” example
structures in the data, as the criteria are too vague. Underfitting can occur, for example, if the minimum occupation size of each node is set too large, or the maximum number of levels is set too small. On the other hand, if the stopping criteria are too broad, the DT can continue splitting the training data until each data point is perfectly classified, and then the tree will be overfitting. The constructed DT is typically extremely large and complex. Several pruning methods have therefore been developed to solve this dilemma, originally proposed by Breiman et al. (1984). The concept is basically the following: instead of stopping the tree growth at some point, e.g., at a maximum tree level, the tree is overconstructed, allowing overfitting of the tree. Then, nodes and "subbranches" that do not contribute to the overall accuracy are removed from the overgrown tree. Growing the tree to its full size and then cutting it back is usually more effective than stopping at a certain point, since determination of the optimal tree depth is difficult without growing it first. As this method allows us to better identify important structures in the data, this pruning approach generally improves the generalization and prediction performance. For more information on the pruning process and further pruning methods, see Esposito et al. (1997).
Decision Tree Algorithms and Separation Methods
There are numerous implementations of decision trees that mainly differ in the splitting mechanism, that is, the method of finding the optimal partition and the number of new nodes that can be grown from a single node. We outline below the most well-known decision tree algorithms (also provided by the SPSS Modeler) and their splitting algorithms:
CART (Classification and Regression Tree)
The CART is a binary splitting tree. That means each non-leaf node has exactly two outgoing branches. Furthermore, it provides the pruning process as described above, to prevent over- and underfitting. The split in each node is selected with the Gini coefficient, sometimes also called the Gini index. The Gini coefficient is an impurity measure and describes the dispersion of a split. It should not be confused with the Gini index that measures the performance of a classifier, see Sect. 8.2.5. The Gini coefficient at node σ is defined as
$$\mathrm{Gini}(\sigma) = 1 - \sum_{j} \left( \frac{N(\sigma, j)}{N(\sigma)} \right)^{2},$$
where j is a category of the target variable, N(σ, j) the number of data in node σ with category j, and N(σ) the total number of data in node σ. In other words, N(σ, j)/N(σ) is the relative frequency of category j among the data in node σ. The Gini coefficient reaches its maximum when the data in the node are equally distributed across the categories. If, on the other hand, all data belong to the same category in the node, then the Gini equals 0, its minimum value. A split is now measured with the Gini Gain
$$\mathrm{GiniGain}(\sigma, s) = \mathrm{Gini}(\sigma) - \frac{N(\sigma_L)}{N(\sigma)} \mathrm{Gini}(\sigma_L) - \frac{N(\sigma_R)}{N(\sigma)} \mathrm{Gini}(\sigma_R),$$
where σ_L and σ_R are the two child nodes of σ, and s is the splitting criterion. The binary split that maximizes the Gini Gain will be chosen. In this case, the child nodes deliver maximal purity with regard to the category distribution, and so it is best to partition the data along these categories. When there are a large number of target categories, the Gini coefficient can encounter problems. The CART therefore provides another partitioning selection measure, called the twoing measure. Instead of trying to split the data so that the subgroups are as pure as possible, the twoing measure also takes equal-sized splits into account, which can lead to more balanced branches. We refrain from a detailed description of this measure within the CART here and refer to Breiman et al. (1984) and IBM (2019a).
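To make these two quantities concrete, the following minimal Python sketch computes the Gini coefficient and the Gini Gain for a toy node; it is purely illustrative and not the Modeler's implementation.

```python
from collections import Counter

def gini(labels):
    """Gini coefficient of a node: 1 - sum_j (N(node, j) / N(node))^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, left, right):
    """Decrease in impurity achieved by splitting 'parent' into 'left' and 'right'."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

# Toy node with 6 "Good" and 4 "Bad" credit ratings and one candidate split.
parent = ["Good"] * 6 + ["Bad"] * 4
left, right = ["Good"] * 5 + ["Bad"] * 1, ["Good"] * 1 + ["Bad"] * 3
print(round(gini(parent), 3), round(gini_gain(parent, left, right), 3))
```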
C5.0
The C5.0 algorithm was developed by Ross Quinlan and is an evolution of his own C4.5 algorithm (Quinlan 1993), which itself originated from the ID3 decision tree (Quinlan 1986). Its splits are not strictly binary, but allow for partitioning of a data segment into more than two subgroups. As with the CART, the C5.0 tree provides pruning after the tree has grown, and the splitting rules of the nodes are selected via an impurity measure. The measure used is the Information Gain of the Entropy. The entropy quantifies the homogeneity of categories in a node and is given by
$$\mathrm{Entropy}(\sigma) = - \sum_{j} \frac{N(\sigma, j)}{N(\sigma)} \log_{2}\!\left( \frac{N(\sigma, j)}{N(\sigma)} \right),$$
where σ symbolizes the current node, j is a category, N(σ, j) the number of data records of category j in node σ, and N(σ) the total number of data in node σ (see the description of CART). If all categories are equally distributed in a node segment, the entropy is maximal, and if all data are of the same class, it takes its minimum value 0. The Information Gain is now defined as
$$\mathrm{InformationGain}(\sigma, s) = \mathrm{Entropy}(\sigma) - \sum_{\sigma_1} \frac{N(\sigma_1)}{N(\sigma)} \mathrm{Entropy}(\sigma_1),$$
with σ_1 being one of the child nodes of σ resulting from the split; the Information Gain measures the change in purity in the data segments of the nodes. The splitting criterion s that maximizes the Information Gain is selected for this particular node. In 2008, the C4.5 was picked as one of the top ten algorithms for data mining (Wu et al. 2008). More information on the C4.5 and C5.0 decision trees can be found in Quinlan (1993) and Lantz (2013).
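Analogously to the Gini sketch above, entropy and Information Gain can be written down in a few lines; again this is only an illustration of the standard formulas, not the C5.0 implementation, and a multiway split is represented here as a list of child nodes.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a node: -sum_j p_j * log2(p_j) with p_j = N(node, j) / N(node)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropies of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["Yes"] * 9 + ["No"] * 5                       # toy node
children = [["Yes"] * 6 + ["No"] * 1,                   # candidate multiway split
            ["Yes"] * 3 + ["No"] * 4]
print(round(entropy(parent), 3), round(information_gain(parent, children), 3))
```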
CHAID (Chi-squared Automatic Interaction Detector)
The CHAID is one of the oldest decision tree algorithms (Kass 1980) and allows splitting into more than two subgroups. Pruning is not provided with this algorithm, however. The CHAID uses the Chi-square independence test (Kanji 2009) to decide on the splitting rule for each node. As the Chi-square test is only applicable to categorical data, all numerical input variables have to be grouped into categories. The algorithm does this automatically. For each input variable, categories are merged into a super-class based on their statistical similarity and kept separate if they are statistically dissimilar. These super-class variables are then compared with the target variable for dependency, i.e., similarity, with the Chi-square independence test. The variable with the highest significance is then selected as the splitting criterion for the node. For more information, see Kass (1980) and IBM (2019a).
QUEST (Quick, Unbiased, Efficient Statistical Tree)
The QUEST algorithm (Loh and Shih 1997) only constructs binary trees and has been specially designed to reduce the execution time of large CART trees. It was furthermore developed to reduce the bias toward input variables that allow more splits, i.e., numeric variables or categorical variables with many classes. For each split, an ANOVA F-test (numerical variable) or Chi-square test (categorical variable), see Kanji (2009), is performed, to determine the association between each input variable and the target. QUEST further provides an automatic pruning method, to avoid overfitting and improve the determination of an optimal tree depth. For more detailed information, see Loh and Shih (1997) and IBM (2019a).
Table 8.14 Decision tree algorithms with the corresponding node in the Modeler

Decision tree   Method                                SPSS Modeler node
CART            Gini coefficient; twoing criterion    C&R Tree
C5.0            Information gain (entropy)            C5.0
CHAID           Chi-squared statistics                CHAID
QUEST           Significance statistics               QUEST
In Table 8.14, the decision tree algorithms, their splitting methods, and the corresponding nodes in the Modeler are displayed. For additional information and more detailed descriptions of these decision trees, we refer the interested reader to Rokach and Maimon (2008).
Boosting
In Sect. 5.3.6, the technique of ensemble modeling, and particularly boosting, was discussed. Since the concept of boosting originated with decision trees and is still mostly used for classification models, we hereby explain the technique once again, but in more detail. Boosting was developed to increase prediction accuracy by building a sequence of models (components). The key idea behind this method is to give misclassified data records a higher weight and correctly classified records a lower weight, to point the focus of the next model in the sequence onto the incorrectly predicted records. With this approach, the focus of the classification problem is shifted to the data records that would usually be lost in the analysis, while the records that are easy to handle and correctly classified anyway are neglected. All component models in the ensemble are built on the entire dataset, and the weighted models are aggregated into a final prediction. This process is demonstrated in Fig. 8.324. In the first step, the unweighted data (circles and rectangles) are divided into two groups. With this separation, two rectangles are located on the wrong side of the decision boundary; hence, they are misclassified. See the circled rectangles in the top right graph. These two rectangles are now more heavily weighted, and all other points are down-weighted. This is symbolized by the size of the points in the bottom left graph in Fig. 8.324. Now, the division is repeated with the weighted data. This results in another decision boundary, and thus, another model. See the bottom right graph in Fig. 8.324. These
Fig. 8.324 Illustration of the boosting method
Fig. 8.325 The concept of boosting. Aggregation of the models and the final model
two separations are now combined through aggregation, which results in perfect classification of all points. See Fig. 8.325. The most common and best-known boosting algorithm is AdaBoost, or adaptive boosting. We refer to Sect. 5.3.6 and Lantz (2013), Tuffery (2011), Wu et al. (2008), Zhou (2012) and James et al. (2013) for further information on boosting and other ensemble methods, such as bagging. Boosting and bagging are provided by the decision tree nodes of the SPSS Modeler (C5.0 node, C&R Tree node, Quest node, CHAID node), which implement the four above-mentioned tree algorithms (see Table 8.14). These ensemble
methods, and in particular boosting, are often a good way to improve the stability and quality of the model. The SPSS Modeler provides nodes explicitly dedicated to boosting models, e.g., XGBoost Tree. As mentioned above, the four regular decision tree nodes described in this book (see Table 8.14) are already able to perform boosting themselves. So, we omit a discussion of the XGBoost nodes here and refer to IBM (2019b) for details and further information.
One Additional Remark on Decision Trees
The nodes of a decision tree depend upon each other during the splitting process, as the data from which the next partition rule is selected is created by splitting the previous node. This is one reason why splitting one node into multiple segments can improve the prediction performance of a tree.
Random Forest
Random forests can be seen as an enhancement of decision trees. Loosely speaking, a random forest consists of multiple independently constructed decision trees. The outcomes of these decision trees are then aggregated to a final prediction via bagging, typically by voting; that means the class that is predicted most often by the individual trees is the final prediction of the random forest. Hence, the random forest is an ensemble learning method and is therefore very stable and robust to outliers. It can typically model very complex data structures with high accuracy and is thus one of the most popular methods used for classification problems by data miners. For further information on the random forest algorithm, we refer to Kuhn and Johnson (2013) and Hastie (2009). We leave it at this very brief introduction to random forests and omit going into detail in this book. However, a complete discussion of the method and a description of how to train a random forest model with the SPSS Modeler are provided on the corresponding website of this book, "http://www.statistical-analytics.net". See Chap. 1 for details on how to access the content on the website related to this book.
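Outside the Modeler, both ideas can be tried out quickly with scikit-learn. The sketch below fits a boosted ensemble of trees (AdaBoost) and a random forest on an artificial dataset and compares their test accuracies; it only illustrates the two ensemble strategies, not the Modeler nodes themselves.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)

# Boosting: a sequence of reweighted trees (AdaBoost, cf. Sect. 5.3.6).
boost = AdaBoostClassifier(n_estimators=50, random_state=7).fit(X_tr, y_tr)

# Bagging of independently grown trees: a random forest.
forest = RandomForestClassifier(n_estimators=100, random_state=7).fit(X_tr, y_tr)

print("AdaBoost:     ", accuracy_score(y_te, boost.predict(X_te)))
print("Random forest:", accuracy_score(y_te, forest.predict(X_te)))
```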
8.8.2
Building a Decision Tree with the C5.0 Node
There are four nodes that can be used to build one of the above-described decision trees in the SPSS Modeler. See Table 8.14. As their selection options are relatively similar, we only present the C5.0 node in this section and the CHAID node in the subsequent section, and refer to the exercises for usage of the remaining nodes. We show how a C5.0 tree is trained, based on the credit rating data "tree_credit", which contains demographic and historical loan data from bank customers and their related credit rating ("good" or "bad").
Description of the model
Stream name: C5.0_credit_rating
Based on dataset: Tree_credit.sav (see Sect. 12.1.38)
Important additional remarks: The target variable should be categorical (i.e., nominal or ordinal) for the C5.0 to work properly.
Related exercises: 3, 4
1. First, we open the stream “000 Template-Stream tree_credit”, which imports the tree_credit data and already has a Type node attached to it. See Fig. 8.326. We save the stream under a different name. 2. To set up a validation of the tree, we add a Partition node to the stream and place it between the source and the Type node. Then, we open the node and define 70% of the data as training data and 30% as test data. See Sect. 2.7.7 for a description of the Partition node. 3. Now we add a C5.0 node to the stream and connect it to the Type node. 4. In the Fields tab of the C5.0 node, the target and input fields have to be selected. As in the other model nodes, we can choose between a manual setting and automatic identification. The latter is only applicable if the roles of the variables have already been defined in a previous Type node. Here, we select the “Credit rating” variable as the target and “Partition” as the partition defining field. All the other variables are chosen as the input. See Fig. 8.327. Fig. 8.326 Template stream for the tree_credit data
Fig. 8.327 Role definition of the variables in the C5.0 node
5. In the Model tab, the parameters of the tree building process are set. We first enable the "Use partitioned data" option, in order to build the tree on the training data and validate it with the test data. Besides this common option, the C5.0 node offers two output types: in addition to the decision tree, one can choose "Rule set" as the output. In this case, a set of rules is derived from the tree and contains a simplified version of the most important information of the tree. Rule sets are handled a bit differently, as now multiple rules, or no rule at all, can apply to a particular data record. The final classification is thus done by voting, see IBM (2019b). Here, we select a decision tree as our output; see arrow in Fig. 8.328. For "Rule set" building, see Exercise 4 in Sect. 8.8.5. For the training process, three additional methods, which may improve the quality of the tree, can be selected. See Fig. 8.328. These are:
• Group symbolics. This method attempts to group categories of symbolic (categorical) variables that have a similar pattern with respect to the target variable.
• Boosting, to improve the model's accuracy. See Sect. 8.8.1 for a description of this method.
• Cross-validation, more precisely V-fold cross-validation, which is useful if the data size is small. It also generates a more robust tree. See Sect. 8.7.2 for the concept of V-fold cross-validation.
Fig. 8.328 The options for the C5.0 tree training process can be set in the Model tab
In the bottom area of the Model tab, the parameters for the pruning process are specified. See Fig. 8.328. We can choose between a "Simple" mode, with many predefined parameters, and an "Expert" mode, in which experienced users are able to define the pruning settings in more detail. We select the "Simple" mode and declare accuracy as more important than generality. In this case, the pruning process will focus on improving the model's accuracy, whereas if the "Generality" option is selected, trees that are less susceptible to overfitting would be favored. If the proportion of noisy records in the training data is known, this information can be included in the model building process and will be considered while fitting the tree. For further explanation of the "Simple" options and the "Expert" options, we refer to IBM (2019b).
6. In the Costs tab, one can specify the cost of misclassifying a data record. See Fig. 8.329. With some problems, certain misclassifications are costlier than others. For example, in the case of a pregnancy test, a diagnosis of non-pregnancy for a pregnant woman might be costlier than the other way around, since in this case the woman might return to drinking alcohol or smoking. To incorporate this into the model training, the error costs can be specified in the misclassification cost
Fig. 8.329 Misclassification cost options in the C5.0 node
matrix of the Costs tab. By default, all misclassification costs are set to 1. To change particular values, enable the "Use misclassification costs" option and enter new values into the matrix below. Here, we stay with the default misclassification settings.
7. In the Analyze tab of the C5.0 node, we further select the "Calculate predictor importance" option.
8. Now we run the stream and the model nugget appears. Views of the model nugget are presented in Sect. 8.8.3.
9. We add the usual Analysis node to the stream, to calculate the accuracy and Gini for the training set and test set. See Sect. 8.2.6 for a detailed description of the Analysis node options. The output of the Analysis node is displayed in Fig. 8.330. Both the training set and the testing set have an accuracy of about 80% and a Gini of 0.705 and 0.687, respectively. This shows quite good prediction performance, and the tree doesn't seem to be overfitting the training data.
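For readers who want to reproduce a comparable workflow in code, the sketch below mirrors the steps above with scikit-learn: a 70/30 split, a depth-limited tree, and accuracy on the training and test partitions. It assumes, hypothetically, that the tree_credit data has been exported to a file "tree_credit.csv" with the target column "Credit rating"; note also that scikit-learn's DecisionTreeClassifier is a CART-style tree, not C5.0.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical export of the tree_credit data; categorical inputs are one-hot encoded.
df = pd.read_csv("tree_credit.csv")
X = pd.get_dummies(df.drop(columns=["Credit rating"]))
y = df["Credit rating"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
tree = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)

print("Training accuracy:", accuracy_score(y_tr, tree.predict(X_tr)))
print("Test accuracy:    ", accuracy_score(y_te, tree.predict(X_te)))
```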
8.8.3
The Model Nugget
The model nuggets of all four decision trees, C5.0, CHAID, C&R Tree, and QUEST, are exactly the same with the same views, graphs, and options. Here, we present the model nuggets and graphs of these four trees, by inspecting the model nugget of the C5.0 model built in the previous Sect. 8.8.2 on the credit rating data.
Fig. 8.330 Accuracy and Gini in the C5.0 decision tree on the “tree_credit” data
Model Tab: Tree Structure Rules and Predictor Importance The model tab is split into two panels. See Fig. 8.331. If the predictor importance calculation is selected in the model settings, the usual graph that visualizes these statistics is displayed in the right panel. In this graph, we can also get a quick overview of all the variables that are used to build the tree and define the splitting criteria in at least one node. In our case with the C5.0 tree and the credit rating data, node splitting involves the three variables, “Income level”, “Number of credit cards”, and “Age”. The variables “Education” and “Car loans”, which were also selected as input variables (see Fig. 8.327), are not considered in any node partitioning, and thus, not included in the final model. The left panel in Fig. 8.331 shows the rules that define the decision tree. These are displayed in a tree structure, where each rule represents one element of the tree that the properties of a data record have to fulfill, in order to belong in this branch. Figure 8.332 shows part of the tree structure in the left panel. Behind each rule, the mode of the branch is displayed, that is, the majority target category of the branch belonging to this element. If the element ends in a leaf, the final classification is further added, symbolized by an arrow. In Fig. 8.332, for example, the last two rules
Fig. 8.331 Model view in the model nugget. The tree structure is shown on the left and the predictor importance in the right panel
Fig. 8.332 Part of the rule tree of the C5.0 tree, which classifies customers into credit score groups
define the splitting of a node using the age of a customer. Both elements end in a leaf, where one assigns the customers a "Bad" credit rating (Age ≤ 29.28) and the other a "Good" credit rating (Age > 29.28).
Detailed View of the Decision Tree and Its Nodes
In the Viewer tab, the decision tree can be viewed in more detail, with the occupation statistics of each node. Figure 8.333 shows the C5.0 tree of the credit rating data, which was built in Sect. 8.8.2. We see that the tree consists of eight nodes, with the root node on the left, five leaves, four splits, and five levels. The view of the tree can be easily modified in the top panel. There, the orientation of the tree can be specified, as well as the visualization mode of the tree nodes. Figure 8.334 shows the three visualization node types, which can be chosen in the top options (see left arrow in Fig. 8.333). The default node type is the first one in the figure, which displays the occupation statistics from the training data of this node. First, the absolute and relative frequencies of the whole training data belonging to this node are shown at the bottom. In our example, 49.94% of the training data are in node 3, which is 839 data records in total. Furthermore, the distribution of the target variable of the training data in this node is displayed. More precisely, for each category in the target, the absolute and relative frequencies of the data in the node are
Fig. 8.333 Visualization of the tree in the Viewer tab of a tree model nugget
Fig. 8.334 Node types in the tree view of a decision tree model nugget: node statistics only, node graph only, and combined node statistics & graph (the statistics panel of node 3 lists, e.g., Bad 70.560% (592), Good 29.440% (247), Total 49.940% (839))
shown. In this example, 70.56% of the training data in node 3 are from customers with a "Bad" credit rating, that is, 592 records in total. See the first graph in Fig. 8.334. Besides the statistics visualization, the node can also be set to show only a bar graph, visualizing the distribution of the data's target variable in the node. See the second graph in Fig. 8.334. As a third option, the nodes in the tree can combine both the statistics and the bar-graph visualization and present them in the tree view. See the last graph in Fig. 8.334. The choice of visualization just depends on the analyst's preference, the situation, and the audience for the results.
8.8.4
Building a Decision Tree with the CHAID Node
Here, we present how to build a classifier with the CHAID node. To do this, we reuse the credit rating data that was used in Sect. 8.8.2 to build a C5.0 decision tree. The options and structure of the CHAID node are similar to those of the C&R Tree and QUEST nodes, and the CHAID node is introduced as representative of these three nodes. The C&R Tree and QUEST nodes are described in more detail in the exercises later. A CHAID model can also be built with the Tree-AS node to run on the SPSS Analytics Server (IBM
2019c). The Tree-AS node comprises almost the same options as in the CHAID node, and we refer to IBM (2019b) for details. Description of the model Stream name Based on dataset Stream structure
CHAID_credit_rating Tree_credit.sav (see Sect. 12.1.38)
Important additional remarks: The CHAID node can also be used for regression. In order to build a classification model, the target variable has to be categorical. The nodes C&R Tree and QUEST comprise very similar options to the CHAID node. See exercises. Related exercises: 1, 2, 3
1. As in Sect. 8.8.2, we first open the stream “000 Template-Stream tree_credit” and save it under a different name. See Fig. 8.326. The stream imports the tree_credit data and defines the roles and measurements of the variables with a Type node. 2. We insert a Partition node between the Source and the Type node, to set up a validation process in the tree, and define 70% of the data as training and 30% as test. See Sect. 2.7.7 for a description of the Partition node. 3. Now we add a CHAID node to the stream and connect it to the Type node. 4. In the Fields tab of the CHAID node, the target and input fields can be selected. As in the other model nodes, we can choose between a manual setting and automatic identification. The latter is only applicable if the roles of the variables are already defined in a previous Type node. Here, we select the “Credit rating” variable as our target. The partitioning defined in the Partition node is identified automatically, and all other variables are chosen as the input. See Fig. 8.335. The tree building options are specified in the Building Options tab. In the Objective view, the general tree building parameters can be set. Here, we can choose between building a new model and continuing to train an existing one. The latter is useful if new data are available and a model has to be updated with the new data; it will save us from building a completely new one. Furthermore, we can select to build a single decision tree or to use an ensemble model to train several trees and combine them into a final prediction. The CHAID node provides a boosting and bagging procedure for creating a tree. For a description of ensemble models, boosting and bagging, we refer to Sects. 5.3.6 and 8.8.1. Here, we select to build a single tree. See Fig. 8.336.
Fig. 8.335 Role definition of the variables in the CHAID node
If an ensemble algorithm is selected as the training method, the finer options of these models can be specified in the Ensemble view. There, the number of models in the ensemble, as well as the aggregation methods, can be selected. As this is the same for other classification nodes that provide ensemble techniques, we refer to Fig. 8.227 and Table 8.12 in the chapter on neural networks (Sect. 8.6) for further details. In the Basics view, the tree growing algorithm is defined. See Fig. 8.337. By default, this is the CHAID algorithm, but the CHAID node provides another algorithm, the "Exhaustive CHAID" algorithm, which is a modification of the basic CHAID. We refer to Biggs et al. (1991) and IBM (2019a) for further details. Furthermore, in this panel the maximum tree depth is defined. See bottom arrow in Fig. 8.337. The default tree depth is five, which means the final tree has at most five levels beneath the root node. The maximum depth of the decision tree can be changed by clicking the Custom option and entering the desired depth. In the Stopping Rules panel, we define the criteria for when a node stops splitting and is declared a leaf. See Fig. 8.338. These criteria are based on the number of records in the current or child nodes, respectively. If these fall below a threshold, defined in absolute numbers or relative to the total data size, the tree will stop branching at that particular node. Here, we stay with the default settings, which refer to the minimum percentage of the whole dataset that must fall in the current node
Fig. 8.336 Selection of the general model process, i.e., a single or ensemble tree
Fig. 8.337 The tree growing algorithm is selected in the CHAID node
Fig. 8.338 Stopping rules in the CHAID node
Fig. 8.339 Misclassification cost option in the CHAID node
(2%) and child node (1%). The latter stopping rule comes into effect if one of the child nodes would contain less than 1% of the whole dataset. In the Costs panel, the misclassification cost can be adjusted. This is useful if some classification errors are costlier than others. See Fig. 8.339 for the Cost
Fig. 8.340 Advanced option in the tree building process of the CHAID node
panel and Sect. 8.8.2 for a description of the misclassification problem and how to change the default misclassification cost. In the Advanced view (see Fig. 8.340), the tree building process can be fine-tuned, mainly the parameters of the algorithm that selects the best splitting criterion. As these options should only be applied by experienced users, and the explanation of each option would be far beyond the scope of this book, we omit a detailed description here and refer the interested reader to IBM (2019b).
5. We are now finished with the parameter settings of the model building process and can run the stream. The model nugget appears.
6. Before inspecting the model nugget, we add an Analysis node to the nugget and run it to view the accuracy and Gini of the model. See Sect. 8.2.6 for a description of the Analysis node. The output of the Analysis node is displayed in Fig. 8.341. We notice that the accuracy of the training set and testing set are both slightly over 80%, and the Gini is 0.787 and 0.77, respectively. That indicates quite precise classification with no overfitting.
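As a reminder of how CHAID chooses its splits (Sect. 8.8.1), the Chi-square independence test behind the selection can be illustrated with SciPy. The contingency table below pairs a hypothetical categorical predictor with the binary credit rating; the counts are invented purely for illustration and are not taken from the tree_credit data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: categories of a candidate predictor (e.g., income level low/medium/high);
# columns: target categories ("Bad", "Good"). Counts are made up for illustration.
table = np.array([[300, 100],
                  [180, 220],
                  [ 60, 340]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}, dof = {dof}")
# CHAID compares such p-values across (merged) predictor categories and splits
# on the predictor with the most significant association with the target.
```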
Fig. 8.341 Accuracy and Gini of the CHAID decision tree on the “tree_credit” data
The Model Nugget and Trained Decision Tree
When inspecting the model nugget, we see in the Model tab on the right side that the variables included in the tree are "Income level", "Number of credit cards", and "Age". On the left side, the rule set is displayed, and we notice that several nodes are split into more than two branches, which is a property of the CHAID tree (Fig. 8.342). The complete tree structure can be viewed in the Viewer tab of the CHAID model nugget, but it is too large to show here.
8.8.5
Exercises
Exercise 1: The C&R Tree Node and Variable Generation
The dataset "DRUG1n.csv" contains data from a drug treatment study (see Sect. 12.1.11). The patients in this study all suffer from the same illness, but respond differently to medications A, B, C, X, and Y. The task in this exercise is to train a
Fig. 8.342 Model view of the CHAID model nugget on the credit rating data
CART with the C&R Tree node in the SPSS Modeler that automatically detects the most effective drug for a patient. To do so, follow the steps listed below.
1. Import the "DRUG1n.csv" dataset and divide it into a training (70%) and testing (30%) set.
2. Add a C&R Tree node to the stream. Inspect the node and compare the options with the CHAID node described in Sect. 8.8.4. What settings are different? Try to find out what their purpose is, e.g., by looking them up in IBM (2019b).
3. Build a single CART with the Gini impurity measure as the splitting selection method. What are the error rates for the training and test set?
4. Try to figure out why the accuracy of the above tree differs so much between the training and test set. To do so, inspect the K and Na variables, e.g., with a scatterplot. What can be done to improve the model's accuracy?
5. Create a new variable that describes the ratio of the variables Na and K and discard Na and K from the new stream. Why does this new variable improve the prediction properties of the tree?
6. Add another C&R Tree node to the stream and build a model that includes the ratio variable of Na and K as an input variable instead of Na and K separately. Has the accuracy changed?
Exercise 2: The QUEST Node—Boosting and Imbalanced Data
The dataset "chess_endgame_data.txt" contains 28,056 chess endgame positions with a white king, white rook, and black king left on the board (see Sect. 12.1.7). In this exercise, the QUEST node is introduced while training a decision tree on the chess data that predicts the outcome of such endgames, i.e., whether white wins or
they draw. The variable "Result for White" describes the number of moves white needs to win or to achieve a draw.
1. Import the chess endgame data with a proper Source node and reclassify the "Result for White" variable into a binary field that indicates whether white wins the game or not. What is the proportion of "draws" in the dataset? Split the data into 70% training and 30% test data. See Exercise 1 in Sect. 8.6.4.
2. Add a QUEST node to the stream and inspect its options. Compare them to the CHAID node, which is described in Sect. 8.8.4, and the C&R Tree node introduced in Exercise 1. What settings are different? Try to find out what their purpose is, e.g., by looking them up in IBM (2019b).
3. Train a decision tree with the QUEST node. Determine the accuracy and Gini values for the training and test set.
4. Build a second decision tree with another QUEST node. Use the boosting method to train the model. What are the accuracy and Gini values for this model? Compare them with the first QUEST model.
Exercise 3: Detection of Diabetes: Comparison of Decision Tree Nodes
The "diabetes_data_reduced" dataset contains blood measurements and bodily characteristics of Indian females (see Sect. 12.1.10). Build three decision trees with the C&R Tree, CHAID, and C5.0 nodes to detect diabetes (variable "class_variable") in a patient based on her blood measurements and body constitution.
1. Import the diabetes data and set up a cross-validation stream with training, testing, and validation sets.
2. Build three decision trees with the C&R Tree, CHAID, and C5.0 nodes.
3. Compare the structures of the three trees to each other. Are they similar to each other or do they branch completely differently?
4. Calculate appropriate evaluation measures and graphs to measure their prediction performance.
5. Combine these three decision trees into an ensemble model by using the Ensemble node. What are the accuracy and Gini of this ensemble model?
Exercise 4: Rule Set and Cross-Validation with C5.0
The dataset "adult_income_data" contains the census data of 32,561 citizens (see Sect. 12.1.1). The variable "income" describes whether a census participant has an income of more or less than 50,000 US dollars. Your task is to build a C5.0 rule set and decision tree to predict the income of a citizen.
1. Import the data and divide it into 70% training and 30% test data.
2. Use two C5.0 nodes to build, respectively, a decision tree and a rule set that predict the income of a citizen based on the variables collected in the census study.
3. Compare both approaches to each other by calculating the accuracy and Gini of both models. Then draw the ROC curve for each model.
8.8.6
Solutions
Exercise 1: The C&R Tree Node and Variable Generation
Name of the solution streams: Druglearn modified C&R tree
Theory discussed in: Sect. 8.8.1, Sect. 8.8.4
The final stream for this exercise should look like the stream in Fig. 8.343.
1. First, we import the dataset with the Statistics File node and connect it with a Partition node, followed by a Type node. We open the Partition node and define 70% of the data as the training and 30% as the test set. See Sect. 2.7.7 for a detailed description of the Partition node.
2. Next, we add a C&R Tree node to the stream and connect it to the Type node. We open it to inspect the properties and options. We see that, compared to the CHAID node, there are three main differences in the node's Build Options. In the Basics panel, the tree growing method selection is missing, but the pruning process with its parameters can be set. See Fig. 8.344. Remember that pruning cuts back "unnecessary" branches from the fully grown tree in order to counter the problem of overfitting. To control the pruning algorithm, the maximum risk change between the pruned tree and the larger tree can be defined. The risk describes the chance of misclassification, which is typically higher after pruning the large tree to a smaller and less complex tree. For example, if 2 is selected as the maximum difference in risk, then the estimated risk of the pruned tree is at most 2 standard errors larger. Furthermore, the maximum number of surrogates can be changed. Surrogates are a method used for handling missing values. For each split, the input fields that are most similar to the splitting field are identified and set as its
Fig. 8.343 Drug study classification exercise stream
Fig. 8.344 Enabling and definition of pruning settings in the C&R Tree node
surrogates. If a data record with a missing value in a field has to be classified, the value of a surrogate can be used to perform the split in a node where the field with the missing value would normally be needed. Surrogates are identified during the training phase. So, if missing values are expected in training or scoring, make sure that the training data also contains missing values, so that the surrogates can be determined. If no surrogates are identified, a data record with missing values automatically falls into the child node with the largest number of data records. Increasing the number of surrogates increases the flexibility of the model, but also increases memory usage and the runtime of the model training. See IBM (2019b) for more details. In the Cost & Priors panel, priors can be set for each target category, in addition to the misclassification costs. See Fig. 8.345. The prior probability describes the relative frequency of the target categories in the total population from which the training data is drawn. It gives prior information about the target variable and can be changed if, for example, the distribution of the target variable in the training data does not equal the distribution in the population. There are three possible settings:
• Base on training data: This is the default setting and uses the distribution of the target variable in the training data.
• Equal for all classes: All target categories appear equally in the population and therefore have the same prior probability.
• Custom: Specification of customized probabilities for each target category. The prior probabilities must be set in the corresponding table, and they have to add up to 1. Otherwise, an error will occur.
Fig. 8.345 Prior setting in the C&R Tree node
See IBM (2019b) for additional information on the prior settings.
The impurity measure can be defined in the Advanced panel, see Fig. 8.346. By default, this is the Gini coefficient. Further measures are "Twoing" (see Sect. 8.8.1) and "Ordered", which adds the constraint to the twoing method that only neighboring target classes of an ordinal target variable can be grouped together. If the target variable is nominal and "Ordered" is selected, then the plain twoing method is used since the target classes cannot be ordered. See IBM (2019b) for further details. Moreover, the minimum change in impurity that has to be achieved for a node to split can be set. Additionally, the proportion of data that is used to prevent overfitting can be defined; 30% is the default. See Fig. 8.346. Here, we select Gini as the impurity measure.
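As a side note, the Gini impurity that the C&R Tree minimizes at each split is simply 1 minus the sum of the squared class proportions in a node. The following minimal R sketch, independent of the Modeler, evaluates it for a few node compositions.

```r
# Gini impurity of a node with class proportions p (proportions must sum to 1)
gini_impurity <- function(p) 1 - sum(p^2)

gini_impurity(c(0.5, 0.5))   # 0.50 -> maximally impure two-class node
gini_impurity(c(0.9, 0.1))   # 0.18 -> fairly pure node
gini_impurity(c(1.0, 0.0))   # 0.00 -> pure node, nothing left to split
```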
Fig. 8.346 Setting of the impurity measure in the C&R Tree node
Fig. 8.347 Tree structure and variable importance of the CART for the drug exercise
Fig. 8.348 Accuracy of the first CART for the drug exercise
3. We run the stream, and the model nugget appears. On inspecting the model nugget, we notice in the Model tab that the "K" variable is the most important one in the tree, with "K" being the field of the first split. See Fig. 8.347. The tree itself consists of 4 levels. We add an Analysis node to the model nugget to calculate the accuracy of the model. See Sect. 8.2.6 for a description of the Analysis node. The output of the Analysis node is shown in Fig. 8.348. We see that the accuracy on the training data is pretty high at about 94%, while on the test set the accuracy is only 79%. This indicates overfitting of the model.
4. To figure out the reason for this discrepancy between the accuracy on the training and test set, we add a Plot node to the Type node to draw a scatterplot of the variables "Na" and "K". We further want to color the data of each drug (target class) differently. See Sect. 4.2 for how to create a scatterplot with the Plot node. The scatterplot is shown in Fig. 8.349. We see that the drug Y can be separated from all other drugs by a line. However, this line is not parallel to the axes. As we learned in Sect. 8.8.1, a decision tree is only able to divide the data space parallel to the axes. This can be a reason for the overfitting of the model: it separates the training data perfectly, but the horizontal and vertical decision boundaries it found
Fig. 8.349 Scatterplot of the variables “Na” and “K”
are not sufficient to classify the test data. A ratio of "Na" and "K" might therefore be more appropriate and can lead to more precise predictions.
5. We add a Derive node to the stream and connect it to the Type node to calculate the ratio variable of "Na" and "K". See Fig. 8.350 for the formula entered in the Derive node that calculates the new ratio variable. To descriptively validate the separation power of the new variable "Na_K_ratio", we add a Graphboard node to the stream and connect it with the Derive node. In this node, we select the new variable and choose the Dot plot. Furthermore, we select the data of the drugs to be plotted in different colors. See Sect. 4.2 for a description of the Graphboard node. In the Dot Plot in Fig. 8.351, we can see that the drug Y can now be perfectly separated from all other drugs by this new ratio variable. We now add a Filter node to the stream and discard the "Na" and "K" variables from the subsequent stream. See Fig. 8.352. Then we add another Type node to the stream.
6. Then, we add another C&R Tree node to the stream and connect it with the last Type node. We choose the same parameter settings as in the first C&R Tree node,
Fig. 8.350 Derive node to calculate the ratio of “Na” and “K”
except for the input variables. Here, the new variable "Na_K_Ratio" is included instead of the "Na" and "K" variables. See Fig. 8.353. We run the stream, and the model nugget appears. In the nugget, we immediately see that the new variable "Na_K_Ratio" is the most important predictor, and it is chosen as the field of the root split. See Fig. 8.354. In addition, we notice that the tree is slimmer compared to the first built tree (see Fig. 8.347), meaning that the tree has fewer branches. See the Viewer tab for a visualization of the built tree. Next, we add the standard Analysis node to the model nugget (see Sect. 8.2.6 for details on the Analysis node) and run it. In Fig. 8.355, the accuracy of the decision tree with the "Na_K_Ratio" variable included is shown. In this model, the accuracy on the test data has noticeably improved from about 79% (see Fig. 8.348) to more than 98%. The new variable thus has a higher separation ability, which improves the prediction power and robustness of a decision tree.
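The effect of such a derived feature can also be reproduced outside the Modeler. The following minimal R sketch uses a tiny invented data frame (the column names Na and K mirror the drug data, the values are made up) and expresses the Derive node's ratio formula; a single threshold on the ratio corresponds to an oblique line in the (K, Na) plane, which an axis-parallel split on the raw variables cannot represent.

```r
# Hypothetical miniature version of the drug data; values are invented.
drug_df <- data.frame(
  Na = c(0.79, 0.70, 0.55, 0.68),
  K  = c(0.03, 0.06, 0.07, 0.02)
)

# The Derive node's formula Na / K expressed in R
drug_df$Na_K_ratio <- drug_df$Na / drug_df$K

# A split "Na_K_ratio > t" is equivalent to "Na > t * K", i.e., an oblique
# decision boundary with respect to the original variables.
drug_df
```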
Fig. 8.351 Dot Plot of the ratio variable of “Na” and “K”
Fig. 8.352 Filter node to discard the “Na” and “K” variables from the following stream
Fig. 8.353 Definition of the variable roles of the second CART comprising the new “Na_K_Ratio” variable for the drug exercise
Fig. 8.354 Variable importance of the CART for the drug exercise with the new ratio variable “Na_K_Ratio”
Fig. 8.355 Accuracy of the CART for the drug data with the new ratio variable “NA_K_Ratio”
Exercise 2: The QUEST Node—Boosting & Imbalanced Data
Name of the solution streams: QUEST_chess_endgame_prediction
Theory discussed in section: Sect. 8.8.1, Sect. 8.8.4
The final stream for this exercise looks like Fig. 8.356.
Fig. 8.356 Stream of the chess endgame prediction with the QUEST node exercise
Fig. 8.357 Settings of the splitting parameter in the Advanced options in the QUEST node
1. The first part of the exercise is analogous to the first two parts of Exercise 1 in Sect. 8.6.4, so a detailed description is omitted here, and we refer to that solution for the import, reclassification, and partitioning of the data. See Figs. 8.240, 8.241, and 8.242. We recall that a chess game ends in a tie about 10% of the time, while in 90% of the games, white wins. See Fig. 8.241.
2. We now add a QUEST node to the stream and connect it with the Partition node. Comparing the model node and its options with the CHAID and C&R Tree nodes, we notice that the QUEST node is very similar to both of them, especially to the C&R Tree node. The only difference to the C&R Tree node options appears in the Advanced panel in the Build Options tab. See Fig. 8.357. In addition to the overfitting prevention setting, the significance level for the split selection method can be set here. The default value is 0.05. See the solution of Exercise 1 in Sect. 8.8.5, and we refer to Sect. 8.8.4 for a description of the remaining options and the differences to the other tree nodes. We further refer to the manual IBM (2019b) for additional information on the QUEST node.
3. We run the QUEST node and the model nugget appears. In the model nugget, we see that no splits are executed, and the decision tree consists of only one node, the root node. Thus, all chess games are classified as having the same ending, "white wins". This is also verified by the output of an Analysis node (see Sect. 8.2.6), which we add to the model nugget. See Fig. 8.358 for the performance statistics
Fig. 8.358 Performance statistics of the QUEST model for the chess endgame data
of the QUEST model. We see in the coincidence matrix that all data records (chess games) are classified as games won by white. With this simple classification, the accuracy is high, at about 90%, which is exactly the proportion of white winning games in the dataset. However, the Gini value in both the training and test set is 0, which indicates that the prediction is no better than guessing.
4. We add another QUEST node to the stream, connect it with the Partition node, and open it. We choose the same settings as in the first QUEST node, but change from a standard model to a boosting model, which can be set in the Build Options tab. See Fig. 8.359. We then run the stream and a second model nugget appears. We rearrange the two model nuggets so that they are aligned consecutively. See Fig. 8.147 for an illustration of this procedure. We add another Analysis node to the stream, connect it with the last model nugget, and set the options to calculate the coincidence matrix and Gini. After running the Analysis node, the calculated model evaluation statistics are shown in the pop-up window that appears.
Fig. 8.359 Definition of a boosted decision tree modeling in the QUEST node
Figure 8.360 shows the accuracy and coincidence matrix of the boosting model and the Gini of the decision tree with and without boosting. We immediately see that the accuracy has improved to over 99%. The coincidence matrix reveals that both draws and white winning games are detected by the boosted tree. The model's high prediction quality is also demonstrated by the extremely high Gini values (over 0.99). In conclusion, boosting massively improves the prediction power of the decision tree built by the QUEST algorithm. Compared to a tree without boosting, the minority class, a "draw" in this example, receives more attention during model building. This results in a much better model fit and, thus, prediction ability.
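The contrast between the two Gini values has a simple explanation: the Gini coefficient reported by the Analysis node is related to the area under the ROC curve by Gini = 2 * AUC - 1. A classifier that assigns every record to the majority class provides no ranking of the records, so its AUC is 0.5 and its Gini is 0, regardless of its accuracy. A one-line R illustration (the AUC value of 0.995 for the boosted model is only an assumed stand-in consistent with the reported Gini of over 0.99):

```r
# Gini coefficient derived from the area under the ROC curve
gini_from_auc <- function(auc) 2 * auc - 1

gini_from_auc(0.5)     # 0    -> constant "white wins" prediction, no ranking ability
gini_from_auc(0.995)   # 0.99 -> roughly the level reported for the boosted QUEST model
```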
Fig. 8.360 Evaluation statistics of the boosted QUEST model for the chess endgame data
Exercise 3: Detection of Diabetes—Comparison of Decision Tree Nodes
Name of the solution streams: diabetes_prediction_comparison_of_tree_nodes
Theory discussed in section: Sect. 8.8.1, Sect. 8.8.2, Sect. 8.8.4
The final stream for this exercise looks like the stream in Fig. 8.361. 1. We start by opening the template stream “015 Template-Stream_Diabetes” (see Fig. 8.362), which imports the diabetes data and has a Type node already attached to the Source node in which the roles of the variables are defined. The variable “class_variable” is set as target variable. In the Type node, we change the measurement class of the variable “class_variable” to Flag in order to calculate the Gini evaluation measure later. Next, we add a Partition node to the stream and place it between the Source and Type node. In the Partition node, we specify 60% of the data as training and
Fig. 8.361 Stream of the diabetes detection with decision trees exercise
Fig. 8.362 Template stream of the diabetes dataset
20% as test and validation set, respectively. See Sect. 2.7.7 for the description of the Partition node.
2. To build the three decision tree models, we add a C&R Tree, a CHAID, and a C5.0 node to the stream and connect each of them with the Type node. The input and target variables are automatically detected by the three tree nodes since the variable roles were already set in the Type node. Here, we use the default settings provided by the SPSS Modeler to build the models and therefore simply run the stream for the decision trees to be constructed. We align the three model nuggets so that the data is passed through the three trees successively to predict whether a patient suffers from diabetes. See Fig. 8.147 for an example of the rearrangement of the model nuggets in a line.
3. Inspecting the three model nuggets and the structures of the constructed decision trees, we first see that the CART, built by the C&R Tree node, consists of only a single split and is therefore a very simple tree. See Fig. 8.363. This division is based on the variable "glucose_concentration". So the Gini partition criterion is not able to find additional partitions of the data that would improve the accuracy of the model.
Fig. 8.363 Fitted CART for diabetes prediction
Fig. 8.364 Fitted CHAID for diabetes prediction
Fig. 8.365 Fitted C5.0 for diabetes prediction
The variable “glucose_concentration” is also the first variable according to which the data are split in the CHAID and C5.0. The complete structures of these trees (CHAID and C5.0) are shown in Figs. 8.364 and 8.365. These two trees are more complex with more branches and subtrees. When comparing the CHAID and C5.0 trees to each other, we see that the structure and node splits of the CHAID can be also found in the C5.0. The C5.0, however, contains further splits and thus divides the data into finer subtrees. So, where the CHAID has 3 levels beneath the root, the C5.0 has 6 tree levels. 4. At the last model nugget, we add an Analysis and an Evaluation node to calculate and visualize prediction performance of the three models. See Sects. 8.2.6 and 8.2.7 for a description of these two nodes. We run these two nodes to view the evaluation statistics. The accuracy and Gini values are shown in Fig. 8.366. We
Fig. 8.366 Accuracy and Gini values of all three decision trees for diabetes prediction
see that the accuracy is nearly the same for all models in each dataset, except for the training set of the C&R Tree. The Gini values paint a similar picture: the Gini coefficient of the CART model is significantly lower in all datasets than that of the other two models. The CHAID and C5.0, however, have nearly the same Gini coefficients, with the C5.0 model slightly ahead in the training and validation set. This indicates that a more complex tree might increase prediction ability. In Fig. 8.367, the Gini values are visualized by the ROC curves for the three models and sets. This graph confirms our conjecture that CHAID and C5.0 are pretty similar in prediction performance (the ROC curves of these models have nearly the same shape), while the CART curve is located far beneath the other two curves, indicating a less ideal fit to the data. However, none of the three curves is consistently located above the other two. So, the three models perform differently in some regions of the data space. Even the CART outperforms the other two models in some cases. See the curves of the
Fig. 8.367 ROC curves of all three decision trees for diabetes prediction
Fig. 8.368 Settings of the Ensemble node for the model that predicts diabetes
test set in Fig. 8.367. Thus, an ensemble model of the three trees might improve the quality of the prediction.
5. To combine the three trees into an ensemble model, we add an Ensemble node to the stream and connect it with the last model nugget. In the Ensemble node, we set "class_variable" as target for the ensemble and choose "Voting" as the aggregation method. See Fig. 8.368 for the setting of these options,
Fig. 8.369 Analysis node output statistics for the Ensemble node
and Table 8.12 for an extract of the aggregation methods available in the Ensemble node. We add another Analysis node to the stream and connect it with the Ensemble node. Then, we run the Analysis node. The output statistics are presented in Fig. 8.369. We note that the accuracy has not changed much compared to the other models, but the Gini has increased for the test and validation set. This indicates that the ensemble model balances out the errors of the individual models and is thus more precise in the prediction of unknown data.
Exercise 4: Rule Set and Cross-validation with C5.0
Name of the solution streams: Income_C5_RuleSet
Theory discussed in section: Sect. 8.8.1, Sect. 8.8.2
The final stream for this exercise looks like the stream in Fig. 8.370. 1. First we import the dataset with the Var. File node and connect it with a Type node. In the Type node, we set the measurement of the variable “income” to Flag and the role to target. Then we add a Partition node, open it, and define 70% of the data as training and 30% as test set. See Sect. 2.7.7 for a detailed description of the Partition node. 2. We now add two C5.0 nodes to the stream and connect them with the Partition node. Since the roles of the variables are already defined in the Type node, the C5.0 nodes automatically identify the target and input variables. We use the
Fig. 8.370 Solution stream of the income prediction exercise
Fig. 8.371 Definition of the rule set output type in the C5.0 node
default model settings provided by the SPSS Modeler and so just have to change the output type of one node to “Rule Set”. See Fig. 8.371. Now, we run the stream and the two model nuggets appear. We then rearrange them in a line, so the models are applied successively to the data. As a result, the models can be more easily compared to each other. See Fig. 8.147 for an example of the rearrangement of model nuggets.
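To make the difference between the two output types concrete, the following hedged R sketch contrasts a nested tree path with a flat IF-THEN rule for a single hypothetical record; the variable names and thresholds are invented and are not taken from the income data.

```r
# Invented record; the fields and thresholds are purely hypothetical.
record <- list(education_num = 14, hours_per_week = 45)

# Decision tree: one nested path from the root down to a leaf
tree_pred <- if (record$education_num > 12) {
  if (record$hours_per_week > 40) ">50K" else "<=50K"
} else {
  "<=50K"
}

# Rule set: the same logic expressed as a flat, self-contained IF-THEN rule
rule_pred <- if (record$education_num > 12 && record$hours_per_week > 40) ">50K" else "<=50K"

c(tree = tree_pred, rule_set = rule_pred)
```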
The final constructed decision tree is very complex and large, with a depth of 23, meaning the rule set contains a large number of individual rules. The rule set as well as the decision tree are too large and complex to describe here and have therefore been omitted. However, we encourage the reader to examine the built models in order to understand their structure and the differences between them.
3. To compare the two models to each other, we first add a Filter node to the stream and connect it with the last model nugget. This node is added simply to rename the prediction fields, which are then more easily distinguishable in the Analysis node. See Sect. 2.7.5 for the description of the Filter node. Afterwards, we add an Analysis and an Evaluation node to the stream and connect them with the Filter node. See Sects. 8.2.6 and 8.2.7 for a description of the Analysis and Evaluation node options. We then run these two nodes. The Analysis output with the accuracy and Gini of the two models is shown in Fig. 8.372. We see that the accuracy of the C5.0 decision tree and the C5.0 rule set
Fig. 8.372 Analysis node output for the C5.0 rule set and decision tree models for the income data
Fig. 8.373 ROC curves of the two C5.0 models (decision tree and rule set)
models is similar, with about a 12% error rate in the training set and a 14% error rate in the test set. However, the decision tree model performs slightly better, as additionally confirmed by the Gini values, which are a bit higher for both datasets in this case. This indicates that the prediction mechanisms of the rule set and the decision tree are close to each other, but have minor differences, which are evident in the evaluation statistics. The ROC curves of the two models are displayed in Fig. 8.373. As can be seen, the curve of the decision tree model lies slightly above the curve of the C5.0 rule set model. Hence, the C5.0 decision tree provides better prediction power than the rule set model.
8.9 The Auto Classifier Node
As for the regression and clustering methods (Chaps. 5 and 7), the SPSS Modeler also provides a node, the Auto Classifier node, which comprises several different classification methods and can thus build various classifiers in a single step. This Auto Classifier node provides us with the option of trying out and comparing a variety of different classification approaches without adding the particular nodes and setting the parameters of each model's algorithm individually, which can result in
Fig. 8.374 Model types and nodes supported by the Auto Classifier node. The top list contains model types discussed in detail in this chapter. The bottom list shows further model types and nodes that are included in the Auto Classifier node
a very complex stream with many different nodes. Finding the optimal parameters of a method, e.g., the best kernel and its parameters for an SVM or the number of neurons in a neural network, can be extremely cumbersome; with the Auto Classifier node, this search is reduced to a very clear process in a single node. Furthermore, the utilization of the Auto Classifier node is an easy way to consolidate several different classifiers into an ensemble model. See Sects. 5.3.6 and 8.8.1 and the references given there for a description of ensemble techniques and modeling. All models built with the Auto Classifier node are automatically evaluated and ranked according to a predefined measure. So the best performing models can be easily identified and added to the ensemble. Besides the classification methods and nodes introduced in this chapter, the Auto Classifier node comprises additional nodes and model types, like the Bayesian Network, Decision List, Random Forest, or boosting nodes. We recommend Ben-Gal (2008), Rivest (1987), Kuhn and Johnson (2013), Hastie (2009), and Zhou (2012) for a description of these classification algorithms, and IBM (2019b) for a detailed introduction to their nodes and options in the SPSS Modeler. See Fig. 8.374 for a list of all nodes included in the Auto Classifier node. Before turning to the description of the Auto Classifier node and how to apply it to a dataset, we would like to point out that building a huge number of models is very time-consuming. One must therefore pay attention to the number of different parameter settings chosen in the Auto Classifier node, since a large number of different parameter values leads to a huge number of models. The building process may take a very long time in this case, sometimes hours.
8.9.1 Building a Stream with the Auto Classifier Node
Here, you will learn how to use the Auto Classifier node effectively to build different classifiers for the same problem in a single step and identify the optimal models for our data and mining task. A further advantage of this node is its ability to unite the best classifiers into an ensemble model, combining the strengths of different classification approaches and counteracting their weaknesses. Furthermore, cross-validation to find the optimal parameters of a model can be easily carried out within the same stream. We introduce the Auto Classifier node by applying it to the Wisconsin breast cancer data, and train classifiers that are able to distinguish benign from malignant cancer samples.
Description of the model
Stream name: Auto_classifier_node
Based on dataset: WisconsinBreastCancerData (see Sect. 12.1.40)
Important additional remarks: The target variable must be nominal or binary in order to use the Auto Classifier node.
Related exercises: 1, 2
1. First, we open the template stream “016 Template-Stream_Wisconsin BreastCancer” (see Fig. 8.375) and save it under a different name. The target variable is called “class” and takes values “2” for benign and “4” for malignant samples. The template stream imports the data and already has a Type node attached to it. In the Type node, the variable “class” is already defined as target variable with measurement Flag. Except for the “Sample Code” variable, which is set to none since it just labels the cancer samples, all other variables are set as input variables. 2. To set up validation of the models, we add a Partition node to the stream and insert it between the Source and Type node. We split the data into 70% training and 30% testing data. The Partition node is described in more detail in Sect. 2.7.7. 3. Now, we add the Auto Classifier node to the canvas, connect it with the Type node, and open it with a double-click. The target and input variables can be specified in the Fields tab. See Fig. 8.376. If the variable roles are already set in a previous Type node, these are automatically recognized by the Auto Classifier
Fig. 8.375 Template stream of the Wisconsin breast cancer data
Fig. 8.376 Definition of target, input variables, and the partition field in the Auto Classifier node for the Wisconsin breast cancer data
Fig. 8.377 Model tab in the Auto Classifier node with the criteria that models should be included in the ensemble
node if the “Use type node setting” option is chosen in the Fields tab. For description purposes, we set the variable roles here manually. So we select the “Use custom settings” option and define “class” as target, “Partition” as partitioning identification variable, and all remaining variables, except for “Sample Code”, as input variables. This is shown in Fig. 8.376. 4. In the Model tab, we enable the “Use partitioned of data” option. See the top arrow in Fig. 8.377. This option will lead the Model to be built based on the training data alone. In the “Rank models by” selection field, we can choose the evaluation measure with which the models are compared with each other and ordered in descending order of their performance. Possible measures are listed in Table 8.15. Some of the rank measures are only available for a binary (Flag) target variable. Here, we choose the “Area under the curve” (AUC) rank measure.
Table 8.15 Rank criteria provided by the Auto Classifier node
• Overall accuracy (target type: nominal, flag): Percentage of correctly predicted data records.
• Area under the curve (AUC) (target type: flag): Area under the ROC curve. A higher value indicates a better fitted model. See Sects. 8.2.5 and 8.2.8 for details on the ROC curve and the area under the curve measure.
• Profit (target type: flag): Sum of profits across cumulative percentiles. The profit of a data record is the difference between its revenue and cost. The revenue is the value associated with a hit, and the cost is the value associated with a misclassification. These values can be set at the bottom of the Model tab. See Sect. 8.2.7 for more information on the profit measure.
• Lift (target type: flag): The hit ratio in cumulative quantiles relative to the overall sample. The percentiles used to calculate the lift can be defined at the bottom of the Model tab. See Sect. 8.2.7 for more information on the lift measure.
• Number of fields (target type: nominal, flag): Number of input fields used in the model.
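To make the lift criterion more tangible, here is a small self-contained R illustration with invented scores and outcomes: the lift in the top 30% is the hit rate among the 30% highest-scoring records divided by the overall hit rate.

```r
# Toy data: model scores and actual outcomes (1 = hit); all values are invented.
scores <- c(0.95, 0.90, 0.85, 0.80, 0.40, 0.35, 0.30, 0.25, 0.20, 0.10)
y      <- c(1,    1,    0,    1,    0,    1,    0,    0,    0,    0)

top  <- order(scores, decreasing = TRUE)[1:3]   # top 30% of the 10 records
lift <- mean(y[top]) / mean(y)                  # (2/3) / (4/10)
lift                                            # about 1.67
```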
With the “rank” selection, we can choose whether the models should be ranked by the evaluation of the training or the test partition, and how many models should be included in the final ensemble. Here, we choose that the ensemble should include the 3 best performing models on the test set according to the selected rank score, here, AUC. See the bottom arrow in Fig. 8.377. At the bottom of the Model tab, we can further set the revenue and cost values used to calculate the profit. Furthermore, a weight can be specified to adjust the results. In addition, the percentile considered for the lift measure calculations can be set (see Table 8.15, Sect. 8.2.7, and IBM (2019b)). The default here is 30. In the Model tab, we can also choose to calculate the predictor importance, and we recommend enabling this option each time. 5. The next tab is the “Expert” tab. Here, the classification models, which should be calculated and compared with each other, can be specified. See Fig. 8.378. We can include a classification method by checking its box on the left. All models marked in this way are built on the training set, compared to each other, and the best ranked are selected and added in the final ensemble model. We can further specify multiple settings for one model type, in order to include more model variations and to find the best model of one type. Here, we also want to consider a boosted Neural Network in addition to the standard approach. How to set the parameter to include this additional Neural Network in the Auto Classifier node building process is described below. To include more models of the same type in the building process, we click on the “Model parameters” field next to the particular model, Neural Net in this example, and choose the option “Specify” in the opening selection bar (Fig. 8.378). A window pops up which comprises all options of the particular node. See Fig. 8.379 for the options of the Neural Net node.
Fig. 8.378 Selection of the considered classification methods in the Auto Classifier node
In this window, we can specify the parameter combinations which should be considered in separate models. Each parameter or option can therefore be assigned multiple values, and the Auto Classifier node then builds a model for every possible combination of these parameters. For our case, we also want to consider a boosted neural network. So we click on the “Options” field next to the “Objectives” parameter and select the “Specify” option in the drop-down menu. In the pop-up window, we select the boosting and standard model options. This is shown in Fig. 8.380. Then we click the OK button.
Fig. 8.379 Parameter setting of the Neural Net node in the Auto Classifier node
Fig. 8.380 Specification of the modeling objective type for a neural network in the Auto Classifier node
This will enable the Auto Classifier node to build a neural network with and without boosting. The selected options are shown in the “Option” field next to the particular “Parameter” field in the main settings window of the Neural Net node. See Fig. 8.379. 6. Then, we go to the “Expert” tab to specify the aggregation type for the boosting method in the same manner as for the model objective, i.e., boosting and standard
Fig. 8.381 Specification of the aggregation methods for the boosting model in the Neural Net algorithm settings in the Auto Classifier node
modeling procedure. We choose two different methods here, the “Voting” and “Highest mean probability” technique. So, a neural network is constructed for each of these two aggregation methods (Fig. 8.381). 7. In summary, we have specified two modeling techniques and two aggregation methods for ensemble models. The Auto Classifier node now takes all of these options and builds a model for each of the combinations. So four neural networks are created in this case, although the aggregation method has no influence on the standard modeling process. So, imprecise parameter setting can result in countless irrelevant model builds and will massively increase processing time. Furthermore, identical models can be included in the ensemble and so exclude models with different aspects that might improve the prediction power of the ensemble. The number of considered models in the Auto Classifier node is displayed right next to the model type field in the Expert tab of the Auto Classifier node, see Fig. 8.378.
The Auto Classifier node takes all specified options and parameters of a particular node and builds a model for each of the combinations. For example, in the Neural Net node the modeling objective is chosen as “standard” and “boosting”, and the aggregation methods “Voting” and “Highest mean probability” are selected. Although the aggregation methods are only relevant for a boosting model, 4 different models are created by the Auto Classifier node:
– Standard neural network with voting aggregation
– Standard neural network with highest mean probability aggregation
– Boosting neural network with voting aggregation
– Boosting neural network with highest mean probability aggregation
Imprudent parameter setting can result in countless irrelevant model builds and so massively increase the processing time and memory. Furthermore, identical models (in this case: standard neural net with voting and highest mean probability aggregation) can be included in the ensemble if they outperform the other models. In this case, models with different approaches and aspects that might improve the prediction or balance overfitting might be excluded from the ensemble.
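The number of models the node generates is simply the product of the number of values specified for each option. A short R illustration of the Neural Net example above:

```r
# One model is built for every combination of the specified options.
objectives  <- c("standard", "boosting")
aggregation <- c("voting", "highest mean probability")

expand.grid(objective = objectives, aggregation = aggregation)
# 4 rows -> 4 neural networks, even though the aggregation method is irrelevant
# for the standard (non-boosted) models
```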
8. Necessary criteria that a model has to fulfill to be considered as a candidate for the ensemble can be specified in the Discard tab of the Auto Classifier node. If a model fails to satisfy one of these criteria, it is automatically discarded from the subsequent process of ranking and comparison. The Discard tab and its options are shown in Fig. 8.382. The discarding criteria correspond to the ranking measures and are:
• Overall accuracy is less than: A model is ignored for the ensemble if its accuracy is less than the specified minimum value.
• Number of fields is greater than: A model is discarded from the ensemble if it contains more variables than the defined maximum. This criterion guarantees a slim model.
• Area under the curve is less than: A model is only considered for the ensemble model if its AUC is greater than the specified value.
• Lift is less than: If the lift is less than the defined value, the model is ignored for the ensemble.
• Profit is less than: Only models with a profit greater than the defined minimum value are valid for the ensemble model.
In our example of the Wisconsin breast cancer data, we discard all models that have an accuracy lower than 80%, so that the final model has a minimum hit rate. See Fig. 8.382.
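Conceptually, the discard step is nothing more than a filter applied to the table of candidate models before the ranking, as the following R sketch with invented model names and numbers illustrates.

```r
# Toy table of candidate models; all names and numbers are invented.
candidates <- data.frame(
  model    = c("Logistic", "Neural Net", "KNN", "C5.0"),
  accuracy = c(0.97, 0.96, 0.78, 0.95),
  auc      = c(0.99, 0.98, 0.81, 0.97)
)

subset(candidates, accuracy >= 0.80)   # "KNN" is discarded before the ranking step
```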
Fig. 8.382 Definition of the discard criteria in the Auto Classifier node
9. In the Settings tab, the aggregation method can be selected: this combines all component models in the ensemble model generated by the Auto Classifier node into a final prediction. See Fig. 8.383. The most important aggregation methods are listed in Table 8.12. Besides these methods, the Auto Classifier node provides weighted voting methods, more precisely:
• Confidence-weighted voting: This aggregation method is similar to the "normal" voting, but instead of counting the predicted target categories, their predicted probabilities are summed up. The target category with the highest cumulated probability wins. For example, model I predicts class A with probability 0.8, and models II and III predict class B with probabilities 0.55 and 0.6. Then the confidence-weighted voting value for class A is 0.8, and for class B it is 1.15. Hence, the ensemble consisting of these three models outputs B as the final prediction.
• Raw propensity-weighted voting: This aggregation method is equivalent to confidence-weighted voting, but the weights are the propensity scores. See IBM (2019b) for a description of propensity scores. This method is only available for Flag target variables.
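The worked example above can be reproduced in a few lines of R, independent of the Modeler:

```r
# Confidence-weighted voting for the three-model example in the text
votes <- data.frame(
  model = c("I", "II", "III"),
  class = c("A", "B", "B"),
  prob  = c(0.80, 0.55, 0.60)
)

summed <- tapply(votes$prob, votes$class, sum)   # A: 0.80, B: 1.15
names(which.max(summed))                         # "B" -> final ensemble prediction
```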
Fig. 8.383 Definition of the aggregation method for the ensemble model in the Auto Classifier node
We refer to IBM (2019a) for further details on these methods. We select "Confidence-weighted voting" here. The ensemble method can also be changed later in the model nugget. See Fig. 8.387.
10. When we have set all the model parameters and Auto Classifier options of our choice, we run the model, and the model nugget appears in the stream. For each possible combination of the selected model parameter options, the Modeler now generates a classifier; all classifiers are compared to each other and then ranked according to the specified criteria. If a model is ranked high enough, among the top three models here, it is included in the ensemble. The description of the model nugget can be found in Sect. 8.9.2.
11. We add an Analysis node to the model nugget to calculate the evaluation statistics, i.e., accuracy and Gini. See Sect. 8.2.6 for the description of the Analysis node and its options. Figure 8.384 shows the output of the Analysis node. We see that the accuracy in both the training and test set is pretty high at about 97%. Furthermore, the Gini values are nearly 1, which indicates an excellent prediction ability with the given input variables.
Fig. 8.384 Analysis output with evaluation statistics from both the training and the test data for the Wisconsin breast cancer classifier generated by the Auto Classifier node
8.9.2 The Auto Classifier Model Nugget
In this short section, we will take a closer look at the model nugget generated by the Auto Classifier node and the graphs and options it provides.
Model Tab and the Selection of Models Included in the Ensemble
The top-ranked models built by the Auto Classifier node are listed in the Model tab. See Fig. 8.385. In this case, the ensemble model consists of the top three models to predict breast cancer, as suggested by the Auto Classifier node. These models are a Logistic Regression, a Discriminant, and a Neural Net classifier. The models are ordered by their AUC on the test set, as this is the rank criterion chosen in the node options (see previous section). The order can be manually changed in the drop-down menu on the left, labeled "Sort by", in Fig. 8.385. In addition to the AUC measurement, all other ranking methods as well as the build time are shown. We can change the basis of the ranking measure calculations to the training data; all ranking and fitting statistics will then be based on it. See the right arrow in Fig. 8.385. The test set, however, has the advantage that the performance of the models is verified on unknown data.
Fig. 8.385 Model tab of the Auto Classifier model nugget. Specification of the models in the ensemble used to predict target class
To determine whether each model is a good fit for the data, we recommend looking at the individual model nuggets manually to inspect the parameter values. Double-clicking on a model will open a new window of the particular model nugget, which provides all the graphs, quality statistics, decision boundary equations, and other model-specific information. This is highlighted by the left arrow in Fig. 8.385. Each of the model nuggets is introduced and described separately in the associated chapter. In the furthest left column, labeled "Use?", we can choose which of the models should contribute to the ensemble model. More precisely, each of the enabled models takes the input data and estimates the target value individually. Then, all outputs are aggregated according to the method specified in the Auto Classifier node into one single output. This process of aggregating can prevent overfitting and minimize the impact of outliers, which leads to more reliable predictions. To the left of the models, the distribution of the target variable and the predicted outcome is shown for each model individually. Each graph can be viewed in a larger, separate window by double-clicking on it.
Predictor Importance and Visualization of the Prediction Accuracy
In the "Graph" tab, the accuracy of the ensemble model prediction is visualized by a bar plot on the left. Each bar represents a category of the target variable, and its height the occurrence frequency in the data. So the bar plot is a visualization of the distribution of the target variable. The bars are also colored, with each color representing a category predicted by the ensemble model. This allows you to easily see the overall accuracy, as well as identify classes with numerous misclassifications, which are otherwise harder to detect. See Fig. 8.386. In the graph on the right, the importance of the predictors is visualized in the standard way. See Sect. 5.3.3 for predictor importance and the shown plot. The predictor importance of the ensemble model is calculated on the averaged output data.
Fig. 8.386 Graph tab of the Auto Classifier model nugget. Predictor importance and bar plot that shows the accuracy of the ensemble model prediction
Setting of the Ensemble Aggregation Method
The aggregation method, which combines the predictions of the individual models into a single classification, can be specified in the "Settings" tab. See Fig. 8.387. These are the same options as in the Settings tab of the Auto Classifier node, and we therefore refer to Sect. 8.9.1 for a more detailed description of this tab.
Fig. 8.387 Settings tab of the Auto Classifier model nugget. Specification of the ensemble aggregation method
8.9.3 Exercises
Exercise 1: Finding the Best Models for Credit Rating with the Auto Classifier Node
The "tree_credit" dataset (see Sect. 12.1.38) comprises demographic data and the loan history of bank customers, as well as a prognosis for granting a credit ("good" or "bad"). Determine the best classifiers to predict the credit rating of a bank customer with the Auto Classifier node. Use the AUC measure to rank the models. What is the best model node and its AUC value, as suggested by the Auto Classifier procedure? Combine the top five models to create an ensemble model. What is its accuracy and AUC?
Exercise 2: Detection of Leukemia in Gene Expression Data with the SVM—Determination of the Best Kernel Function
The dataset "gene_expression_leukemia_short.csv" contains gene expression measurements at 39 human genome positions of various leukemia patients (see Sect. 12.1.17). Your task is to build an SVM classifier that will determine whether each patient will be diagnosed with blood cancer. To do this, combine all leukemia types into a new variable value which only indicates that the patient has cancer. Use the Auto Classifier node to determine the best kernel function to be considered in the SVM. What are the AUC values, and which kernel function should be used in the final SVM classifier?
8.9.4 Solutions
Exercise 1: Finding the Best Models for Credit Rating with the Auto Classifier Node
Name of the solution streams: tree_credit_auto_classfier_node
Theory discussed in section: Sect. 8.9.1
The final stream for this exercise looks like the stream in Fig. 8.388. 1. We start by opening the stream “000 Template-Stream tree_credit”, which imports the tree_credit data and already has a Type node attached to it, and save it under a different name. See Fig. 8.389. 2. We add a Partition node to the stream and place it between the Source and Type node. In the Partition node, we declare 70% of the data as training and the remaining 30% as test data. We then open the Type node and define the measurement type of the variable “Credit rating” as Flag and its role as target. 3. Now, we add an Auto Classifier node to the stream and connect it with the Type node. The variable roles are automatically identified. This means that nothing has to be changed in the Field tab settings.
Fig. 8.388 Stream of the credit rating prediction with the Auto Classifier node exercise
Fig. 8.389 Template stream for the tree_credit data
4. In the Model tab, we select the AUC as rank criterion and set the number of models to use to 5, since the final ensemble model should comprise 5 different classifiers. See Fig. 8.390.
5. We stick with the default selection of models that should be built by the Auto Classifier and considered as candidates for the final ensemble model. So, nothing has to be done in the Expert tab. See Fig. 8.391 for the pre-defined model selection of the Auto Classifier node.
6. Since we only want to consider models with high prediction ability, we define two discard criteria. For a model to be a candidate for the ranking and finally the ensemble model, it needs a minimum accuracy of 80% and an AUC above 0.8. See Fig. 8.392.
Fig. 8.390 Definition of the ranking criteria and number of used models in the Auto Classifier node for the credit rating exercise
7. Now we run the stream and the model nugget appears.
8. To view the top five ranked classifiers built by the Auto Classifier node, we open the model nugget. The best five models, as suggested by the Auto Classifier node, are, in the order of ranking, a Logistic Regression, LSVM, CHAID, Bayesian Network, and XGBoost Linear. The AUC values range from 0.888 for the logistic regression to 0.881 for the XGBoost Linear model. The accuracy of these models is also quite high, with over 80% for all models. See Fig. 8.393. Observe that three of the top five models are linear (logistic regression, LSVM, and XGBoost Linear); thus, the problem of separating the customers with a "good" credit rating from the ones with a "bad" credit rating seems to be a linear one.
9. To evaluate the performance of the ensemble model that comprises these five models, we add an Analysis node to the stream and connect it with the model nugget. We refer to Sect. 8.2.6 for information on the Analysis node options.
Fig. 8.391 Default selection of the models to be considered in the building and ranking process
Figure 8.394 presents the accuracy and AUC of the ensemble model. As with all individual components, the accuracy of the ensemble model is a little above 80% for the training and test set. The AUC for the test set, also at 0.888, is in the same range as that of the best ranked model, i.e., the logistic regression.
Fig. 8.392 Definition of the discard criteria in the Auto Classifier node for the credit rating exercise
Fig. 8.393 Top five classifiers to predict the credit rating of a bank customer built by the Auto Classifier node
Fig. 8.394 Analysis node with performance statistics of the ensemble model that classifies customers according to their credit rating
Exercise 2: Detection of Leukemia in Gene Expression Data with the SVM—Determination of the Best Kernel Function
Name of the solution streams: gene_expression_leukemia_short_svm_kernel_finding_auto_classifier_node
Theory discussed in section: Sect. 8.9.1, Sect. 8.5.1, Sect. 8.5.4 (Exercise 1)
The stream displayed in Fig. 8.395 is the complete solution of this exercise.
1. The first part of this exercise is the same as in Exercise 1 in Sect. 8.5.4. We therefore omit a detailed description of the import, partitioning, and reclassification of the data into healthy and leukemia patients and refer to the first steps of the solution of the above-mentioned exercise. After following the steps of that solution, the stream should look like the one in Fig. 8.396. This is our new starting point.
Fig. 8.395 Complete stream of the best kernel finding procedure for the SVM leukemia detection classifier
Fig. 8.396 Sub-stream of data preparation of the solution stream of this exercise
2. We add an Auto Classifier node to the stream and connect it with the last Type node. In the Auto Classifier node, we select the "Area under the curve" rank criterion in the Model tab and set the number of used models to 4, as four kernel functions are provided by the SPSS Modeler. See Fig. 8.397.
3. In the Expert tab, we check the box next to the SVM model type and uncheck the boxes of all other model types. See Fig. 8.398. We then click on the model parameter field of the SVM and select "Specify". See the arrow in Fig. 8.398.
4. The parameter option window of the SVM node opens, and we go to the Expert tab. There, we change the Mode parameter to "Expert" so that all other options become changeable. Afterwards, we click on the "Options" field of the Kernel type parameter and click on "Specify". In the pop-up window which appears, we can select the kernel methods that should be considered in the building process of the Auto Classifier node. As we want to identify the best among all kernel functions, we check all the boxes: the RBF, Polynomial, Sigmoid, and Linear kernel. See Fig. 8.399. We now click the OK buttons until we are back at the Auto Classifier node. For each of these kernels, an SVM
Fig. 8.397 Selection of the rank criteria and number of considered models in the Auto Classifier node for the SVM kernel finding exercise
is constructed, which means four in total. This is displayed in the Expert tab, see Fig. 8.398.
5. As the target variable and input variables are already specified in a Type node, the Auto Classifier node identifies them, and we can run the stream without additional specifications.
6. We open the model nugget that appears to inspect the evaluation and ranking of the four SVMs with different kernels in the Model tab (Fig. 8.400). We see that the SVM named "SVM 2" has the highest AUC value, which is 0.953. This model is the SVM with a polynomial kernel of degree 3. This can be checked by double-clicking on the "SVM 2" model nugget and going to the Summary tab (see Fig. 8.401). The values of the models "SVM 1" (RBF kernel) and "SVM 4" (linear kernel), at 0.942 and 0.925, respectively, are not far from that of the polynomial kernel SVM. The AUC value of the last SVM (sigmoid kernel), however, is considerably lower at 0.66. Thus, the prediction quality of this last model is
Fig. 8.398 Selection of the SVM model in the Auto Classifier node for the kernel finding exercise
not as good as that of the other three. By looking at the bar plot of each model, we see that the "SVM 3" model classifies all patients as leukemia patients, whereas the other three models are able to recognize healthy patients. This explains the phenomenon of the much lower AUC of "SVM 3".
7. To recap, the SVM with a polynomial kernel function of degree 3 has the best performance in detecting leukemia from gene expression data among the four
Fig. 8.399 Specification of the SVM kernel functions considered during the model building process of the Auto Classifier node
Fig. 8.400 Model tab of the Auto Classifier model nugget with the four ranked SVMs with different kernel functions
SVMs considered, and the Modeler suggests using this kernel in an SVM model for this problem. However, the RBF and Linear kernel models are also good choices, as they perform nearly as well on the test set. Additional fine-tuning of the parameters can further increase the quality of the models, which can lead to different ranks for the used kernels.
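For reference, the four kernel types offered for selection correspond to the following standard kernel functions; the R sketch below uses the common textbook parameterization with gamma, coef0, and degree, which may differ in detail from the exact form used internally by the Modeler.

```r
# Standard forms of the four SVM kernel functions (x and z are numeric vectors)
linear_kernel  <- function(x, z) sum(x * z)
poly_kernel    <- function(x, z, gamma = 1, coef0 = 0, degree = 3) (gamma * sum(x * z) + coef0)^degree
rbf_kernel     <- function(x, z, gamma = 1) exp(-gamma * sum((x - z)^2))
sigmoid_kernel <- function(x, z, gamma = 1, coef0 = 0) tanh(gamma * sum(x * z) + coef0)

x <- c(1, 2); z <- c(2, 1)
c(linear = linear_kernel(x, z), poly = poly_kernel(x, z),
  rbf = rbf_kernel(x, z), sigmoid = sigmoid_kernel(x, z))
```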
Fig. 8.401 Summary tab of the SVM model nugget where the kernel type and further model specifics can be viewed
References
Allison, P. D. (2014). Measures of fit for logistic regression. Retrieved September 19, 2015 from http://support.sas.com/resources/papers/proceedings14/1485-2014.pdf
Azzalini, A., & Scarpa, B. (2012). Data analysis and data mining: An introduction. Oxford: Oxford University Press.
Backhaus, K., Erichson, B., Plinke, W., & Weiber, R. (2011). Multivariate Analysemethoden: Eine anwendungsorientierte Einführung (Springer-Lehrbuch, 13). Berlin: Springer.
Ben-Gal, I. (2008). Bayesian networks. In F. Ruggeri, R. S. Kenett, & F. W. Faltin (Eds.), Encyclopedia of statistics in quality and reliability. Chichester, UK: Wiley.
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is "nearest neighbor" meaningful? In G. Goos, J. Hartmanis, J. van Leeuwen, C. Beeri, & P. Buneman (Eds.), Database theory—ICDT'99, lecture notes in computer science (Vol. 1540, pp. 217–235). Berlin: Springer.
Biggs, D., de Ville, B., & Suen, E. (1991). A method of choosing multiway partitions for classification and decision trees. Journal of Applied Statistics, 18(1), 49–62.
Box, G. E. P. (1949). A general distribution theory for a class of likelihood criteria. Biometrika, 36(3–4), 317–346.
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. Boca Raton, FL: CRC Press.
Cheng, B., & Titterington, D. M. (1994). Neural networks: A review from a statistical perspective. Statistical Science, 9(1), 2–30.
Cormen, T. H. (2009). Introduction to algorithms. Cambridge: MIT Press.
Dekking, F. M., Kraaikamp, C., Lopuhaä, H. P., & Meester, L. E. (2005). A modern introduction to probability and statistics: Understanding why and how. London: Springer.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.
Esposito, F., Malerba, D., Semeraro, G., & Kay, J. (1997). A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5), 476–493.
Fahrmeir, L. (2013). Regression: Models, methods and applications. Berlin: Springer.
Fawcett, T. (2003). ROC graphs: Slides on notes and practical considerations for data mining researchers. Retrieved August 30, 2019 from https://pdfs.semanticscholar.org/b328/52abb9e55424f2dfadefa4da74cbe194059c.pdf
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.
Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques, the Morgan Kaufmann series in data management systems (3rd ed.). Waltham, MA: Morgan Kaufmann.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference and prediction. New York: Springer.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
IBM. (2019a). IBM SPSS Modeler 18.2 algorithms guide. Retrieved December 16, 2019 from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.2/en/AlgorithmsGuide.pdf
IBM. (2019b). IBM SPSS Modeler 18.2 modeling nodes. Retrieved December 16, 2019 from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.2/en/ModelerModelingNodes.pdf
IBM. (2019c). SPSS Modeler 18.2 source, process, and output nodes. Retrieved December 16, 2019 from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.2/en/ModelerSPOnodes.pdf
IBM. (2019d). IBM SPSS Analytics Server 3.2.1 overview. Retrieved December 16, 2019 from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/analyticserver/3.2.1/English/IBM_SPSS_Analytic_Server_3.2.1_Overview.pdf
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 103). New York: Springer.
Kanji, G. K. (2009). 100 statistical tests (3rd ed.). London: Sage (reprinted).
Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2), 119.
Kononenko, I., & Bratko, I. (1991). Information-based evaluation criterion for classifier's performance. Machine Learning, 6, 67–80.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer.
Lantz, B. (2013). Machine learning with R: Learn how to use R to apply powerful machine learning methods and gain an insight into real-world applications. Open source. Community experience distilled.
Li, T., Zhu, S., & Ogihara, M. (2006). Using discriminant analysis for multi-class classification: An experimental investigation. Knowledge and Information Systems, 10, 453–472.
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815–840.
Machine Learning Repository. (1998). Optical recognition of handwritten digits. Retrieved 2015 from https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
Moro, S., Laureano, R., & Cortez, P. (2011). Using data mining for bank direct marketing: An application of the CRISP-DM methodology. In Proceedings of the European Simulation and Modelling Conference – ESM'2011 (pp. 117–121). Guimarães.
Niedermeyer, E., Schomer, D. L., & Lopes da Silva, F. H. (2011). Niedermeyer's electroencephalography: Basic principles, clinical applications, and related fields (6th ed.). Philadelphia: Wolters Kluwer.
Nielsen, M. A. (2018). Neural networks and deep learning. Determination Press.
Oh, S.-H., Lee, Y.-R., & Kim, H.-N. (2014). A novel EEG feature extraction method using Hjorth parameter. International Journal of Electronics and Electrical Engineering, 2(2), 106–110.
Peterson, L. E. (2009). K-nearest neighbor. Scholarpedia, 4, 1883.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning, The Morgan Kaufmann series in machine learning. San Mateo, CA: Morgan Kaufmann.
R Core Team. (2014). R: A language and environment for statistical computing. http://www.R-project.org/
Rivest, R. (1987). Learning decision lists. Machine Learning, 2(3), 229–246.
Rokach, L., & Maimon, O. (2008). Data mining with decision trees: Theory and applications. World Scientific Publishing. https://doi.org/10.1142/9789812771728_0001
RStudio Team. (2015). RStudio: Integrated development environment for R. http://www.rstudio.com/
Runkler, T. A. (2012). Data analytics: Models and algorithms for intelligent data analysis. Wiesbaden: Springer Vieweg.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond, adaptive computation and machine learning. Cambridge, MA: MIT Press.
Sever, M., Lajovic, J., & Rajer, B. (2005). Robustness of the Fisher's discriminant function to skew-curved normal distribution. Advances in Methodology and Statistics/Metodološki zvezki, 2, 231–242.
Tuffery, S. (2011). Data mining and statistics for decision making, Wiley series in computational statistics. Chichester: Wiley.
Warner, R. M. (2013). Applied statistics: From bivariate through multivariate techniques (2nd ed.). Sage Publications.
Welch, B. L. (1939). Note on discriminant functions. Biometrika, 31, 218–220.
Wikipedia. (2019). Receiver operating characteristic. Retrieved August 28, 2019 from https://en.wikipedia.org/wiki/Receiver_operating_characteristic
Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 87(23), 9193–9196.
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37.
Zhou, Z.-H. (2012). Ensemble methods: Foundations and algorithms (Chapman & Hall/CRC machine learning & pattern recognition series). Boca Raton, FL: Taylor & Francis.
9
Using R with the Modeler
After finishing this chapter, the reader is able to:
1. Explain how to connect the IBM SPSS Modeler with R and why this can be helpful,
2. Describe the advantages of implementing R features in a Modeler stream,
3. Extend a stream by using the correct node to incorporate R procedures, as well as
4. Use R features to determine the best transformation of a variable toward normality.
9.1
Advantages of R with the Modeler
The SPSS Modeler provides a wide range of algorithms, procedures, and options to build statistical models. In most instances, its options for preparing data and creating models are appropriate, easy to understand, and intuitive. Why, then, does IBM offer the user the option of implementing R functionalities in the SPSS Modeler graphical environment? There are several answers to this question:
1. It allows users with R knowledge to switch to the SPSS Modeler and to present the analysis process in a more structured way.
2. The user can sometimes modify data more easily by using the R language.
3. The variety of advanced and flexible graphical functions provided by R can be used to visualize the data in more meaningful plots.
4. As with any other software, the SPSS Modeler and R offer different options to analyze data and to create models. Sometimes it may also be helpful to assess the fit of a model by using R.
Fig. 9.1 SPSS Modeler and R interaction
5. R can be extended by a wide range of libraries that researchers all over the world have implemented. In this way, R is constantly updated, easier to modify, and better at coping with specific modeling challenges.
6. Embedding R in the SPSS Modeler has the overall benefit of combining two powerful analytics software packages, so the strengths of each can be used in an analysis. Each statistical software package has its advantages, and the option to use R functionalities within the IBM SPSS Modeler gives the user the chance to look at the same data from different angles and to use the best method provided by either package.
The aim of this chapter is to explain the most important steps in using R functionalities and implementing the correct code in the SPSS Modeler. We will have a look at how to install R and the IBM SPSS R Essentials. Furthermore, we will discuss the R nodes of the Modeler that use the R functionalities and present the results to the user. Figure 9.1 depicts the interaction of both programs, which access the same dataset "test_scores". The authors do not intend to explain the details of the R language here, because its overwhelming number of options and functionalities is beyond the scope of this book.
9.2
Connecting with R
In order to use R with the Modeler, we must install the IBM SPSS Modeler Essentials for R. This is the Modeler toolbox that connects with R, as shown in Fig. 9.1. It links the data of the Modeler and of R, so that both applications can access and exchange the data. Here, we present the steps to set up the IBM SPSS R Essentials. Additionally, we use a stream to test the connection with the R engine. A detailed description of the installation procedure can also be found in IBM (2019).
Assumptions
1. The R Essentials, and therefore R, can only be used with the Professional or Premium version of the Modeler.
2. R version 3.3.3 must be installed on the computer, and the folder of this installation must be known, e.g., “C:\Program Files\R\R-3.3.3”. DOWNLOAD: http://cran.r-project.org/bin/windows/base/old/3.3.3/
3. The folder “C:\Program Files\IBM\SPSS\Modeler\18.2\ext\bin\pasw.rstats” must exist.
4. Run the R Essentials setup procedure as Administrator! Right-click the downloaded file and choose Run as Administrator.
"
In order to use R with the Modeler, the “IBM SPSS Modeler—Essentials for R” must be installed. The reader should distinguish between “IBM SPSS Statistics—Essentials for R” and “IBM SPSS Modeler—Essentials for R”; the latter must be used. Furthermore, it is essential to start the setup program as administrator! Details can be found in the following step-by-step description and in IBM (2019).
Set-Up Procedure for the R Essentials
If all of these requirements are fulfilled, we can start the setup procedure:
1. Download the “IBM SPSS Modeler—Essentials for R”. See IBM (2019). The version of the Essentials must correspond to the version of the Modeler. Depending on the operating system and the Modeler version, we must make sure to use the correct 32- or 64-bit version.
2. We must make sure not to start the installation program directly after using the IBM download program. Instead, we strongly recommend making a note of the folder where the download is saved and terminating the original download procedure after the file has been saved. Then we navigate to the folder with the setup program and start the setup as administrator. To do so, we click the file with the right mouse button and choose the option “Run as Administrator”.
3. After unzipping the files, the setup program comes up and asks us to choose the correct language (Fig. 9.2).
4. We read the introduction and accept the license agreement.
5. Then we make sure to define the correct R folder (Fig. 9.3).
6. As suggested by the setup program, we have to determine the R program folder. In the previous steps, we verified that this folder exists. Figure 9.4 shows an example. The user may find that the standard folder offered in this dialog window is not correct and must be modified.
7. We carefully check the summary of the settings as shown in Fig. 9.5.
Fig. 9.2 Setup process “IBM SPSS Modeler—Essentials for R” initial step
Fig. 9.3 Setup process “IBM SPSS Modeler—Essentials for R”—define the R folder
8. At the end, the setup program tells us that the procedure was successfully completed (Fig. 9.6).
Fig. 9.4 Setup process “IBM SPSS Modeler—Essentials for R”—define the pasw.rstats folder
Fig. 9.5 Setup process “IBM SPSS Modeler—Essentials for R”—Pre-Installation summary
Fig. 9.6 Setup process “IBM SPSS Modeler—Essentials for R”—Installation summary
9.3
Test the SPSS Modeler Connection to R
Description of the model
Stream name: R_Connect_Test.str
Based on dataset: None
Stream structure: see Fig. 9.7
To test that the R Essentials were installed successfully and that the R engine can be used from the Modeler, we suggest taking the following steps:
1. We open the stream “R_Connect_Test.str”. In Fig. 9.7, we can see that there is a User Input node as well as an Extension Transform node.
Fig. 9.7 Stream to test the R essentials
Fig. 9.8 Table node to show the defined variables
2. In the User Input node, a variable called “Test_Variable” is defined, and its value is simply 1. We used this very simple node so that we do not have to connect the stream to a more complicated data source, whose link could be missing and would make the stream harder to use.
3. If we right-click (!) the left Table node, we can use “Run” to see which variables are defined so far (Fig. 9.8).
4. As expected, there is one variable and one record. The value of the variable “Test_Variable” is 1.
5. We can close the window with “OK”.
6. To finish the procedure, we right-click (!) the right Table node (see Fig. 9.9) and use “Run” to start the calculation procedure.
7. The table is modified: the new value is simply “Test_Variable + 1”. If we can see the new value as shown in Fig. 9.10, then the Modeler is successfully connected with R.
"
The Extension Transform node enables R to grab the SPSS Modeler data and to modify the values stored in an object called “modelerData” using a script.
Fig. 9.9 Stream to test the R essentials
Fig. 9.10 Results calculated by the R engine
"
Not all operations, e.g., sorting or aggregation, can be performed in the R nodes of the Modeler. For further details, see also IBM (2014).
We have not yet covered the usage of the R language in the Extension Transform node itself. If we double-click the Transform node in Fig. 9.9, we find the first short R script, as shown in Fig. 9.11.
Fig. 9.11 R code in the Extension Transform node
1. The table or data frame “modelerData” is the R object automatically linked to the SPSS Modeler dataset.
2. By using “modelerData$Test_Variable”, we address the column “Test_Variable”.
3. We increase the values in this column by 1.
The SPSS Modeler Essentials link the two objects “modelerData” and “modelerDataModel” from the Modeler to R and back. As shown in Fig. 9.1, the Modeler copies the information to “modelerData”. The R script modifies the value(s), and in the Modeler, we can see the new values in the stream by looking at the Table node. Any other variables defined in R will not be recognized in the Modeler. The object “modelerDataModel” contains the structure of the object “modelerData”. See also IBM (2020). Because the structure of “modelerData” is not being modified here, the object “modelerDataModel” does not have to be addressed in the script. As long as we only modify the previously defined variable and do not add any columns or change the name of a column, we do not have to add more commands to the script. In the next example, we will show how to deal with data frames modified by R and how to make sure we get the correct values when showing the results in the Modeler.
"
The object “modelerData” links the data to be exchanged between the Modeler and R. So, in the R script, “modelerData” must be addressed to get the input values. Furthermore, values that are modified or variables that are created in R must be part of “modelerData”.
"
The object “modelerDataModel” contains the structure of the object “modelerData”. See IBM (2020). In particular, for each column of “modelerData”, the column name and its scale type are defined. As long as the structure of “modelerData” is not changed, the object “modelerDataModel” does not have to be modified.
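To make this concrete: in the test stream, the whole transformation boils down to a single assignment. The following is a sketch along the lines of the script shown in Fig. 9.11; the column name “Test_Variable” is the one defined in the User Input node above.

# Sketch of the Extension Transform node script in the test stream.
# modelerData is the data frame the Modeler passes to R; because no column is
# added or renamed, modelerDataModel does not need to be touched.
modelerData$Test_Variable <- modelerData$Test_Variable + 1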
9.4
Calculating New Variables in R
Description of the model
Stream name: R_salary_and_bonus.str
Based on dataset: salary_simple.sav
Stream structure: (stream diagram)
Related exercises: 1
Using a new stream, we will now look at the data transport mechanism from the Modeler to R and back. We describe the analysis of the stream step-by-step.
1. We show the predefined variables and their values in the dataset “salary_simple.sav” by double-clicking on the Table node on the left and then clicking “Run”. We find 5000 values in the column “salary” (see Fig. 9.12). We can close the window of the Table node now.
Fig. 9.12 Table node with the original values of “salary_simple.sav”
Fig. 9.13 Type node settings
2. In the Type node, we can see that the variable “salary” is defined as continuous and metrical, with the role “Input”. This is also shown in Fig. 9.13.
3. Now we double-click on the Extension Transform node. In the dialog window shown in Fig. 9.14, we can see the R script that transforms the data. The commands used here are explained in Table 9.1. The tab “Console Output” in the Extension Transform node shows the internal calculation results of R. After running the Transform node, the user finds any error message below the last line shown in Fig. 9.15, which helps to identify the R command that must be modified.
4. We can see the values of the three variables “salary”, “bonus”, and “new_salary” in the Table node at the end of the stream (see Fig. 9.16).
"
The Extension Transform node can be used to calculate new variables.
"
The objects “modelerData” and “modelerDataModel” are used to exchange values or information. All other variables defined in R cannot be found or used in the Modeler later.
"
The object “modelerDataModel” is needed for the SPSS Modeler to convert the data from an R object back to a SPSS Modeler data table. If new columns are defined, their name, type, and role must be defined in “modelerDataModel” too.
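To illustrate this rule, a script of the kind used in this stream could look as follows. This is only a sketch and not the exact code explained in Table 9.1: the flat 10% bonus rate is an assumption chosen for the example, the helper names bonusField and newSalaryField are arbitrary, and the field attributes follow the pattern used in the IBM R Essentials examples.

# Sketch of an Extension Transform node script that creates two new columns.
# Assumption: a flat 10% bonus; the actual stream may use a different rule.
modelerData$bonus <- 0.1 * modelerData$salary
modelerData$new_salary <- modelerData$salary + modelerData$bonus

# Because the structure of modelerData has changed, modelerDataModel must
# describe the new columns as well (attribute order as in the IBM examples:
# name, label, storage, measure, format, role).
bonusField <- c(fieldName = "bonus", fieldLabel = "", fieldStorage = "real",
                fieldMeasure = "", fieldFormat = "", fieldRole = "")
newSalaryField <- c(fieldName = "new_salary", fieldLabel = "", fieldStorage = "real",
                    fieldMeasure = "", fieldFormat = "", fieldRole = "")
modelerDataModel <- data.frame(modelerDataModel, bonusField, newSalaryField)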
Fig. 9.14 R Script in the Extension Transform node
"
In the Extension Transform node, the dialog window in the tab “Console Output” helps the user to identify R commands that are not correct and must be modified.
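The same tab can be used for a simple form of debugging. Since it shows the output of the R engine, adding inspection commands such as the following to the script makes their results visible there; this is a sketch using base R functions and is not part of the stream's own code.

# Inspect the data handed over by the Modeler; the output should appear in the
# "Console Output" tab of the Extension Transform node after running the node.
str(modelerData)
print(head(modelerData$salary))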
Command 1: # get the old values in the data frame “modelerData”
temp_data