Machine Learning For Business Analytics: Concepts, Techniques, and Applications in R [2 ed.] 1119835178, 9781119835172

MACHINE LEARNING FOR BUSINESS ANALYTICS
Machine learning ―also known as data mining or data analytics― is a fundamental ...

English | Pages: [870] | Year: 2023


Table of contents:
Cover
Title Page
Copyright
Dedication
Foreword by Ravi Bapna
Foreword by Gareth James
Preface to the Second R Edition
Acknowledgments
PART I: Preliminaries
CHAPTER 1: Introduction
1.1 WHAT IS BUSINESS ANALYTICS?
1.2 WHAT IS MACHINE LEARNING?
1.3 MACHINE LEARNING, AI, AND RELATED TERMS
1.4 BIG DATA
1.5 DATA SCIENCE
1.6 WHY ARE THERE SO MANY DIFFERENT METHODS?
1.7 TERMINOLOGY AND NOTATION
1.8 ROAD MAPS TO THIS BOOK
CHAPTER 2: Overview of the Machine Learning Process
2.1 INTRODUCTION
2.2 CORE IDEAS IN MACHINE LEARNING
2.3 THE STEPS IN A MACHINE LEARNING PROJECT
2.4 PRELIMINARY STEPS
2.5 PREDICTIVE POWER AND OVERFITTING
2.6 BUILDING A PREDICTIVE MODEL
2.7 USING R FOR MACHINE LEARNING ON A LOCAL MACHINE
2.8 AUTOMATING MACHINE LEARNING SOLUTIONS
2.9 ETHICAL PRACTICE IN MACHINE LEARNING
PROBLEMS
NOTES
PART II: Data Exploration and Dimension Reduction
CHAPTER 3: Data Visualization
3.1 USES OF DATA VISUALIZATION
3.2 DATA EXAMPLES
3.3 BASIC CHARTS: BAR CHARTS, LINE CHARTS, AND SCATTER PLOTS
3.4 MULTIDIMENSIONAL VISUALIZATION
3.5 SPECIALIZED VISUALIZATIONS
3.6 SUMMARY: MAJOR VISUALIZATIONS AND OPERATIONS, BY MACHINE LEARNING GOAL
PROBLEMS
NOTES
CHAPTER 4: Dimension Reduction
4.1 INTRODUCTION
4.2 CURSE OF DIMENSIONALITY
4.3 PRACTICAL CONSIDERATIONS
4.4 DATA SUMMARIES
4.5 CORRELATION ANALYSIS
4.6 REDUCING THE NUMBER OF CATEGORIES IN CATEGORICAL VARIABLES
4.7 CONVERTING A CATEGORICAL VARIABLE TO A NUMERICAL VARIABLE
4.8 PRINCIPAL COMPONENT ANALYSIS
4.9 DIMENSION REDUCTION USING REGRESSION MODELS
4.10 DIMENSION REDUCTION USING CLASSIFICATION AND REGRESSION TREES
PROBLEMS
Note
PART III: Performance Evaluation
CHAPTER 5: Evaluating Predictive Performance
5.1 INTRODUCTION
5.2 EVALUATING PREDICTIVE PERFORMANCE
5.3 JUDGING CLASSIFIER PERFORMANCE
5.4 JUDGING RANKING PERFORMANCE
5.5 OVERSAMPLING
PROBLEMS
NOTES
PART IV: Prediction and Classification Methods
CHAPTER 6: Multiple Linear Regression
6.1 INTRODUCTION
6.2 EXPLANATORY VS. PREDICTIVE MODELING
6.3 ESTIMATING THE REGRESSION EQUATION AND PREDICTION
6.4 VARIABLE SELECTION IN LINEAR REGRESSION
PROBLEMS
NOTES
CHAPTER 7: k-Nearest Neighbors (k-NN)
7.1 THE K-NN CLASSIFIER (CATEGORICAL OUTCOME)
7.2 K-NN FOR A NUMERICAL OUTCOME
7.3 ADVANTAGES AND SHORTCOMINGS OF K-NN ALGORITHMS
PROBLEMS
NOTES
CHAPTER 8: The Naive Bayes Classifier
8.1 INTRODUCTION
8.2 APPLYING THE FULL (EXACT) BAYESIAN CLASSIFIER
8.3 SOLUTION: NAIVE BAYES
8.4 ADVANTAGES AND SHORTCOMINGS OF THE NAIVE BAYES CLASSIFIER
PROBLEMS
CHAPTER 9: Classification and Regression Trees
9.1 INTRODUCTION
9.2 CLASSIFICATION TREES
9.3 EVALUATING THE PERFORMANCE OF A CLASSIFICATION TREE
9.4 AVOIDING OVERFITTING
9.5 CLASSIFICATION RULES FROM TREES
9.6 CLASSIFICATION TREES FOR MORE THAN TWO CLASSES
9.7 REGRESSION TREES
9.8 ADVANTAGES AND WEAKNESSES OF A TREE
9.9 IMPROVING PREDICTION: RANDOM FORESTS AND BOOSTED TREES
PROBLEMS
NOTES
CHAPTER 10: Logistic Regression
10.1 INTRODUCTION
10.2 THE LOGISTIC REGRESSION MODEL
10.3 EXAMPLE: ACCEPTANCE OF PERSONAL LOAN
10.4 EVALUATING CLASSIFICATION PERFORMANCE
10.5 VARIABLE SELECTION
10.6 LOGISTIC REGRESSION FOR MULTI‐CLASS CLASSIFICATION
10.7 EXAMPLE OF COMPLETE ANALYSIS: PREDICTING DELAYED FLIGHTS
PROBLEMS
NOTES
CHAPTER 11: Neural Nets
11.1 INTRODUCTION
11.2 CONCEPT AND STRUCTURE OF A NEURAL NETWORK
11.3 FITTING A NETWORK TO DATA
11.4 REQUIRED USER INPUT
11.5 EXPLORING THE RELATIONSHIP BETWEEN PREDICTORS AND OUTCOME
11.6 DEEP LEARNING
11.7 ADVANTAGES AND WEAKNESSES OF NEURAL NETWORKS
PROBLEMS
NOTES
CHAPTER 12: Discriminant Analysis
12.1 INTRODUCTION
12.2 DISTANCE OF A RECORD FROM A CLASS
12.3 FISHER'S LINEAR CLASSIFICATION FUNCTIONS
12.4 CLASSIFICATION PERFORMANCE OF DISCRIMINANT ANALYSIS
12.5 PRIOR PROBABILITIES
12.6 UNEQUAL MISCLASSIFICATION COSTS
12.7 CLASSIFYING MORE THAN TWO CLASSES
12.8 ADVANTAGES AND WEAKNESSES
PROBLEMS
NOTES
CHAPTER 13: Generating, Comparing, and Combining Multiple Models
13.1 ENSEMBLES
13.2 AUTOMATED MACHINE LEARNING (AUTOML)
13.3 EXPLAINING MODEL PREDICTIONS
13.4 SUMMARY
PROBLEMS
NOTES
PART V: Intervention and User Feedback
CHAPTER 14: Interventions: Experiments, Uplift Models, and Reinforcement Learning
14.1 A/B TESTING
14.2 UPLIFT (PERSUASION) MODELING
14.3 REINFORCEMENT LEARNING
14.4 SUMMARY
PROBLEMS
NOTES
PART VI: Mining Relationships Among Records
CHAPTER 15: Association Rules and Collaborative Filtering
15.1 ASSOCIATION RULES
15.2 COLLABORATIVE FILTERING
15.3 SUMMARY
PROBLEMS
NOTES
CHAPTER 16: Cluster Analysis
16.1 INTRODUCTION
16.2 MEASURING DISTANCE BETWEEN TWO RECORDS
16.3 MEASURING DISTANCE BETWEEN TWO CLUSTERS
16.4 HIERARCHICAL (AGGLOMERATIVE) CLUSTERING
16.5 NON-HIERARCHICAL CLUSTERING: THE k-MEANS ALGORITHM
PROBLEMS
PART VII: Forecasting Time Series
CHAPTER 17: Handling Time Series
17.1 INTRODUCTION
17.2 DESCRIPTIVE VS. PREDICTIVE MODELING
17.3 POPULAR FORECASTING METHODS IN BUSINESS
17.4 TIME SERIES COMPONENTS
17.5 DATA PARTITIONING AND PERFORMANCE EVALUATION
PROBLEMS
NOTES
CHAPTER 18: Regression‐Based Forecasting
18.1 A MODEL WITH TREND
18.2 A MODEL WITH SEASONALITY
18.3 A MODEL WITH TREND AND SEASONALITY
18.4 AUTOCORRELATION AND ARIMA MODELS
PROBLEMS
NOTES
CHAPTER 19: Smoothing and Deep Learning Methods for Forecasting
19.1 SMOOTHING METHODS: INTRODUCTION
19.2 MOVING AVERAGE
19.3 SIMPLE EXPONENTIAL SMOOTHING
19.4 ADVANCED EXPONENTIAL SMOOTHING
19.5 DEEP LEARNING FOR FORECASTING
PROBLEMS
NOTES
PART VIII: Data Analytics
CHAPTER 20: Social Network Analytics
20.1 INTRODUCTION
20.2 DIRECTED VS. UNDIRECTED NETWORKS
20.3 VISUALIZING AND ANALYZING NETWORKS
20.4 SOCIAL DATA METRICS AND TAXONOMY
20.5 USING NETWORK METRICS IN PREDICTION AND CLASSIFICATION
20.6 COLLECTING SOCIAL NETWORK DATA WITH R
20.7 ADVANTAGES AND DISADVANTAGES
PROBLEMS
NOTES
CHAPTER 21: Text Mining
21.1 INTRODUCTION
21.2 THE TABULAR REPRESENTATION OF TEXT: TERM–DOCUMENT MATRIX AND “BAG‐OF‐WORDS”
21.3 BAG‐OF‐WORDS VS. MEANING EXTRACTION AT DOCUMENT LEVEL
21.4 PREPROCESSING THE TEXT
21.5 IMPLEMENTING MACHINE LEARNING METHODS
21.6 EXAMPLE: ONLINE DISCUSSIONS ON AUTOS AND ELECTRONICS
21.7 EXAMPLE: SENTIMENT ANALYSIS OF MOVIE REVIEWS
21.8 SUMMARY
PROBLEMS
NOTES
CHAPTER 22: Responsible Data Science
22.1 INTRODUCTION
22.2 UNINTENTIONAL HARM
22.3 LEGAL CONSIDERATIONS
22.4 PRINCIPLES OF RESPONSIBLE DATA SCIENCE
22.5 A RESPONSIBLE DATA SCIENCE FRAMEWORK
22.6 DOCUMENTATION TOOLS
22.7 EXAMPLE: APPLYING THE RDS FRAMEWORK TO THE COMPAS EXAMPLE
22.8 SUMMARY
PROBLEMS
NOTES
PART IX: Cases
CHAPTER 23: Cases
23.1 CHARLES BOOK CLUB
23.2 GERMAN CREDIT
23.3 TAYKO SOFTWARE CATALOGER
23.4 POLITICAL PERSUASION
23.5 TAXI CANCELLATIONS
23.6 SEGMENTING CONSUMERS OF BATH SOAP
23.7 DIRECT‐MAIL FUNDRAISING
23.8 CATALOG CROSS‐SELLING
23.9 TIME SERIES CASE: FORECASTING PUBLIC TRANSPORTATION DEMAND
23.10 LOAN APPROVAL
NOTES
References
R Packages Used in the Book
Data Files Used in the Book
Index
End User License Agreement

List of Tables

Chapter 1
TABLE 1.1 ORGANIZATION OF MACHINE LEARNING METHODS IN THIS BOOK, ACCORDING ...
Chapter 2
TABLE 2.1 FIRST 10 RECORDS IN THE WEST ROXBURY HOME VALUES DATASET
TABLE 2.2 DESCRIPTION OF VARIABLES IN WEST ROXBURY (BOSTON) HOME VALUE DATA...
TABLE 2.3 WORKING WITH FILES IN R
TABLE 2.4 SAMPLING IN R
TABLE 2.5 REVIEWING VARIABLES IN R
TABLE 2.6 CREATING DUMMY VARIABLES IN R
TABLE 2.7 IMPUTING MISSING DATA
TABLE 2.8 HYPOTHETICAL DATA ON ADVERTISING EXPENDITURES AND SUBSEQUENT SALE...
TABLE 2.9 DATA PARTITIONING IN R
TABLE 2.10 OUTLIER IN WEST ROXBURY DATA
TABLE 2.11 CLEANING AND PREPROCESSING DATA
TABLE 2.12 TRAINING A REGRESSION MODEL AND GENERATING PREDICTIONS (FITTED V...
TABLE 2.13 GENERATING PREDICTIONS FOR THE HOLDOUT DATA
TABLE 2.14 PREDICTION ERROR METRICS FOR TRAINING AND HOLDOUT DATA (ERROR FI...
TABLE 2.15 DATA FRAME WITH THREE RECORDS TO BE SCORED
TABLE 2.16 SAMPLE FROM A DATABASE OF CREDIT APPLICATIONS
TABLE 2.17 SAMPLE FROM A BANK DATABASE
TABLE 2.18 ...
Chapter 3
TABLE 3.1 DESCRIPTION OF VARIABLES IN BOSTON HOUSING DATASET
TABLE 3.2 FIRST NINE RECORDS IN THE BOSTON HOUSING DATA
Chapter 4
TABLE 4.1 DESCRIPTION OF VARIABLES IN THE BOSTON HOUSING DATASET
TABLE 4.2 FIRST NINE RECORDS IN THE BOSTON HOUSING DATA
TABLE 4.3 SUMMARY STATISTICS FOR THE BOSTON HOUSING DATA
TABLE 4.4 CORRELATION TABLE FOR BOSTON HOUSING DATA
TABLE 4.5 NUMBER OF NEIGHBORHOODS THAT BOUND THE CHARLES RIVER VS. THOSE TH...
TABLE 4.6 AVERAGE MEDV BY CHAS AND RM
TABLE 4.7 PIVOT TABLES IN R
TABLE 4.8 DESCRIPTION OF THE VARIABLES IN THE BREAKFAST CEREAL DATASET
TABLE 4.9 CEREAL CALORIES AND RATINGS
TABLE 4.10 PCA ON THE TWO VARIABLES CALORIES AND RATING
TABLE 4.11 PCA OUTPUT USING ALL 13 NUMERICAL VARIABLES IN THE BREAKFAST CER...
TABLE 4.12 PCA OUTPUT USING ALL NORMALIZED 13 NUMERICAL VARIABLES IN THE BR...
TABLE 4.13 PRINCIPAL COMPONENTS OF NON-NORMALIZED WINE DATA
Chapter 5
TABLE 5.1 PREDICTION ERROR METRICS FROM A MODEL FOR TOYOTA CAR PRICES. TRAI...
TABLE 5.2 CONFUSION MATRIX BASED ON 3000 RECORDS AND TWO CLASSES
TABLE 5.3 CONFUSION MATRIX: MEANING OF EACH CELL
TABLE 5.4 24 RECORDS WITH THEIR ACTUAL CLASS AND THE PROBABILITY (PROPENSIT...
TABLE 5.5 CONFUSION MATRICES BASED ON THRESHOLDS OF 0.5, 0.25, AND 0.75 (RI...
TABLE 5.6 RECORDS SORTED BY PROPENSITY OF OWNERSHIP (HIGH TO LOW) FOR THE M...
TABLE 5.7 PROPENSITIES AND ACTUAL CLASS MEMBERSHIP FOR HOLDOUT DATA
Chapter 6
TABLE 6.1 VARIABLES IN THE TOYOTA COROLLA EXAMPLE
TABLE 6.2 PRICES AND ATTRIBUTES FOR USED TOYOTA COROLLA CARS (SELECTED ROWS...
TABLE 6.3 LINEAR REGRESSION MODEL OF PRICE VS. CAR ATTRIBUTES
TABLE 6.4 PREDICTED PRICES (AND ERRORS) FOR 20 CARS IN HOLDOUT SET AND SUMM...
TABLE 6.5 CROSS-VALIDATION IN TOYOTA COROLLA EXAMPLE
TABLE 6.6 EXHAUSTIVE SEARCH FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMP...
TABLE 6.7 SUBSET SELECTION ALGORITHMS REDUCING PREDICTORS IN TOYOTA COROLLA...
TABLE 6.8 RIDGE REGRESSION FOR SHRINKING PREDICTORS IN TOYOTA COROLLA EXAMP...
TABLE 6.9 LASSO REGRESSION FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMPL...
TABLE 6.10 COMPARING PERFORMANCE OF THE DIFFERENT VARIABLE SELECTION METHOD...
TABLE 6.11 DESCRIPTION OF VARIABLES FOR BOSTON HOUSING EXAMPLE
TABLE 6.12 DESCRIPTION OF VARIABLES FOR TAYKO SOFTWARE EXAMPLE
TABLE 6.13 DESCRIPTION OF VARIABLES FOR AIRFARE EXAMPLE
Chapter 7
TABLE 7.1 LOT SIZE, INCOME, AND OWNERSHIP OF A RIDING MOWER FOR 24 HOUSEHOL...
TABLE 7.2 RUNNING K-NN
TABLE 7.3 ACCURACY (OR CORRECT RATE) OF K-NN PREDICTIONS IN VALIDATION SET ...
TABLE 7.4 CLASSIFYING A NEW HOUSEHOLD USING THE “BEST K” = 7
Chapter 8
TABLE 8.1 PIVOT TABLE FOR FINANCIAL REPORTING EXAMPLE
TABLE 8.2 INFORMATION ON 10 COMPANIES
TABLE 8.3 DESCRIPTION OF VARIABLES FOR FLIGHT DELAYS EXAMPLE
TABLE 8.4 NAIVE BAYES CLASSIFIER APPLIED TO FLIGHT DELAYS (TRAINING) DATA
TABLE 8.5 PIVOT TABLE OF FLIGHT STATUS BY DESTINATION AIRPORT (TRAINING DAT...
TABLE 8.6 SCORING THE EXAMPLE FLIGHT (PROBABILITY AND CLASS)
TABLE 8.7 CONFUSION MATRICES FOR FLIGHT DELAY USING A NAIVE BAYES CLASSIFIE...
TABLE 8.8 MEANS AND STANDARD DEVIATIONS OF DISTANCE (A CONTINUOUS PREDICTOR...
Chapter 9
TABLE 9.1 LOT SIZE, INCOME, AND OWNERSHIP OF A RIDING MOWER FOR 24 HOUSEHOL...
TABLE 9.2 SAMPLE OF DATA FOR 20 CUSTOMERS OF UNIVERSAL BANK
TABLE 9.3 CONFUSION MATRICES AND ACCURACY FOR THE DEFAULT (SMALL) AND DEEPE...
TABLE 9.4 TABLE OF COMPLEXITY PARAMETER (CP) VALUES AND ASSOCIATED TREE ERROR...
TABLE 9.5 EXTRACTING THE RULES FROM THE BEST-PRUNED TREE
TABLE 9.6 SPECIFICATIONS FOR A PARTICULAR TOYOTA COROLLA
Chapter 10
TABLE 10.1 DESCRIPTION OF PREDICTORS FOR ACCEPTANCE OF PERSONAL LOAN EXAMPL...
TABLE 10.2 LOGISTIC REGRESSION MODEL FOR LOAN ACCEPTANCE (TRAINING DATA)
TABLE 10.3 PROPENSITIES FOR THE FIRST FIVE CUSTOMERS IN HOLDOUT DATA
TABLE 10.4 ORDINAL AND NOMINAL MULTINOMIAL REGRESSION IN R
TABLE 10.5 DESCRIPTION OF PREDICTORS FOR FLIGHT DELAYS EXAMPLE
TABLE 10.6 SAMPLE OF 20 FLIGHTS
TABLE 10.7 ESTIMATED LOGISTIC REGRESSION MODEL FOR DELAYED FLIGHTS (BASED O...
TABLE 10.8 NUMBER OF FLIGHTS BY CARRIER AND ORIGIN
TABLE 10.9 LOGISTIC REGRESSION MODEL WITH FEWER PREDICTORS
TABLE 10.10 REGULARIZED LOGISTIC REGRESSION MODEL FOR DELAYED FLIGHTS
Chapter 11
TABLE 11.1 TINY EXAMPLE ON TASTING SCORES FOR SIX CONSUMERS AND TWO PREDICT...
TABLE 11.2 NEURAL NETWORK WITH A SINGLE HIDDEN LAYER (THREE NODES) FOR THE ...
TABLE 11.3 CONFUSION MATRIX FOR THE TINY EXAMPLE
TABLE 11.4 SUBSET FROM THE ACCIDENTS DATA, FOR A HIGH-FATALITY REGION
TABLE 11.5 DESCRIPTION OF VARIABLES FOR AUTOMOBILE ACCIDENT EXAMPLE
TABLE 11.6 A NEURAL NETWORK WITH TWO NODES IN THE HIDDEN LAYER (ACCIDENTS D...
TABLE 11.7 DATA PREPROCESSING OF FASHION MNIST FOR DEEP LEARNING
TABLE 11.8 MODEL DEFINITION AND TRAINING OF FASHION MNIST CONVOLUTIONAL NET...
TABLE 11.9 PREDICTING IMAGES
TABLE 11.10 DATA FOR CREDIT CARD EXAMPLE AND VARIABLE DESCRIPTIONS
Chapter 12
TABLE 12.1 DISCRIMINANT ANALYSIS FOR RIDING-MOWER DATA, DISPLAYING THE ESTI...
TABLE 12.2 CLASSIFICATION SCORES, PREDICTED CLASSES, AND PROBABILITIES FOR ...
TABLE 12.3 DISCRIMINANT ANALYSIS FOR RIDING-MOWER DATA WITH MODIFIED PRIOR ...
TABLE 12.4 SAMPLE OF 20 AUTOMOBILE ACCIDENTS FROM THE 2001 DEPARTMENT OF TR...
TABLE 12.5 DISCRIMINANT ANALYSIS FOR THE THREE-CLASS INJURY EXAMPLE: CLASSI...
TABLE 12.6 CLASSIFICATION SCORES, MEMBERSHIP PROBABILITIES, AND CLASSIFICAT...
TABLE 12.7 CLASSIFICATION AND MEMBERSHIP PROBABILITIES
Chapter 13
TABLE 13.1 EXAMPLE OF BAGGING AND BOOSTING CLASSIFICATION TREES ON THE PERS...
TABLE 13.2 DATA PREPROCESSING AND MAKING AVAILABLE IN H2O.AI
TABLE 13.3 TRAINING AND COMPARING AUTOML MODELS
TABLE 13.4 OVERVIEW OF AUTO MODEL RESULTS: COMPARING ACCURACY AND RUN TIME ...
Chapter 14
TABLE 14.1 EXAMPLE OF RAW DATA RESULTING FROM A/B TEST
TABLE 14.2 RESULTS FROM A/B TEST AT PHOTOTO, EVALUATING THE EFFECT OF A NEW...
TABLE 14.3 COMPUTING P-VALUES FOR T-TESTS IN R
TABLE 14.4 DATA ON VOTERS (SMALL SUBSET OF VARIABLES AND RECORDS) AND DATA ...
TABLE 14.5 RESULTS OF SENDING A PRO-DEMOCRATIC MESSAGE TO VOTERS
TABLE 14.6 OUTCOME VARIABLE (MOVED_AD) AND TREATMENT VARIABLE (MESSAGE) ADD...
TABLE 14.7 CLASSIFICATIONS AND PROPENSITIES FROM PREDICTIVE MODEL (SMALL EX...
TABLE 14.8 CLASSIFICATIONS AND PROPENSITIES FROM PREDICTIVE MODEL (SMALL EX...
TABLE 14.9 UPLIFT: CHANGE IN PROPENSITIES FROM SENDING MESSAGE VS. NOT SEND...
TABLE 14.10 UPLIFT IN R APPLIED TO THE VOTERS DATA
TABLE 14.11 USING A CONTEXTUAL MULTI-ARMED BANDIT (DATA PREPARATION)...
Chapter 15
TABLE 15.1 TRANSACTIONS DATABASE FOR PURCHASES OF DIFFERENT-COLORED CELLULA...
TABLE 15.2 PHONE FACEPLATE DATA IN BINARY INCIDENCE MATRIX FORMAT
TABLE 15.3 ITEMSETS WITH SUPPORT COUNT OF AT LEAST TWO
TABLE 15.4 BINARY INCIDENCE MATRIX, TRANSACTIONS DATABASE, AND RULES FOR FA...
TABLE 15.5 FIFTY TRANSACTIONS OF RANDOMLY ASSIGNED ITEMS
TABLE 15.6 ASSOCIATION RULES OUTPUT FOR RANDOM DATA
TABLE 15.7 SUBSET OF BOOK PURCHASE TRANSACTIONS IN BINARY MATRIX FORMAT
TABLE 15.8 RULES FOR BOOK PURCHASE TRANSACTIONS
TABLE 15.9 SCHEMATIC OF MATRIX FORMAT WITH RATING DATA
TABLE 15.10 SAMPLE OF RECORDS FROM THE NETFLIX PRIZE CONTEST, FOR A SUBSET ...
TABLE 15.11 DESCRIPTION OF VARIABLES IN MOVIELENS DATASET
TABLE 15.12 DATA PREPROCESSING FOR THE MOVIELENS DATA
TABLE 15.13 TRAIN RECOMMENDERLAB MODELS AND PREDICT NEW USERS
TABLE 15.14 EVALUATE PERFORMANCE OF RECOMMENDERLAB MODELS
TABLE 15.15 SAMPLE OF DATA ON SATELLITE RADIO CUSTOMERS
TABLE 15.16 DATA ON PURCHASES OF ONLINE STATISTICS COURSES
TABLE 15.17 EXCERPT FROM DATA ON COSMETICS PURCHASES IN BINARY MATRIX FORM...
TABLE 15.18 ASSOCIATION RULES FOR COSMETICS PURCHASES DATA
TABLE 15.19 RATINGS OF ONLINE STATISTICS COURSES: 4 = BEST, 1 = WORST, BLAN...
Chapter 16
TABLE 16.1 DATA ON 22 PUBLIC UTILITIES
TABLE 16.2 DISTANCE MATRIX BETWEEN PAIRS OF UTILITIES, USING EUCLIDEAN DIST...
TABLE 16.3 ORIGINAL AND NORMALIZED MEASUREMENTS FOR SALES AND FUEL COST
TABLE 16.4 DISTANCE MATRIX BETWEEN PAIRS OF UTILITIES, USING EUCLIDEAN DIST...
TABLE 16.5 DISTANCE MATRIX AFTER ARIZONA AND COMMONWEALTH CONSOLIDATION CLU...
TABLE 16.6 COMPUTING CLUSTER MEMBERSHIP BY “CUTTING” THE DENDROGRAM...
TABLE 16.7 DISTANCE OF EACH RECORD FROM EACH CENTROID
TABLE 16.8 DISTANCE OF EACH RECORD FROM EACH NEWLY CALCULATED CENTROID
TABLE 16.9 k-MEANS CLUSTERING OF 22 UTILITIES INTO CLUSTERS (SORTED BY CLU...
TABLE 16.10 CLUSTER CENTROIDS AND SQUARED DISTANCES FOR k-MEANS WITH ...
TABLE 16.11 EUCLIDEAN DISTANCE BETWEEN CLUSTER CENTROIDS
Chapter 17
TABLE 17.1 PREDICTIVE ACCURACY OF NAIVE AND SEASONAL NAIVE FORECASTS IN THE...
Chapter 18
TABLE 18.1 OUTCOME VARIABLE (MIDDLE) AND PREDICTOR VARIABLE (RIGHT) USED TO...
TABLE 18.2 SUMMARY OF OUTPUT FROM A LINEAR REGRESSION MODEL APPLIED TO THE ...
TABLE 18.3 NEW CATEGORICAL VARIABLE (RIGHT) TO BE USED (VIA DUMMIES) AS PRE...
TABLE 18.4 SUMMARY OF OUTPUT FROM FITTING ADDITIVE SEASONALITY TO THE AMTRA...
TABLE 18.5 SUMMARY OF OUTPUT FROM FITTING TREND AND SEASONALITY TO AMTRAK R...
TABLE 18.6 FIRST 24 MONTHS OF AMTRAK RIDERSHIP SERIES WITH LAG-1 AND LAG-2 ...
TABLE 18.7 OUTPUT FOR AR(1) MODEL ON RIDERSHIP RESIDUALS
TABLE 18.8 COMPUTING AUTOCORRELATIONS OF LAG-1 DIFFERENCED S&P500 MONTHLY C...
TABLE 18.9 OUTPUT FOR AR(1) MODEL ON S&P500 MONTHLY CLOSING PRICES
TABLE 18.10 REGRESSION MODEL FITTED TO TOYS “R” US TIME SERIES AND ITS PRED...
TABLE 18.11 OUTPUT FROM REGRESSION MODEL FIT TO DEPARTMENT STORE SALES IN T...
Chapter 19
TABLE 19.1 APPLYING MA TO THE RESIDUALS FROM THE REGRESSION MODEL (WHICH LA...
TABLE 19.2 SUMMARY OF HOLT–WINTERS EXPONENTIAL SMOOTHING APPLIED TO THE AMT...
TABLE 19.3 CONVERTING A TIME SERIES OF LENGTH t INTO A SEQUENCE OF SUB-SERI...
TABLE 19.4 PREPARATION OF TRAINING DATA FOR FORECASTING A SERIES WITH LSTM...
TABLE 19.5 DEFINING AN LSTM MODEL FOR FORECASTING A SERIES
TABLE 19.3 FORECASTS FOR TEST SERIES USING EXPONENTIAL SMOOTHING
Chapter 20
TABLE 20.1 EDGE LIST EXCERPT CORRESPONDING TO THE DRUG-LAUNDERING NETWORK I...
TABLE 20.2 ADJACENCY MATRIX EXCERPT CORRESPONDING TO THE TWITTER DATA IN FI...
TABLE 20.3 COMPUTING CENTRALITY IN R
TABLE 20.4 DEGREE DISTRIBUTION OF TINY LINKEDIN NETWORK
TABLE 20.5 COMPUTING NETWORK MEASURES IN R
TABLE 20.6 FOUR MEASUREMENTS FOR USERS A, B, C, AND D
TABLE 20.7 NORMALIZED MEASUREMENTS FOR USERS A, B, C, AND D
TABLE 20.8 EUCLIDEAN DISTANCE BETWEEN EACH PAIR OF USERS
TABLE 20.9 NETWORK METRICS
TABLE 20.10 COMBINING THE NETWORK AND NON-NETWORK METRICS
TABLE 20.11 TWITTER INTERFACE IN R
Chapter 21
TABLE 21.1 TERM–DOCUMENT MATRIX REPRESENTATION OF WORDS IN SENTENCES S1–S3...
TABLE 21.2 TERM–DOCUMENT MATRIX REPRESENTATION OF WORDS IN SENTENCES S1–S4...
TABLE 21.3 TOKENIZATION OF S1–S4 EXAMPLE
TABLE 21.4 STOPWORDS IN R
TABLE 21.5 TEXT REDUCTION OF S1–S4 (AFTER TOKENIZATION)
TABLE 21.6 TF-IDF MATRIX FOR S1–S4 EXAMPLE (AFTER TOKENIZATION AND TEXT RED...
TABLE 21.7 IMPORTING AND LABELING THE RECORDS, PREPROCESSING TEXT, AND PROD...
TABLE 21.8 FITTING A PREDICTIVE MODEL TO THE AUTOS AND ELECTRONICS DISCUSSI...
TABLE 21.9 PREPARE DATA FOR SENTIMENT ANALYSIS OF MOVIE REVIEWS DATA
TABLE 21.10 CREATE WORD AND SENTENCE VECTORS USING GLOVE
TABLE 21.11 TRAIN SENTIMENT ANALYSIS MODEL
Chapter 22
TABLE 22.1 LOGISTIC REGRESSION MODEL FOR COMPAS DATA
TABLE 22.2 LOGISTIC REGRESSION AND RANDOM FOREST MODELS FOR COMPAS DATA
TABLE 22.3 OVERALL AND BY-RACE PERFORMANCE OF LOGISTIC REGRESSION AND RANDO...
TABLE 22.4 TOP 6 MOST IMPORTANT FEATURES USING PERMUTATION FEATURE IMPORTAN...
TABLE 22.5 JIGSAW TOXICITY SCORES FOR CERTAIN PHRASES
Chapter 23
TABLE 23.1 LIST OF VARIABLES IN CHARLES BOOK CLUB DATASET
TABLE 23.2 FIRST FOUR RECORDS FROM GERMAN CREDIT DATASET
TABLE 23.3 OPPORTUNITY COST TABLE (DEUTSCHE MARKS)
TABLE 23.4 AVERAGE NET PROFIT (DEUTSCHE MARKS)
TABLE 23.5 VARIABLES FOR THE GERMAN CREDIT DATASET
TABLE 23.6 FIRST 10 RECORDS FROM TAYKO DATASET
TABLE 23.7 DESCRIPTION OF VARIABLES FOR TAYKO DATASET
TABLE 23.8 DESCRIPTION OF VARIABLES FOR EACH HOUSEHOLD
TABLE 23.9 DESCRIPTION OF VARIABLES FOR THE FUNDRAISING DATASET

List of Illustrations

Chapter 1
FIGURE 1.1 TWO METHODS FOR SEPARATING OWNERS FROM NONOWNERS
FIGURE 1.2 MACHINE LEARNING FROM A PROCESS PERSPECTIVE. NUMBERS IN PARENTHES...
FIGURE 1.3 RSTUDIO SCREEN
Chapter 2
FIGURE 2.1 SCHEMATIC OF THE DATA MODELING PROCESS
FIGURE 2.2 SCATTER PLOT FOR ADVERTISING AND SALES DATA
FIGURE 2.3 OVERFITTING: THIS FUNCTION FITS THE DATA WITH NO ERROR
FIGURE 2.4 THREE DATA PARTITIONS AND THEIR ROLE IN THE MACHINE LEARNING PROC...
FIGURE 2.5 LAYERS OF TOOLS SUPPORTING MACHINE LEARNING AUTOMATION
Chapter 3
FIGURE 3.1 BASIC PLOTS: LINE GRAPH (TOP LEFT), SCATTER PLOT (TOP RIGHT), BAR...
FIGURE 3.2 DISTRIBUTION PLOTS FOR NUMERICAL VARIABLE MEDV. LEFT: HISTOGRAM A...
FIGURE 3.3 SIDE-BY-SIDE BOXPLOTS FOR EXPLORING THE CAT.MEDV OUTPUT VARIABLE ...
FIGURE 3.4 HEATMAP OF A CORRELATION TABLE. DARK RED VALUES DENOTE STRONG POS...
FIGURE 3.5 HEATMAP OF MISSING VALUES IN A DATASET ON MOTOR VEHICLE COLLISION...
FIGURE 3.6 ADDING CATEGORICAL VARIABLES BY COLOR-CODING AND MULTIPLE PANELS....
FIGURE 3.7 SCATTER PLOT MATRIX FOR MEDV AND THREE NUMERICAL PREDICTORS
FIGURE 3.8 RESCALING CAN ENHANCE PLOTS AND REVEAL PATTERNS. LEFT: ORIGINAL S...
FIGURE 3.9 TIME SERIES LINE CHARTS USING DIFFERENT AGGREGATIONS (RIGHT PANEL...
FIGURE 3.10 SCATTER PLOT WITH LABELED POINTS
FIGURE 3.11 SCATTER PLOT OF LARGE DATASET WITH REDUCED MARKER SIZE, JITTERIN...
FIGURE 3.12 PARALLEL COORDINATES PLOT FOR BOSTON HOUSING DATA. EACH OF THE V...
FIGURE 3.13 MULTIPLE INTER-LINKED PLOTS IN A SINGLE VIEW. NOTE THE MARKED OB...
FIGURE 3.14 NETWORK PLOT OF EBAY SELLERS (DARK BLUE CIRCLES) AND BUYERS (LIG...
FIGURE 3.15 TREEMAP SHOWING NEARLY 11,000 EBAY AUCTIONS, ORGANIZED BY ITEM C...
FIGURE 3.16 MAP CHART OF STATISTICS.COM STUDENTS' AND INSTRUCTORS' LOCATIONS...
FIGURE 3.17 WORLD MAPS COMPARING “WELL-BEING” (TOP) TO GDP (BOTTOM). SHADING...
Chapter 4
FIGURE 4.1 DISTRIBUTION OF CAT.MEDV (BLACK DENOTES CAT.MEDV = 0) BY ZN. SIMI...
FIGURE 4.2 QUARTERLY REVENUES OF TOYS “R” US, 1992–1995
FIGURE 4.3 SAMPLE FROM THE 77 BREAKFAST CEREALS DATASET
FIGURE 4.4 SCATTER PLOT OF RATING VS. CALORIES FOR 77 BREAKFAST CEREALS, WIT...
FIGURE 4.5 SCATTER PLOT OF THE SECOND VS. FIRST PRINCIPAL COMPONENTS SCORES ...
Chapter 5
FIGURE 5.1 HISTOGRAMS AND BOXPLOTS OF TOYOTA PRICE PREDICTION ERRORS, FOR TR...
FIGURE 5.2 CUMULATIVE GAINS CHART (LEFT) AND DECILE LIFT CHART (RIGHT) FOR C...
FIGURE 5.3 HIGH (TOP) AND LOW (BOTTOM) LEVELS OF SEPARATION BETWEEN TWO CLAS...
FIGURE 5.4 PLOTTING ACCURACY AND OVERALL ERROR AS A FUNCTION OF THE THRESHOL...
FIGURE 5.5 ROC CURVE FOR RIDING-MOWERS EXAMPLE
FIGURE 5.6 PRECISION, RECALL, AND F1-SCORE AS A FUNCTION OF THE THRESHOLD VA...
FIGURE 5.7 CUMULATIVE GAINS CHARTS FOR THE MOWER EXAMPLE USING ROCR, CARET, ...
FIGURE 5.8 DECILE-WISE LIFT CHART
FIGURE 5.9 CUMULATIVE GAINS CURVE INCORPORATING COSTS
FIGURE 5.10 CLASSIFICATION ASSUMING EQUAL COSTS OF MISCLASSIFICATION
FIGURE 5.11 CLASSIFICATION ASSUMING UNEQUAL COSTS OF MISCLASSIFICATION
FIGURE 5.12 CLASSIFICATION USING OVERSAMPLING TO ACCOUNT FOR UNEQUAL COSTS
FIGURE 5.13 DECILE-WISE LIFT CHART FOR TRANSACTION DATA
FIGURE 5.14 CUMULATIVE GAINS AND DECILE-WISE LIFT CHARTS FOR SOFTWARE SERVIC...
Chapter 6
FIGURE 6.1 HISTOGRAM OF MODEL ERRORS (BASED ON HOLDOUT SET)
FIGURE 6.2 CROSS-VALIDATED RMSE OF TOYOTA PRICE PREDICTION MODELS AS A FUNCT...
Chapter 7
FIGURE 7.1 SCATTER PLOT OF LOT SIZE VS. INCOME FOR THE 14 HOUSEHOLDS IN THE ...
Chapter 8
FIGURE 8.1 LIFT CHART OF NAIVE BAYES CLASSIFIER APPLIED TO FLIGHT DELAYS DAT...
Chapter 9
FIGURE 9.1 EXAMPLE OF A TREE FOR CLASSIFYING BANK CUSTOMERS AS LOAN ACCEPTOR...
FIGURE 9.2 SCATTER PLOT OF LOT SIZE VS. INCOME FOR 24 OWNERS AND NONOWNERS O...
FIGURE 9.3 SPLITTING THE 24 RECORDS BY INCOME VALUE OF 60
FIGURE 9.4 VALUES OF THE GINI INDEX AND ENTROPY MEASURE FOR A TWO-CLASS CASE...
FIGURE 9.5 SPLITTING THE 24 RECORDS FIRST BY INCOME VALUE OF 60 AND THEN BY ...
FIGURE 9.6 FINAL STAGE OF RECURSIVE PARTITIONING; EACH RECTANGLE CONSISTING ...
FIGURE 9.7 TREE REPRESENTATION OF FIRST SPLIT (CORRESPONDS TO FIGURE 9.3)
FIGURE 9.8 TREE REPRESENTATION AFTER ALL SPLITS (CORRESPONDS TO FIGURE 9.6)....
FIGURE 9.9 DEFAULT CLASSIFICATION TREE FOR THE LOAN ACCEPTANCE DATA USING TH...
FIGURE 9.10 A FULL TREE FOR THE LOAN ACCEPTANCE DATA USING THE TRAINING SET ...
FIGURE 9.11 ERROR RATE AS A FUNCTION OF THE NUMBER OF SPLITS FOR TRAINING VS...
FIGURE 9.12 PRUNED CLASSIFICATION TREE FOR THE LOAN ACCEPTANCE DATA USING CP...
FIGURE 9.13 BEST-PRUNED TREE OBTAINED BY FITTING A FULL TREE TO THE TRAINING...
FIGURE 9.14 BEST-PRUNED REGRESSION TREE FOR TOYOTA COROLLA PRICES
FIGURE 9.15 SCATTER PLOT DESCRIBING A TWO-PREDICTOR CASE WITH TWO CLASSES. T...
FIGURE 9.16 VARIABLE IMPORTANCE PLOT FROM RANDOM FOREST (PERSONAL LOAN EXAMP...
FIGURE 9.17 ROC CURVES COMPARING PERFORMANCE OF BOOSTED TREE, RANDOM FOREST,...
FIGURE 9.18 ROC CURVES OF BOOSTED TREE AND FOCUSED BOOSTED TREE (RIGHT PLOT ...
Chapter 10
FIGURE 10.1 (A) ODDS AND (B) LOGIT AS A FUNCTION OF P
FIGURE 10.2 PLOT OF DATA POINTS (PERSONAL LOAN AS A FUNCTION OF INCOME) AND ...
FIGURE 10.3 (LEFT) CUMULATIVE GAINS CHART AND (RIGHT) DECILE-WISE LIFT CHART...
FIGURE 10.4 PROPORTION OF DELAYED FLIGHTS BY EACH OF THE SIX PREDICTORS. TIM...
FIGURE 10.5 PERCENT OF DELAYED FLIGHTS (DARKER = HIGHER %DELAYS) BY DAY OF W...
FIGURE 10.6 CONFUSION MATRIX AND CUMULATIVE GAINS CHART FOR THE FLIGHT DELAY...
FIGURE 10.7 GAINS CHART FOR THE LOGISTIC REGRESSION MODEL WITH FEWER PREDICT...
Chapter 11
FIGURE 11.1 MULTILAYER FEEDFORWARD NEURAL NETWORK
FIGURE 11.2 NEURAL NETWORK FOR THE TINY EXAMPLE. CIRCLES REPRESENT NODES (“N...
FIGURE 11.3 COMPUTING NODE OUTPUTS (ON THE RIGHT WITHIN EACH NODE) USING THE...
FIGURE 11.4 NEURAL NETWORK FOR THE TINY EXAMPLE WITH FINAL WEIGHTS FROM R OU...
FIGURE 11.5 LINE DRAWING, FROM AN 1893 FUNK AND WAGNALLS PUBLICATION
FIGURE 11.6 FOCUSING ON THE LINE OF THE MAN'S CHIN
FIGURE 11.7 PIXEL REPRESENTATION OF LINE ON MAN'S CHIN USING SHADING (A) A...
FIGURE 11.8 CONVOLUTION NETWORK PROCESS, SUPERVISED LEARNING: THE REPEATED F...
FIGURE 11.9 AUTOENCODER NETWORK PROCESS: THE REPEATED FILTERING IN THE NETWO...
FIGURE 11.10 FASHION MNIST DATA: SAMPLE OF 10 IMAGES FROM EACH CLASS
FIGURE 11.11 LEARNING CURVE FOR THE DEEP LEARNING NETWORK: TRAINING SET (BLU...
Chapter 12
FIGURE 12.1 SCATTER PLOT OF LOT SIZE VS. INCOME FOR 24 OWNERS AND NONOWNERS ...
FIGURE 12.2 PERSONAL LOAN ACCEPTANCE AS A FUNCTION OF INCOME AND CREDIT CARD...
FIGURE 12.3 CLASS SEPARATION OBTAINED FROM THE DISCRIMINANT MODEL COMPARED T...
Chapter 13
FIGURE 13.1 EXAMPLE OF BAGGING AND BOOSTING CLASSIFICATION TREES ON THE PERS...
FIGURE 13.2 EXPLAINING DECISIONS USING LIME. TOP: TWO CASES WHERE THE MODEL ...
Chapter 14
FIGURE 14.1 REINFORCEMENT LEARNING USING A CONTEXTUAL MULTI-ARM BANDIT (SIMU...
FIGURE 14.2 SCHEMATIC OF REINFORCEMENT LEARNING, WHERE THE SOFTWARE AGENT “L...
Chapter 15
FIGURE 15.1 RECOMMENDATIONS UNDER “FREQUENTLY BOUGHT TOGETHER” ARE BASED ON ...
Chapter 16
FIGURE 16.1 SCATTER PLOT OF FUEL COST VS. SALES FOR THE 22 UTILITIES
FIGURE 16.2 TWO-DIMENSIONAL REPRESENTATION OF SEVERAL DIFFERENT DISTANCE MEA...
FIGURE 16.3 DENDROGRAM: SINGLE LINKAGE (TOP) AND AVERAGE LINKAGE (BOTTOM) FO...
FIGURE 16.4 HEATMAP FOR THE 22 UTILITIES (IN ROWS). ROWS ARE SORTED BY THE S...
FIGURE 16.5 VISUAL PRESENTATION (PROFILE PLOT) OF CLUSTER CENTROIDS
FIGURE 16.6 COMPARING DIFFERENT CHOICES OF k IN TERMS OF OVERALL AVERAGE WIT...
Chapter 17
FIGURE 17.1 MONTHLY RIDERSHIP ON AMTRAK TRAINS (IN THOUSANDS) FROM JANUARY 1...
FIGURE 17.2 TIME PLOTS OF THE DAILY NUMBER OF VEHICLES PASSING THROUGH THE B...
FIGURE 17.3 PLOTS THAT ENHANCE THE DIFFERENT COMPONENTS OF THE TIME SERIES. ...
FIGURE 17.4 NAIVE AND SEASONAL NAIVE FORECASTS IN A 3-YEAR TEST SET FOR AMTR...
FIGURE 17.5 AVERAGE ANNUAL WEEKLY HOURS SPENT BY CANADIAN MANUFACTURING WORK...
Chapter 18
FIGURE 18.1 A LINEAR TREND FIT TO AMTRAK RIDERSHIP
FIGURE 18.2 A LINEAR TREND FIT TO AMTRAK RIDERSHIP IN THE TRAINING PERIOD AN...
FIGURE 18.3 EXPONENTIAL (AND LINEAR) TREND USED TO FORECAST AMTRAK RIDERSHIP
FIGURE 18.4 QUADRATIC TREND MODEL USED TO FORECAST AMTRAK RIDERSHIP. PLOTS O...
FIGURE 18.5 REGRESSION MODEL WITH SEASONALITY APPLIED TO THE AMTRAK RIDERSHI...
FIGURE 18.6 REGRESSION MODEL WITH TREND AND SEASONALITY APPLIED TO AMTRAK RI...
FIGURE 18.7 AUTOCORRELATION PLOT FOR LAGS 1–12 (FOR FIRST 24 MONTHS OF AMTRA...
FIGURE 18.8 AUTOCORRELATION PLOT OF FORECAST ERRORS SERIES FROM FIGURE 18.6...
FIGURE 18.9 FITTING AN AR(1) MODEL TO THE RESIDUAL SERIES FROM FIGURE 18.6
FIGURE 18.10 AUTOCORRELATIONS OF RESIDUALS-OF-RESIDUALS SERIES
FIGURE 18.11 S&P500 MONTHLY CLOSING PRICES SERIES
FIGURE 18.12 SEASONALLY ADJUSTED PRE-SEPTEMBER-11 AIR SERIES
FIGURE 18.13 AVERAGE ANNUAL WEEKLY HOURS SPENT BY CANADIAN MANUFACTURING WOR...
FIGURE 18.14 QUARTERLY REVENUES OF TOYS “R” US, 1992–1995...
FIGURE 18.15 DAILY CLOSE PRICE OF WALMART STOCK, FEBRUARY 2001–2002
FIGURE 18.16 (TOP) AUTOCORRELATIONS OF WALMART STOCK PRICES AND (BOTTOM) AUT...
FIGURE 18.17 DEPARTMENT STORE QUARTERLY SALES SERIES
FIGURE 18.18 FIT OF REGRESSION MODEL FOR DEPARTMENT STORE SALES
FIGURE 18.19 MONTHLY SALES AT AUSTRALIAN SOUVENIR SHOP IN DOLLARS (TOP) AND ...
FIGURE 18.20 QUARTERLY SHIPMENTS OF US HOUSEHOLD APPLIANCES OVER 5 YEARS
FIGURE 18.21 MONTHLY SALES OF SIX TYPES OF AUSTRALIAN WINES BETWEEN 1980 AND...
Chapter 19
FIGURE 19.1 SCHEMATIC OF CENTERED MOVING AVERAGE (TOP) AND TRAILING MOVING A...
FIGURE 19.2 CENTERED MOVING AVERAGE (GREEN LINE) AND TRAILING MOVING AVERAGE...
FIGURE 19.3 TRAILING MOVING AVERAGE FORECASTER WITH ... APPLIED TO AMTRAK RIDER...
FIGURE 19.4 OUTPUT FOR SIMPLE EXPONENTIAL SMOOTHING FORECASTER WITH ..., APPLI...
FIGURE 19.5 OUTPUT FOR HOLT–WINTERS EXPONENTIAL SMOOTHING APPLIED TO AMTRAK ...
FIGURE 19.6 SCHEMATIC OF RECURRENT NEURAL NETWORK (RNN)
FIGURE 19.7 ACTUAL AND FORECASTED VALUES USING THE DEEP LEARNING LSTM MODEL ...
FIGURE 19.8 SEASONALLY ADJUSTED PRE-SEPTEMBER-11 AIR SERIES
FIGURE 19.9 DEPARTMENT STORE QUARTERLY SALES SERIES
FIGURE 19.10 FORECASTS AND ACTUALS (TOP) AND FORECAST ERRORS (BOTTOM) USING ...
FIGURE 19.11 QUARTERLY SHIPMENTS OF US HOUSEHOLD APPLIANCES OVER 5 YEARS
FIGURE 19.12 MONTHLY SALES OF A CERTAIN SHAMPOO
FIGURE 19.13 QUARTERLY SALES OF NATURAL GAS OVER 4 YEARS
FIGURE 19.14 MONTHLY SALES OF SIX TYPES OF AUSTRALIAN WINES BETWEEN 1980 AND...
Chapter 20
FIGURE 20.1 TINY HYPOTHETICAL LINKEDIN NETWORK; THE EDGES REPRESENT CONNECTI...
FIGURE 20.2 TINY HYPOTHETICAL TWITTER NETWORK WITH DIRECTED EDGES (ARROWS) S...
FIGURE 20.3 EDGE WEIGHTS REPRESENTED BY LINE THICKNESS.
FIGURE 20.4 DRUG-LAUNDERING NETWORK IN SAN ANTONIO, TX
FIGURE 20.5 TWO DIFFERENT LAYOUTS OF THE TINY LINKEDIN NETWORK PRESENTED IN ...
FIGURE 20.6 THE DEGREE 1 (LEFT) AND DEGREE 2 (RIGHT) EGOCENTRIC NETWORKS FOR...
FIGURE 20.7 A RELATIVELY SPARSE NETWORK
FIGURE 20.8 A RELATIVELY DENSE NETWORK
FIGURE 20.9 THE NETWORKS FOR SUSPECT A (TOP), SUSPECT AA (MIDDLE), AND SUSPE...
FIGURE 20.10 NETWORK FOR LINK PREDICTION EXERCISE
Chapter 21
FIGURE 21.1 DECILE-WISE LIFT CHART FOR AUTOS-ELECTRONICS DOCUMENT CLASSIFICA...
FIGURE 21.2 ROC CURVE FOR THE SENTIMENT ANALYSIS MODEL TRAINED USING GLOVE W...
Chapter 22
FIGURE 22.1 MODEL ACCURACY AND AUC FOR DIFFERENT GROUPS
FIGURE 22.2 FALSE POSITIVE AND FALSE NEGATIVE RATES FOR DIFFERENT GROUPS
FIGURE 22.3 PARTIAL DEPENDENCE PLOTS (PDP) FOR PREDICTOR PRIORS_COUNT

MACHINE LEARNING FOR BUSINESS ANALYTICS
Concepts, Techniques, and Applications in R

Second Edition

GALIT SHMUELI, National Tsing Hua University
PETER C. BRUCE, statistics.com
PETER GEDECK, Collaborative Drug Discovery
INBAL YAHAV, Tel Aviv University
NITIN R. PATEL, Cytel, Inc.

This edition first published 2023
© 2023 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Galit Shmueli, Peter C. Bruce, Peter Gedeck, Inbal Yahav, and Nitin R. Patel to be identified as the authors of this work has been asserted in accordance with law.

Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com. Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty
The publisher and the authors make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of fitness for a particular purpose. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for every situation. In view of on-going research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or website is referred to in this work as a citation and/or potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the authors shall be liable for any damages arising herefrom.

Library of Congress Cataloging-in-Publication Data Applied for
Hardback ISBN: 9781119835172

Cover Design: Wiley
Cover Image: © Hiroshi Watanabe/Getty Images

The beginning of wisdom is this: Get wisdom, and whatever else you get, get insight.
–Proverbs 4:7

Foreword by Ravi Bapna

Converting data into an asset is the new business imperative facing modern managers. Each day the gap between what analytics capabilities make possible and companies' absorptive capacity for creating value from such capabilities increases. In many ways, data is the new gold—and mining this gold to create business value in today's context of a highly networked and digital society requires a skillset that we haven't traditionally delivered in business or statistics or engineering programs on their own.

For those businesses and organizations that feel overwhelmed by today's Big Data, the phrase you ain't seen nothing yet comes to mind. Yesterday's three major sources of Big Data—the 20+ years of investment in enterprise systems (ERP, CRM, SCM, etc.), the three billion plus people on the online social grid, and the close to five billion people carrying increasingly sophisticated mobile devices—are going to be dwarfed by tomorrow's smarter physical ecosystems fueled by the Internet of Things (IoT) movement.

The idea that we can use sensors to connect physical objects such as homes, automobiles, roads, and even garbage bins and streetlights to digitally optimized systems of governance goes hand in glove with bigger data and the need for deeper analytical capabilities. We are not far away from a smart refrigerator sensing that you are short on, say, eggs, populating your grocery store's mobile app's shopping list, and arranging a TaskRabbit to do a grocery run for you. Or the refrigerator negotiating a deal with an Uber driver to deliver an evening meal to you. Nor are we far away from sensors embedded in roads and vehicles that can compute traffic congestion, track roadway wear and tear, record vehicle use, and factor these into dynamic usage-based pricing, insurance rates, and even taxation. This brave new world is going to be fueled by analytics and the ability to harness data for competitive advantage.

Business Analytics is an emerging discipline that is going to help us ride this new wave. This new Business Analytics discipline requires individuals who are grounded in the fundamentals of business such that they know the right questions to ask; who have the ability to harness, store, and optimally process vast datasets from a variety of structured and unstructured sources; and who can then use an array of techniques from machine learning and statistics to uncover new insights for decision-making. Such individuals are a rare commodity today, but their creation has been the focus of this book for a decade now.

This book's forte is that it relies on explaining the core set of concepts required for today's business analytics professionals using real-world, data-rich cases in a hands-on manner, without sacrificing academic rigor. It provides a modern-day foundation for Business Analytics, the notion of linking the x's to the y's of interest in a predictive sense. I say this with the confidence of someone who was probably the first adopter of the zeroth edition of this book (Spring 2006 at the Indian School of Business).

The updated R version is much awaited. R is used by a wide variety of instructors in our MS-Business Analytics program. The open-innovation paradigm used by R is one key part of the analytics perfect storm, the other components being the advances in computing and the business appetite for data-driven decision-making.

The new edition also covers causal analytics as experimentation (often called A/B testing in the industry), which is now becoming mainstream in the tech companies. Further, the authors have added a new chapter on Responsible Data Science, a new part on AutoML, more on deep learning, and beefed-up deep learning examples in the text mining and forecasting chapters. These updates make this new edition "state of the art" with respect to modern business analytics and AI.

I look forward to using the book in multiple fora: in executive education, in MBA classrooms, in MS-Business Analytics programs, and in Data Science bootcamps. I trust you will too!

RAVI BAPNA
Carlson School of Management, University of Minnesota, 2022

Foreword by Gareth James

The field of statistics has existed in one form or another for 200 years, and by the second half of the 20th century had evolved into a well-respected and essential academic discipline. However, its prominence expanded rapidly in the 1990s with the explosion of new, and enormous, data sources. For the first part of this century, much of this attention was focused on biological applications, in particular, genetics data generated as a result of the sequencing of the human genome. However, the last decade has seen a dramatic increase in the availability of data in the business disciplines and a corresponding interest in business-related statistical applications.

The impact has been profound. Fifteen years ago, when I was able to attract a full class of MBA students to my new statistical learning elective, my colleagues were astonished because our department struggled to fill most electives. Today, we offer a Masters in Business Analytics, which is the largest specialized masters program in the school and has application volume rivaling those of our MBA programs. Our department's faculty size and course offerings have increased dramatically, yet the MBA students are still complaining that the classes are all full. Google's chief economist, Hal Varian, was indeed correct in 2009 when he stated that "the sexy job in the next 10 years will be statisticians."

This demand is driven by a simple, but undeniable, fact. Business analytics solutions have produced significant and measurable improvements in business performance, on multiple dimensions and in numerous settings, and as a result, there is a tremendous demand for individuals with the requisite skill set. However, training students in these skills is challenging given that, in addition to the obvious required knowledge of statistical methods, they need to understand business-related issues, possess strong communication skills, and be comfortable dealing with multiple computational packages. Most statistics texts concentrate on abstract training in classical methods, without much emphasis on practical, let alone business, applications.

This book has by far the most comprehensive review of business analytics methods that I have ever seen, covering everything from classical approaches such as linear and logistic regression to modern methods like neural networks, bagging and boosting, and even much more business-specific procedures such as social network analysis and text mining. If not the bible, it is at the least a definitive manual on the subject. However, just as important as the list of topics is the way that they are all presented in an applied fashion using business applications. Indeed, the last chapter is entirely dedicated to 10 separate cases where business analytics approaches can be applied.

In this latest edition, the authors have added an important new dimension in the form of the R software package. Easily the most widely used and influential open source statistical software, R has become the go-to tool for such purposes. With literally hundreds of freely available add-on packages, R can be used for almost any business analytics related problem. The book provides detailed descriptions and code involving applications of R in numerous business settings, ensuring that the reader will actually be able to apply their knowledge to real-life problems.

I would strongly recommend this book. I'm confident that it will be an indispensable tool for any MBA or business analytics course.

GARETH JAMES
Goizueta Business School, Emory University, 2022

Preface to the Second R Edition

This textbook first appeared in early 2007 and has been used by numerous students and practitioners and in many courses, including our own experience teaching this material both online and in person for more than 15 years. The first edition, based on the Excel add-in Analytic Solver Data Mining (previously XLMiner), was followed by two more Analytic Solver editions, a JMP edition, an R edition, a Python edition, a RapidMiner edition, and now this new R edition, with its companion website, www.dataminingbook.com.

This new R edition, which relies on the free and open source R software, presents output from R, as well as the code used to produce that output, including specification of a variety of packages and functions. Unlike computer-science or statistics-oriented textbooks, the focus in this book is on machine learning concepts and how to implement the associated algorithms in R. We assume a basic familiarity with R.

For this new R edition, a new co-author, Peter Gedeck, comes on board, bringing extensive data science experience in business. The new edition provides significant updates both in terms of R and in terms of new topics and content. In addition to updating R code and routines that have changed or become available since the first edition, the new edition provides the following (an illustrative sketch of this style appears at the end of this preface):

- A stronger focus on model selection using cross-validation with the use of the caret package
- Streamlined data preprocessing using tidyverse style
- Data visualization using ggplot
- Names of R packages, functions, and arguments are highlighted in the text, for easy readability

This edition also incorporates updates and new material based on feedback from instructors teaching MBA, MS, undergraduate, diploma, and executive courses, and from their students. Importantly, this edition includes several new topics:

- A dedicated section on deep learning in Chapter 11, with additional deep learning examples in text mining (Chapter 21) and time series forecasting (Chapter 19).
- A new chapter on Responsible Data Science (Chapter 22) covering topics of fairness, transparency, model cards and datasheets, legal considerations, and more, with an illustrative example.
- An expanded Performance Evaluation exposition in Chapter 5 that includes further metrics (precision and recall, F1); a small worked example follows this preface.
- A new chapter on Generating, Comparing, and Combining Multiple Models (Chapter 13) that covers ensembles, AutoML, and explaining model predictions.
- A new chapter dedicated to Interventions and User Feedback (Chapter 14) that covers A/B tests, uplift modeling, and reinforcement learning.
- A new case (Loan Approval) that touches on regulatory and ethical issues.

A note about the book's title: The first two editions of the book used the title Data Mining for Business Intelligence. Business intelligence today refers mainly to reporting and data visualization ("what is happening now"), while business analytics has taken over the "advanced analytics," which include predictive analytics and data mining. Later editions were therefore renamed Data Mining for Business Analytics. However, the recent AI transformation has made the term machine learning more popularly associated with the methods in this textbook. In this new edition, we therefore use the updated terms Machine Learning and Business Analytics.

Since the appearance of the (Analytic Solver-based) second edition, the landscape of the courses using the textbook has greatly expanded: whereas initially the book was used mainly in semester-long elective MBA-level courses, it is now used in a variety of courses in business analytics degrees and certificate programs, ranging from undergraduate programs to postgraduate and executive education programs. Courses in such programs also vary in their duration and coverage. In many cases, this textbook is used across multiple courses. The book is designed to continue supporting the general "predictive analytics" or "data mining" course as well as supporting a set of courses in dedicated business analytics programs.

A general "business analytics," "predictive analytics," or "machine learning" course, common in MBA and undergraduate programs as a one-semester elective, would cover Parts I–III and choose a subset of methods from Parts IV and V. Instructors can choose to use cases as team assignments, class discussions, or projects. For a two-semester course, Part VII might be considered, and we recommend introducing Part VIII (Data Analytics).

For a set of courses in a dedicated business analytics program, here are a few courses that have been using our book:

Predictive Analytics—Supervised Learning: In a dedicated business analytics program, the topic of predictive analytics is typically instructed across a set of courses. The first course would cover Parts I–III, and instructors typically choose a subset of methods from Part IV according to the course length. We recommend including Part VIII: Data Analytics.

Predictive Analytics—Unsupervised Learning: This course introduces data exploration and visualization, dimension reduction, mining relationships, and clustering (Parts II and VI). If this course follows the Predictive Analytics—Supervised Learning course, then it is useful to examine examples and approaches that integrate unsupervised and supervised learning, such as Part VIII on Data Analytics.

Forecasting Analytics: A dedicated course on time series forecasting would rely on Part VII.

Advanced Analytics: A course that integrates the learnings from predictive analytics (supervised and unsupervised learning) can focus on Part VIII: Data Analytics, where social network analytics and text mining are introduced, and responsible data science is discussed. Such a course might also include Chapter 13, Generating, Comparing, and Combining Multiple Models, from Part IV, as well as Part V, which covers experiments, uplift modeling, and reinforcement learning. Some instructors choose to use the cases (Chapter 23) in such a course.

In all courses, we strongly recommend including a project component, where data are either collected by students according to their interest or provided by the instructor (e.g., from the many machine learning competition datasets available). From our experience and other instructors' experience, such projects enhance the learning and provide students with an excellent opportunity to understand the strengths of machine learning and the challenges that arise in the process.

GALIT SHMUELI, PETER C. BRUCE, PETER GEDECK, INBAL YAHAV, AND NITIN R. PATEL
2022
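A minimal illustrative sketch of the caret / tidyverse / ggplot style referred to in the list of updates above. This is our own example, not code reproduced from the book: the Boston housing data shipped with the MASS package stands in for the book's datasets, and the names housing.df and cv.model are arbitrary.

library(tidyverse)
library(caret)

# Stand-in dataset: Boston housing data from the MASS package
housing.df <- MASS::Boston %>%
  rename(MEDV = medv)          # rename the outcome in the book's uppercase style

# Model selection with 10-fold cross-validation using the caret package
set.seed(1)
cv.model <- train(MEDV ~ ., data = housing.df, method = "lm",
                  trControl = trainControl(method = "cv", number = 10))
cv.model$results               # cross-validated RMSE, R-squared, and MAE

# Data visualization using ggplot
ggplot(housing.df, aes(x = rm, y = MEDV)) +
  geom_point(alpha = 0.5) +
  labs(x = "Average number of rooms", y = "Median home value ($1000s)")

Likewise, since the expanded Chapter 5 metrics (precision, recall, F1) are mentioned above, here is a small worked example; the confusion-matrix counts are hypothetical:

# Hypothetical counts for a two-class confusion matrix
TP <- 201; FP <- 25; FN <- 85; TN <- 2689

precision <- TP / (TP + FP)    # of records predicted positive, share truly positive
recall    <- TP / (TP + FN)    # of truly positive records, share detected
F1 <- 2 * precision * recall / (precision + recall)   # harmonic mean of the two
round(c(precision = precision, recall = recall, F1 = F1), 3)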

Acknowledgments We thank the many people who assisted us in improving the book from its inception as Data Mining for Business Intelligence in 2006 (using XLMiner, now Analytic Solver), its reincarnation as Data Mining for Business Analytics , and now Machine Learning for Business Analytics , including translations in Chinese and Korean and versions supporting Analytic Solver Data Mining, R, Python, SAS JMP, and RapidMiner. Anthony Babinec, who has been using earlier editions of this book for years in his data mining courses at Statistics.com , provided us with detailed and expert corrections. Dan Toy and John Elder IV greeted our project with early enthusiasm and provided detailed and useful comments on initial drafts. Ravi Bapna, who used an early draft in a data mining course at the Indian School of Business, and later at University of Minnesota, has provided invaluable comments and helpful suggestions since the book's start. Many of the instructors, teaching assistants, and students using earlier editions of the book have contributed invaluable feedback both directly and indirectly, through fruitful discussions, learning journeys, and interesting data mining projects that have helped shape and improve the book. These include MBA students from the University of Maryland, MIT, the Indian School of Business, National Tsing Hua University, and Statistics.com . Instructors from many universities and teaching programs, too numerous to list, have supported and helped improve the book since its inception. Several professors have been especially helpful with the first R edition: Hayri Tongarlak, Prashant Joshi (UKA Tarsadia University), Jay Annadatha, Roger Bohn, Sridhar Vaithianathan, Travis Greene, and Dianne Cook provided detailed comments and/or R code files for the companion website; Scott Nestler has been a helpful friend of this book project from the beginning. Kuber Deokar, instructional operations supervisor at Statistics.com , has been unstinting in his assistance, support, and detailed attention. We also thank Anuja Kulkarni, Poonam Patil, and Shweta Jadhav,

assistant teachers. Valerie Troiano has shepherded many instructors and students through the Statistics.com courses that have helped nurture the development of these books. Colleagues and family members have been providing ongoing feedback and assistance with this book project. Vijay Kamble at UIC and Travis Greene at NTHU have provided valuable help with the section on reinforcement learning. Boaz Shmueli and Raquelle Azran gave detailed editorial comments and suggestions on the first two editions; Bruce McCullough and Adam Hughes did the same for the first edition. Noa Shmueli provided careful proofs of the third edition. Ran Shenberger offered design tips. Ken Strasma, founder of the microtargeting firm HaystaqDNA and director of targeting for the 2004 Kerry campaign and the 2008 Obama campaign, provided the scenario and data for the section on uplift modeling. We also thank Jen Golbeck, Professor in the College of Information Studies at the University of Maryland and author of Analyzing the Social Web , whose book inspired our presentation in the chapter on social network analytics. Randall Pruim contributed extensively to the chapter on visualization. Marietta Tretter at Texas A&M shared comments and thoughts on the time series chapters, and Stephen Few and Ben Shneiderman provided feedback and suggestions on the data visualization chapter and overall design tips. Susan Palocsay and Mia Stephens have provided suggestions and feedback on numerous occasions, as has Margret Bjarnadottir. We also thank Catherine Plaisant at the University of Maryland's Human–Computer Interaction Lab, who helped out in a major way by contributing exercises and illustrations to the data visualization chapter. Gregory Piatetsky‐Shapiro, founder of KDNuggets.com , was generous with his time and counsel in the early years of this project. We thank colleagues at the Sloan School of Management at MIT for their support during the formative stage of this book—Dimitris Bertsimas, James Orlin, Robert Freund, Roy Welsch, Gordon Kaufmann, and Gabriel Bitran. As teaching assistants for the data mining course at Sloan, Adam Mersereau gave detailed comments on the notes and cases that were the genesis of this book, Romy Shioda

helped with the preparation of several cases and exercises used here, and Mahesh Kumar helped with the material on clustering. At the University of Maryland's Smith School of Business, Shrivardhan Lele, Wolfgang Jank, and Paul Zantek provided practical advice and comments. We thank Robert Windle and University of Maryland MBA students Timothy Roach, Pablo Macouzet, and Nathan Birckhead for invaluable datasets. We also thank MBA students Rob Whitener and Daniel Curtis for the heatmap and map charts. Anand Bodapati provided both data and advice. Jake Hofman from Microsoft Research and Sharad Borle assisted with data access. Suresh Ankolekar and Mayank Shah helped develop several cases and provided valuable pedagogical comments. Vinni Bhandari helped write the Charles Book Club case. We are grateful to colleagues at UMass Lowell's Manning School of Business for their encouragement and support in developing data analytics courses at the undergraduate and graduate levels that led to the development of this edition: Luvai Motiwalla, Harry Zhu, Thomas Sloan, Bob Li, and Sandra Richtermeyer. We also thank Michael Goul (late), Dan Power (late), Ramesh Sharda, Babita Gupta, Ashish Gupta, and Haya Ajjan from the Association for Information Systems' Decision Support and Analytics (SIGDSA) community for ideas and advice that helped the development of the book. We would like to thank Marvin Zelen, L. J. Wei, and Cyrus Mehta at Harvard, as well as Anil Gore at Pune University, for thought-provoking discussions on the relationship between statistics and data mining. Our thanks to Richard Larson of the Engineering Systems Division, MIT, for sparking many stimulating ideas on the role of data mining in modeling complex systems. Over two decades ago, they helped us develop a balanced philosophical perspective on the emerging field of machine learning. Lastly, we thank the folks at Wiley for this successful journey of nearly two decades. Steve Quigley at Wiley showed confidence in this book from the beginning and helped us navigate through the publishing process with great speed. Curt Hinrichs' vision, tips, and

encouragement helped bring the first edition of this book to the starting gate. Jon Gurstelle guided us through additional editions and translations. Brett Kurzman has taken over the reins and is now shepherding the project. Becky Cowan, Sarah Lemore, Kathleen Pagliaro, Katrina Maceda, and Kavya Ramu greatly assisted us in pushing ahead and finalizing this and the earlier R edition. We are also especially grateful to Amy Hendrickson, who assisted with typesetting and making this book beautiful.

PART I

Preliminaries  

CHAPTER 1

Introduction 1.1 WHAT IS BUSINESS ANALYTICS? Business Analytics (BA) is the practice and art of bringing quantitative data to bear on decision‐making. The term means different things to different organizations. Consider the role of analytics in helping newspapers survive the transition to a digital world. One tabloid newspaper with a working‐ class readership in Britain had launched a web version of the paper and did tests on its home page to determine which images produced more hits: cats, dogs, or monkeys. This simple application, for this company, was considered analytics. By contrast, the Washington Post has a highly influential audience that is of interest to big defense contractors: it is perhaps the only newspaper where you routinely see advertisements for aircraft carriers. In the digital environment, the Post can track readers by time of day, location, and user subscription information. In this fashion, the display of the aircraft carrier advertisement in the online paper may be focused on a very small group of individuals—say, the members of the House and Senate Armed Services Committees who will be voting on the Pentagon's budget. Business Analytics, or more generically, analytics , include a range of data analysis methods. Many powerful applications involve little more than counting, rule‐checking, and basic arithmetic. For some organizations, this is what is meant by analytics. The next level of business analytics, now termed Business Intelligence (BI), refers to data visualization and reporting for understanding “what happened and what is happening.” This is done by use of charts, tables, and dashboards to display, examine, and explore data. BI, which earlier consisted mainly of generating static reports, has evolved into more user‐friendly and effective tools and practices, such as creating interactive dashboards that allow the user

not only to access real-time data, but also to directly interact with it. Effective dashboards are those that tie directly into company data and give managers a tool to quickly see what might not readily be apparent in a large complex database. One such tool for industrial operations managers displays customer orders in a single two-dimensional display, using color and bubble size as added variables, showing customer name, type of product, size of order, and length of time to produce. Business Analytics now typically includes BI as well as sophisticated data analysis methods, such as statistical models and machine learning algorithms used for exploring data, quantifying and explaining relationships between measurements, and predicting new records. Methods like regression models are used to describe and quantify "on average" relationships (e.g., between advertising and sales), to predict new records (e.g., whether a new patient will react positively to a medication), and to forecast future values (e.g., next week's web traffic). Readers familiar with earlier editions of this book may have noticed that the book title has changed from Data Mining for Business Intelligence to Data Mining for Business Analytics and, finally, in this edition to Machine Learning for Business Analytics. The first change reflected the advent of the term BA, which overtook the earlier term BI to denote advanced analytics. Today, BI is used to refer to data visualization and reporting. The second change reflects how the term machine learning has overtaken the older term data mining.

WHO USES PREDICTIVE ANALYTICS? The widespread adoption of predictive analytics, coupled with the accelerating availability of data, has increased organizations' capabilities throughout the economy. A few examples are as follows: Credit scoring: One long‐established use of predictive modeling techniques for business prediction is credit scoring. A credit score is not some arbitrary judgment of creditworthiness; it is based mainly on a predictive model that uses prior data to predict repayment behavior. Future purchases: A more recent (and controversial) example is Target's use of predictive modeling to classify sales prospects as “pregnant” or “not‐pregnant.” Those classified as pregnant could then be sent sales promotions at an early stage of pregnancy, giving Target a head start on a significant purchase stream. Tax evasion: The US Internal Revenue Service found it was 25 times more likely to find tax evasion when enforcement activity was based on predictive models, allowing agents to focus on the most‐likely tax cheats (Siegel, 2013 ). The Business Analytics toolkit also includes statistical experiments, the most common of which is known to marketers as A/B testing. These are often used for pricing decisions: Orbitz, the travel site, found that it could price hotel options higher for Mac users than Windows users. Staples online store found it could charge more for staplers if a customer lived far from a Staples store. Beware the organizational setting where analytics is a solution in search of a problem: a manager, knowing that business analytics and machine learning are hot areas, decides that her organization must deploy them too, to capture that hidden value that must be lurking somewhere. Successful use of analytics and machine learning requires both an understanding of the business context where value

is to be captured and an understanding of exactly what the machine learning methods do.

1.2 WHAT IS MACHINE LEARNING? In this book, machine learning (or data mining) refers to business analytics methods that go beyond counts, descriptive techniques, reporting, and methods based on business rules. While we do introduce data visualization, which is commonly the first step into more advanced analytics, the book focuses mostly on the more advanced data analytics tools. Specifically, it includes statistical and machine learning methods that inform decision‐making, often in an automated fashion. Prediction is typically an important component, often at the individual level. Rather than “what is the relationship between advertising and sales,” we might be interested in “what specific advertisement, or recommended product, should be shown to a given online shopper at this moment?” Or we might be interested in clustering customers into different “personas” that receive different marketing treatment and then assigning each new prospect to one of these personas. The era of Big Data has accelerated the use of machine learning. Machine learning methods, with their power and automaticity, have the ability to cope with huge amounts of data and extract value.

1.3 MACHINE LEARNING, AI, AND RELATED TERMS The field of analytics is growing rapidly, both in terms of the breadth of applications and in terms of the number of organizations using advanced analytics. As a result, there is considerable overlap and inconsistency of definitions. Terms have also changed over time. The older term data mining itself means different things to different people. To the general public, it may have a general, somewhat hazy and pejorative meaning of digging through vast stores of (often personal) data in search of something interesting. Data mining , as it refers to analytic techniques, has largely been superseded by the

term machine learning . Other terms that organizations use are predictive analytics , predictive modeling , and most recently machine learning and artificial intelligence (AI) . Many practitioners, particularly those from the IT and computer science communities, use the term AI to refer to all the methods discussed in this book. AI originally referred to the general capability of a machine to act like a human, and, in its earlier days, existed mainly in the realm of science fiction and the unrealized ambitions of computer scientists. More recently, it has come to encompass the methods of statistical and machine learning discussed in this book, as the primary enablers of that grand vision, and sometimes the term is used loosely to mean the same thing as machine learning . More broadly, it includes generative capabilities such as the creation of images, audio, and video.

Statistical Modeling vs. Machine Learning A variety of techniques for exploring data and building models have been around for a long time in the world of statistics: linear regression, logistic regression, discriminant analysis, and principal components analysis, for example. However, the core tenets of classical statistics—computing is difficult and data are scarce—do not apply in machine learning applications where both data and computing power are plentiful. This gives rise to Daryl Pregibon's description of “data mining” (in the sense of machine learning) as “statistics at scale and speed” (Pregibon, 1999 ). Another major difference between the fields of statistics and machine learning is the focus in statistics on inference from a sample to the population regarding an “average effect”—for example, “a $1 price increase will reduce average demand by 2 boxes.” In contrast, the focus in machine learning is on predicting individual records—“the predicted demand for person i given a $1 price increase is 1 box, while for person j it is 3 boxes.” The emphasis that classical statistics places on inference (determining whether a pattern or interesting result might have happened by chance in our sample) is absent from machine learning. Note also that the term inference is often used in the machine learning community to refer to the process of using a model to make predictions for new data,

also called scoring , in contrast to its meaning in the statistical community. In comparison with statistics, machine learning deals with large datasets in an open‐ended fashion, making it impossible to put the strict limits around the question being addressed that classical statistical inference would require. As a result, the general approach to machine learning is vulnerable to the danger of overfitting , where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data, but random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal. In this book, we use the term machine learning algorithm to refer to methods that learn directly from data, especially local patterns, often in layered or iterative fashion. In contrast, we use statistical models to refer to methods that apply global structure to the data that can be written as a simple mathematical equation. A simple example is a linear regression model (statistical) vs. a k ‐nearest neighbors algorithm (machine learning). A given record would be treated by linear regression in accord with an overall linear equation that applies to all the records. In k ‐nearest neighbors, that record would be classified in accord with the values of a small number of nearby records.
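To make this contrast concrete, here is a minimal R sketch, using the built-in iris data and the knn() function from the class package purely for illustration (these examples are ours, not the book's):

library(class)  # provides a basic k-nearest-neighbors implementation

# statistical model: one global linear equation estimated from all records
lm.fit <- lm(Petal.Width ~ Petal.Length, data=iris)
predict(lm.fit, data.frame(Petal.Length=4.5))   # the same equation applies to every record

# machine learning algorithm: a new record is classified by its 3 nearest neighbors
new.rec <- data.frame(Petal.Length=4.5, Petal.Width=1.5)
knn(train=iris[, c("Petal.Length", "Petal.Width")], test=new.rec,
    cl=iris$Species, k=3)

The linear model summarizes all records in a single fitted equation, whereas knn() makes no global assumption and consults only the local neighborhood of the new record.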

1.4 BIG DATA Machine learning and Big Data go hand in hand. Big Data is a relative term—data today are big by reference to the past and to the methods and devices available to deal with them. The challenge Big Data presents is often characterized by the four V's—volume, velocity, variety, and veracity. Volume refers to the amount of data. Velocity refers to the flow rate—the speed at which it is being generated and changed. Variety refers to the different types of data being generated (time stamps, location, numbers, text, images, etc.). Veracity refers to the fact that data is being generated by organic distributed processes (e.g., millions of people signing up for services or free downloads) and not subject to the controls or quality checks that apply to data collected for a study.

Most large organizations face both the challenge and the opportunity of Big Data because most routine data processes now generate data that can be stored and, possibly, analyzed. The scale can be visualized by comparing the data in a traditional statistical analysis (say, 15 variables and 5000 records) to the Walmart database. If you consider the traditional statistical study to be the size of a period at the end of a sentence, then the Walmart database is the size of a football field. Moreover, that probably does not include other data associated with Walmart—social media data, for example, which comes in the form of unstructured text. If the analytical challenge is substantial, so can be the reward: OKCupid, the online dating site, uses statistical models with their data to predict what forms of message content are most likely to produce a response. Telenor, a Norwegian mobile phone service company, was able to reduce subscriber turnover 37% by using models to predict which customers were most likely to leave and then lavishing attention on them. Allstate, the insurance company, tripled the accuracy of predicting injury liability in auto claims by incorporating more information about vehicle type. The above examples are from Eric Siegel's book Predictive Analytics (2013, Wiley). Some extremely valuable tasks were not even feasible before the era of Big Data. Consider web searches, the technology on which Google was built. In early days, a search for “Ricky Ricardo Little Red Riding Hood” would have yielded various links to the I Love Lucy TV show, other links to Ricardo's career as a band leader, and links to the children's story of Little Red Riding Hood. Only once the Google database had accumulated sufficient data (including records of what users clicked on) would the search yield, in the top position, links to the specific I Love Lucy episode in which Ricky enacts, in a comic mixture of Spanish and English, Little Red Riding Hood for his infant son.

1.5 DATA SCIENCE The ubiquity, size, value, and importance of Big Data has given rise to a new profession: the data scientist . Data science is a mix of skills in the areas of statistics, machine learning, math, programming, business, and IT. The term itself is thus broader than the other concepts we discussed above, and it is a rare individual who combines deep skills in all the constituent areas. In their book Analyzing the Analyzers (Harris et al., 2013 ), the authors describe the skill sets of most data scientists as resembling a “T”—deep in one area (the vertical bar of the T) and shallower in other areas (the top of the T). At a large data science conference session (Strata+Hadoop World, October 2014), most attendees felt that programming was an essential skill, though there was a sizable minority who felt otherwise. Also, although Big Data is the motivating power behind the growth of data science, most data scientists do not actually spend most of their time working with terabyte‐size or larger data. Data of the terabyte or larger size would be involved at the deployment stage of a model. There are manifold challenges at that stage, most of them IT and programming issues related to data handling and tying together different components of a system. Much work must precede that phase. It is that earlier piloting and prototyping phase on which this book focuses—developing the statistical and machine learning models that will eventually be plugged into a deployed system. What methods do you use with what sorts of data and problems? How do the methods work? What are their requirements, their strengths, their weaknesses? How do you assess their performance?

1.6 WHY ARE THERE SO MANY DIFFERENT METHODS? As can be seen in this book or any other resource on machine learning, there are many different methods for prediction and classification. You might ask yourself why they coexist and whether some are better than others. The answer is that each method has

advantages and disadvantages. The usefulness of a method can depend on factors such as the size of the dataset, the types of patterns that exist in the data, whether the data meet some underlying assumptions of the method, how noisy the data are, and the particular goal of the analysis. A small illustration is shown in Figure 1.1 , where the goal is to find a combination of household income level and household lot size that separates buyers (solid circles) from nonbuyers (hollow circles) of riding mowers. The first method (left panel) looks only for horizontal and vertical lines to separate buyers from nonbuyers, whereas the second method (right panel) looks for a single diagonal line.

FIGURE 1.1 TWO METHODS FOR SEPARATING OWNERS FROM NONOWNERS

Different methods can lead to different results, and their performance can vary. It is therefore customary in machine learning to apply several different methods and select the one that appears most useful for the goal at hand.

1.7 TERMINOLOGY AND NOTATION Because of the hybrid parentage of data science, its practitioners often use multiple terms to refer to the same thing. For example, in the machine learning and artificial intelligence fields, the variable being predicted is the output variable or target variable. A categorical target variable is often called a label. To a statistician or social scientist, the variable being predicted is the dependent variable or the response. Here is a summary of terms used:

Algorithm  A specific procedure used to implement a particular machine learning technique: classification tree, discriminant analysis, and the like.
Attribute  see Predictor.
Case  see Observation.
Categorical variable  A variable that takes on one of several fixed values, e.g., a flight could be on-time, delayed, or canceled.
Confidence  A performance measure in association rules of the type "IF A and B are purchased, THEN C is also purchased." Confidence is the conditional probability that C will be purchased IF A and B are purchased. Confidence also has a broader meaning in statistics (confidence interval), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.
Dependent variable  see Response.
Estimation  see Prediction.
Factor variable  see Categorical variable.
Feature  see Predictor.
Holdout data (or Holdout set)  A sample of data not used in fitting a model, but instead used only at the end of the model building and selection process to assess how well the final model might perform on new data. This book uses the term holdout set instead of validation set and test set.
Inference  In statistics, the process of accounting for chance variation when making estimates or drawing conclusions based on samples; in machine learning, the term often refers to the process of using a model to make predictions for new data (see Score).
Input variable  see Predictor.
Label  A categorical variable being predicted in supervised learning.

Model  An algorithm as applied to a dataset, complete with its settings (many of the algorithms have parameters that the user can adjust).
Observation  The unit of analysis on which the measurements are taken (a customer, a transaction, etc.), also called instance, sample, example, case, record, pattern, or row. In spreadsheets, each row typically represents a record; each column, a variable. Note that the use of the term "sample" here is different from its usual meaning in statistics, where it refers to a collection of observations.
Outcome variable  see Response.
Output variable  see Response.

P(A | B)  The conditional probability of event A occurring given that event B has occurred, read as "the probability that A will occur given that B has occurred."
Prediction  The prediction of the numerical value of a continuous output variable; also called estimation.
Predictor  A variable, usually denoted by X, used as an input into a predictive model, also called a feature, input variable, independent variable, or, from a database perspective, a field.
Profile  A set of measurements on an observation (e.g., the height, weight, and age of a person).
Record  see Observation.
Response  A variable, usually denoted by Y, which is the variable being predicted in supervised learning, also called dependent variable, output variable, target variable, or outcome variable.
Sample  In the statistical community, "sample" means a collection of observations. In the machine learning community, "sample" means a single observation.
Score  A predicted value or class. Scoring new data means using a model developed with training data to predict output values in new data.

Success class  The class of interest in a binary outcome (e.g., purchasers in the outcome purchase/no purchase); the outcome need not be favorable.
Supervised learning  The process of providing an algorithm (logistic regression, classification tree, etc.) with records in which an output variable of interest is known, and the algorithm "learns" how to predict this value for new records where the output is unknown.
Target  see Response.
Test data (or Test set)  Sometimes used to refer to the portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on new data. This book uses the term holdout set instead and uses the term validation to refer to certain validation checks (e.g., cross-validation) during the model-tuning phase.
Training data (or Training set)  The portion of the data used to fit a model.
Unsupervised learning  An analysis in which one attempts to learn patterns in the data other than predicting an output value of interest.
Validation data (or Validation set)  The portion of the data used to assess how well the model fits, to adjust models, and to select the best model from among those that have been tried.
Variable  Any measurement on the records, including both the input (X) variables and the output (Y) variable.

1.8 ROAD MAPS TO THIS BOOK The book covers many of the widely used predictive and classification methods as well as other machine learning tools. Figure 1.2 outlines machine learning from a process perspective and where the topics in this book fit in. Chapter numbers are indicated beside the topic. Table 1.1 provides a different perspective: it

organizes supervised and unsupervised machine learning procedures according to the type and structure of the data .

FIGURE 1.2 MACHINE LEARNING FROM A PROCESS PERSPECTIVE. NUMBERS IN PARENTHESES INDICATE CHAPTER NUMBERS

TABLE 1.1 ORGANIZATION OF MACHINE LEARNING METHODS IN THIS BOOK, ACCORDING TO THE NATURE OF THE DATA (a)

Supervised, continuous response:
  Continuous predictors: Linear regression (6), Neural nets (11), k-Nearest neighbors (7), Ensembles (13)
  Categorical predictors: Linear regression (6), Neural nets (11), Regression trees (9), Ensembles (13)

Supervised, categorical response:
  Continuous predictors: Logistic regression (10), Neural nets (11), Discriminant analysis (12), k-Nearest neighbors (7), Ensembles (13)
  Categorical predictors: Neural nets (11), Classification trees (9), Logistic regression (10), Naive Bayes (8), Ensembles (13)

Unsupervised, no response:
  Continuous predictors: Principal components (4), Cluster analysis (16), Collaborative filtering (15)
  Categorical predictors: Association rules (15), Collaborative filtering (15)

(a) Numbers in parentheses indicate the chapter number.

Order of Topics The book is divided into nine parts: Part I ( Chapters 1 and 2 ) gives a general overview of machine learning and its components. Part II ( Chapters 3 and 4 ) focuses on the early stages of data exploration and dimension reduction. Part III ( Chapter 5 ) discusses performance evaluation. Although it contains only one chapter, we discuss a variety of topics, from predictive performance metrics to misclassification costs. The

principles covered in this part are crucial for the proper evaluation and comparison of supervised learning methods. Part IV includes eight chapters (Chapters 6–13), covering a variety of popular supervised learning methods (for classification and/or prediction). Within this part, the topics are generally organized according to the level of sophistication of the algorithms, their popularity, and ease of understanding. The final chapter introduces ensembles and combinations of methods. Part V (Chapter 14) introduces the notions of experiments, intervention, and user feedback. This single chapter starts with A/B testing, moves to its use in uplift modeling, and finally expands into reinforcement learning, explaining the basic ideas and formulations that utilize user feedback for learning best treatment assignments. Part VI focuses on unsupervised mining of relationships. It presents association rules and collaborative filtering (Chapter 15) and cluster analysis (Chapter 16). Part VII includes three chapters (Chapters 17–19), with the focus on forecasting time series. The first chapter covers general issues related to handling and understanding time series. The next two chapters present two popular forecasting approaches: regression-based forecasting and smoothing methods. Part VIII presents two broad data analytics topics: social network analysis (Chapter 20) and text mining (Chapter 21). These methods apply machine learning to specialized data structures: social networks and text. The final chapter on responsible data science (Chapter 22) introduces key issues to consider when carrying out a machine learning project in a responsible way. Finally, Part IX includes a set of cases. Although the topics in the book can be covered in the order of the chapters, each chapter stands alone. We advise, however, reading Parts I–III before proceeding to chapters in Parts IV–V. Similarly, Chapter 17 should precede the other chapters in Part VII.

USING R AND RSTUDIO To facilitate a hands-on machine learning experience, this book uses R, a free software environment for statistical computing and graphics, and RStudio, an integrated development environment (IDE) for R. The R programming language is widely used in academia and industry for machine learning and data analysis. R offers a variety of methods for analyzing data, provided by many separate packages. Among the numerous packages, R has extensive coverage of statistical and machine learning techniques for classification, prediction, mining associations and text, forecasting, and data exploration and reduction. It offers a variety of supervised machine learning tools: neural nets, classification and regression trees, k-nearest-neighbor classification, naive Bayes, logistic regression, linear regression, and discriminant analysis, all for predictive modeling. R's packages also cover unsupervised algorithms: association rules, collaborative filtering, principal components analysis, k-means clustering, and hierarchical clustering, as well as visualization tools and data-handling utilities. Often, the same method is implemented in multiple packages, as we will discuss throughout the book. The illustrations, exercises, and cases in this book are written in relation to R. Download: To download R and RStudio, visit www.r-project.org and www.rstudio.com/products/RStudio and follow the instructions there. Installation: Install both R and RStudio. Note that R releases new versions fairly often. When a new version is released, some packages might require a new installation of R (this is rare). Use: To start using R, open RStudio, then open a new script under File > New File > R Script. RStudio contains four panels as shown in Figure 1.3: Script (top left), Console (bottom left), Environment (top right), and additional information, such as plot and help (bottom right). To run a selected code line from the Script panel, press Ctrl+Enter. Code lines starting with # are comments.

Package Installation: To start using an R package, you will first need to install it. Installation is done via the information panel (tab "Packages") or using the command install.packages(). New packages might not support old R versions and might require a new R installation. Source Code Availability: The code is available from the accompanying websites https://www.dataminingbook.com/ and https://github.com/gedeck/mlba-R-code; data are available in the R package mlba.
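For example, to install and then load a package (shown here with the fastDummies package, which is used later in this book; any CRAN package works the same way):

install.packages("fastDummies")  # one-time installation from CRAN
library(fastDummies)             # load the package at the start of each session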

FIGURE 1.3 RSTUDIO SCREEN

CHAPTER 2

Overview of the Machine Learning Process In this chapter, we give an overview of the steps involved in machine learning (ML), starting from a clear goal definition and ending with model deployment. The general steps are shown schematically in Figure 2.1 . We also discuss issues related to data collection, cleaning, and preprocessing. We introduce the notion of data partitioning , where methods are trained on a set of training data and then their performance is evaluated on a separate set of holdout data, as well as explain how this practice helps avoid overfitting. Finally, we illustrate the steps of model building by applying them to data.

FIGURE 2.1 SCHEMATIC OF THE DATA MODELING PROCESS

2.1 INTRODUCTION In Chapter 1 , we saw some very general definitions of business analytics and machine learning. In this chapter, we introduce a variety of machine learning methods. The core of this book focuses on what has come to be called predictive analytics , the tasks of classification and prediction as well as pattern discovery, which have become key elements of a “business analytics” function in most large firms. These terms are described next.

2.2 CORE IDEAS IN MACHINE LEARNING Classification Classification is perhaps the most basic form of predictive analytics. The recipient of an offer can respond or not respond. An applicant for a loan can repay on time, repay late, or declare bankruptcy. A credit card transaction can be normal or fraudulent. A packet of data traveling on a network can be benign or threatening. A bus in a fleet can be available for service or unavailable. The victim of an illness can be recovered, still be ill, or be deceased. A common task in machine learning is to examine data where the classification is unknown or will occur in the future, with the goal of predicting what that classification is or will be. Similar data where the classification is known are used to develop rules, which are then applied to the data with the unknown classification.

Prediction Prediction is similar to classification, except that we are trying to predict the value of a numerical variable (e.g., amount of purchase) rather than a class (e.g., purchaser or nonpurchaser). Of course, in classification we are trying to predict a class, but the term prediction in this book refers to the prediction of the value of a continuous numerical variable. Sometimes in the machine learning literature, the terms estimation and regression are used to refer to the prediction of the value of a continuous variable, and prediction may be used for both continuous and categorical data.

Association Rules and Recommendation Systems Large databases of customer transactions lend themselves naturally to the analysis of associations among items purchased, or “what goes with what.” Association rules , or affinity analysis , is designed

to find such general association patterns between items in large databases. The rules can then be used in a variety of ways. For example, grocery stores can use such information for product placement. They can use the rules for weekly promotional offers or for bundling products. Association rules derived from a hospital database on patients' symptoms during consecutive hospitalizations can help find "which symptom is followed by what other symptom" to help predict future symptoms for returning patients. Online recommendation systems, such as those used on Amazon and Netflix, use collaborative filtering, a method that uses individual users' preferences and tastes given their historic purchase, rating, browsing, or any other measurable behavior indicative of preference, as well as other users' history. In contrast to association rules that generate rules general to an entire population, collaborative filtering generates "what goes with what" at the individual user level. Hence, collaborative filtering is used in many recommendation systems that aim to deliver personalized recommendations to users with a wide range of preferences.

Predictive Analytics Classification, prediction, and, to some extent, association rules and collaborative filtering constitute the analytical methods employed in predictive analytics . The term predictive analytics is sometimes used to also include data pattern identification methods such as clustering.

Data Reduction and Dimension Reduction The performance of some machine learning algorithms is often improved when the number of variables is limited, and when large numbers of records can be grouped into homogeneous groups. For example, rather than dealing with thousands of product types, an analyst might wish to group them into a smaller number of groups and build separate models for each group. Or a marketer might want to classify customers into different “personas” and must therefore group customers into homogeneous groups to define the personas. This process of consolidating a large number of records (or cases) into a smaller set is termed data reduction . Methods for reducing the number of cases are often called clustering . Reducing the number of variables is typically called dimension reduction . Dimension reduction is a common initial step before deploying machine learning methods, intended to improve predictive power, manageability, and interpretability.

Data Exploration and Visualization One of the earliest stages of engaging with data is exploring it. Exploration is aimed at understanding the global landscape of the data and detecting unusual values. Exploration is used for data cleaning and manipulation as well as for visual discovery and “hypothesis generation.” Methods for exploring data include looking at various data aggregations and summaries, both numerically and graphically. This includes looking at each variable separately as well as looking at relationships among variables. The purpose is to discover patterns and exceptions. Exploration by creating charts and dashboards is called data visualization or visual analytics . For numerical variables, we use histograms and boxplots to learn about the distribution of their values, to detect outliers (extreme observations), and to find other information that is relevant to the analysis task. Similarly, for categorical variables, we use bar charts. We can also look at scatter plots of pairs of numerical variables to learn about possible relationships, the type of relationship, and, again, to detect outliers. Visualization can be greatly enhanced by adding features such as color and interactive navigation.
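As a small R sketch of these basic exploration plots (we use the mlba::WestRoxbury data described later in this chapter; any data frame with numerical and categorical variables would do):

housing.df <- mlba::WestRoxbury
hist(housing.df$TOTAL.VALUE)                      # distribution of a numerical variable
boxplot(housing.df$TOTAL.VALUE)                   # flags extreme values (outliers)
barplot(table(housing.df$REMODEL))                # counts for a categorical variable
plot(TOTAL.VALUE ~ LIVING.AREA, data=housing.df)  # relationship between two numerical variables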

Supervised and Unsupervised Learning A fundamental distinction among machine learning techniques is between supervised and unsupervised methods. Supervised learning algorithms are those used in classification and prediction. We must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known. Such data are also called “labeled data,” since they contain the label (outcome value) for each record. The use of the term “label” reflects the fact that the outcome of interest for a record may often be a characterization applied by a human: a document may be labeled as relevant, or an object in an X‐ray may be labeled as malignant. These training data are the data from which the classification or prediction algorithm “learns,” or is “trained,” about the relationship between predictor variables and the outcome variable. Once the algorithm has learned from the training data, it is then applied to another sample of data (the validation data ) where the outcome is known, to see how well it does in comparison

to other models (either a different algorithm or different parameter values of the same algorithm). If many different models are being tried out, it is prudent to save a third sample, which also includes known outcomes (the holdout data ) to use with the model finally selected to predict how well it will do. The model can then be used to classify or predict the outcome of interest in new cases where the outcome is unknown. Simple linear regression is an example of a supervised learning algorithm (although rarely called that in the introductory statistics course where you probably first encountered it). The Y variable is the (known) outcome variable, and the X variable is a predictor variable. A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by this line. The regression line can now be used to predict Y values for new values of X for which we do not know the Y value. Unsupervised learning algorithms are those used where there is no outcome variable to predict or classify. Hence, there is no “learning” from cases where such an outcome variable is known. Association rules, dimension reduction methods, and clustering techniques are all unsupervised learning methods. Supervised and unsupervised methods are sometimes used in conjunction. For example, unsupervised clustering methods are used to separate loan applicants into several risk‐level groups. Then, supervised algorithms are applied separately to each risk‐level group for predicting propensity of loan default.
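The distinction is easy to see in code. A minimal R sketch, using the built-in mtcars data purely for illustration:

# supervised: the outcome (mpg) is known for every training record
fit <- lm(mpg ~ wt, data=mtcars)            # "learn" the relationship from labeled data
predict(fit, newdata=data.frame(wt=3.0))    # predict mpg where it is unknown

# unsupervised: no outcome variable; simply group similar records
set.seed(1)
kmeans(scale(mtcars[, c("mpg", "wt")]), centers=3)$cluster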

SUPERVISED LEARNING REQUIRES GOOD SUPERVISION In some cases, the value of the outcome variable (the “label”) is known because it is an inherent component of the data. Web logs will show whether a person clicked on a link or not. Bank records will show whether a loan was paid on time or not. In other cases, the value of the known outcome must be supplied by a human labeling process to accumulate enough data to train a model. E‐mail must be labeled as spam or legitimate, documents in legal discovery must be labeled as relevant or irrelevant. In either case, the machine learning algorithm can be led astray if the quality of the supervision is poor. Gene Weingarten reported in the January 5, 2014 Washington Post magazine how the strange phrase “defiantly recommend” is making its way into English via autocorrection. “Defiantly” is closer to the common misspelling definatly than definitely , so Google, in the early days, offered it as a correction when users typed the misspelled word “definatly” on google.com . In the ideal supervised learning model, humans guide the autocorrection process by rejecting defiantly and substituting definitely . Google's algorithm would then learn that this is the best first‐choice correction of “definatly.” The problem was that too many people were lazy, just accepting the first correction that Google presented. All these acceptances then cemented “defiantly” as the proper correction.

2.3 THE STEPS IN A MACHINE LEARNING PROJECT This book focuses on understanding and using machine learning algorithms (Steps 4–7 below). However, some of the most serious errors in analytics projects result from a poor understanding of the problem—an understanding that must be developed before we get into the details of algorithms to be used. Here is a list of steps to be taken in a typical machine learning effort: 1. Develop an understanding of the purpose of the machine learning project : How will the stakeholder use the results? Who will be affected by the results? Will the analysis be a one‐shot effort or an ongoing procedure? 2. Obtain the data to be used in the analysis : This often involves random sampling from a large database to capture records to be used in an analysis. How well this sample reflects the records of interest affects the ability of the machine learning results to generalize to records outside of this sample. It may also involve pulling together data from different databases or sources. The databases could be internal (e.g., past purchases made by customers) or external (credit ratings). While machine learning deals with very large databases, usually the analysis to be done requires only thousands or tens of thousands of records.

3. Explore, clean, and preprocess the data : This step involves verifying that the data are in reasonable condition. How should missing data be handled? Are the values in a reasonable range, given what you would expect for each variable? Are there obvious outliers ? The data are reviewed graphically: for example, a matrix of scatterplots showing the relationship of each variable with every other variable. We also need to ensure consistency in the definitions of fields, units of measurement, time periods, and so on. In this step, new variables are also typically created from existing ones. For example, “duration” can be computed from start and end dates. 4. Reduce the data dimension, if necessary : Dimension reduction can involve operations such as eliminating unneeded variables, transforming variables (e.g., turning “money spent” into “spent

$100” vs. “spent $100”), and creating new variables (e.g., a variable that records whether at least one of several products was purchased). Make sure that you know what each variable means and whether it is sensible to include it in the model. 5. Determine the machine learning task (classification, prediction, clustering, etc.) : This involves translating the general question or problem of Step 1 into a more specific machine learning question. 6. Partition the data (for supervised tasks) : If the task is supervised (classification or prediction), randomly partition the dataset into three parts: training, validation, and holdout datasets. 7. Choose the machine learning techniques to be used (regression, neural nets, hierarchical clustering, etc.) . 8. Use algorithms to perform the task : This is typically an iterative process—trying multiple variants and often using multiple variants of the same algorithm (choosing different variables or parameter settings within the algorithm). Where appropriate, feedback from the algorithm's performance on validation data is used to refine the parameter settings. 9. Interpret the results of the algorithms : This involves making a choice as to the best algorithm to deploy, and where possible, testing the final choice on the holdout data to get an idea as to how well it will perform. (Recall that each algorithm may also be tested on the validation data for tuning purposes; in this way, the validation data become a part of the fitting process and are likely to underestimate the error in the deployment of the model that is finally chosen.) 10. Deploy the model : This step involves integrating the model into operational systems and running it on real records to produce decisions or actions. For example, the model might be applied to a purchased list of possible customers, and the action might be “include in the mailing if the predicted amount of purchase is $10.” A key step here is “scoring” the new records, or using the chosen model to predict the outcome value (“score”) for each new record. The foregoing steps encompass the steps in SEMMA, a methodology developed by the software company SAS: Sample Take a sample from the dataset; partition into training, validation, and holdout datasets. Explore Examine the dataset statistically and graphically. Modify Transform the variables and impute missing values. Model Fit predictive models (e.g., regression tree, neural network). Assess Compare models using a validation dataset. IBM SPSS Modeler (previously SPSS‐Clementine) has a similar methodology, termed CRISP‐DM (CRoss‐Industry Standard Process for Data Mining). All these frameworks include the same main steps involved in predictive modeling.

2.4 PRELIMINARY STEPS Organization of Data Datasets are nearly always constructed and displayed so that variables are in columns and records are in rows. In the example shown in Section  2.6 (home values in West Roxbury, Boston, in 2014), 14 variables

are recorded for over 5000 homes. The spreadsheet is organized so that each row represents a home—the first home's assessed value was $344,200, its tax was $4330, its size was 9965 square feet, it was built in 1880, and so on. In supervised learning situations, one of these variables will be the outcome variable, typically listed in the first or last column (in this case, it is TOTAL.VALUE, in the first column).

Predicting Home Values in the West Roxbury Neighborhood The Internet has revolutionized the real estate industry. Realtors now list houses and their prices on the web, and estimates of house and condominium prices have become widely available, even for units not on the market. In 2014, Zillow (www.zillow.com), a popular online real estate information site in the United States, 1 purchased its major rival, Trulia. By 2015, Zillow had become the dominant platform for checking house prices and, as such, the dominant online advertising venue for realtors. By 2021, another competitor, Redfin, had eclipsed Zillow in market capitalization, largely via its strategy of employing its own agents directly on a salaried basis. What used to be a comfortable 6% commission structure for independent realtors, affording them a handsome surplus (and an oversupply of realtors), was being rapidly eroded by an increasing need to pay for advertising on Zillow and by competition from Redfin. (This, in fact, was the original key to Zillow's business model—redirecting the 6% commission away from realtors and to itself.) Zillow gets much of the data for its "Zestimates" of home values directly from publicly available city housing data, used to estimate property values for tax assessment. A competitor seeking to get into the market would likely take the same approach. So might realtors seeking to develop an alternative to Zillow. A simple approach would be a naive, model-less method—just use the assessed values as determined by the city. Those values, however, do not necessarily include all properties, and they might not include changes warranted by remodeling, additions, etc. Moreover, the assessment methods used by cities may not be transparent or always reflect true market values. However, the city property data can be used as a starting point to build a model, to which additional data (such as that collected by large realtors) can be added later. Let's look at how Boston property assessment data, available from the city of Boston, might be used to predict home values. The data in mlba::WestRoxbury includes information on single-family owner-occupied homes in West Roxbury, a neighborhood in southwest Boston, MA, in 2014. The data include values for various predictor variables and for an outcome—assessed home value ("total value"). This dataset has 14 variables and includes 5802 homes. A sample of the data 2 is shown in Table 2.1, and the "data dictionary" describing each variable 3 is in Table 2.2. As we saw earlier, below the header row, each row in the data represents a home. For example, the first home was assessed at a total value of $344.2 thousand (TOTAL.VALUE). Its tax bill was $4330. It has a lot size of 9965 square feet, was built in the year 1880, has two floors, six rooms, and so on.

Loading and Looking at the Data in R To load data into R, we will typically want to have the data available as a .csv (comma-separated values) file. If the data are in an .xls or .xlsx file, we can save that same file in Excel as a .csv file: go to File > Save As > Save as type: CSV (Comma delimited) (*.csv) > Save. Note: When dealing with .csv files in Excel, beware of two things: Opening a .csv file in Excel strips off leading 0's, which corrupts zip code data. Saving a .csv file in Excel saves only the digits that are displayed; if you need precision to a certain number of decimals, you need to ensure they are displayed before saving.
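If you read such a file directly into R instead, you can protect leading zeros by forcing the column to be read as text. In this sketch, the file name customers.csv and the column name zipcode are hypothetical:

# keep zip codes as character strings so "02134" does not become 2134
customers.df <- read.csv("customers.csv",
                         colClasses=c(zipcode="character"))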

TABLE 2.1 FIRST 10 RECORDS IN THE WEST ROXBURY HOME VALUES DATASET

TOTAL.VALUE   TAX  LOT.SQFT  YR.BUILT  GROSS.AREA  LIVING.AREA  FLOORS  ROOMS  BEDROOMS  FULL.BATH  HALF.BATH  KITCHEN
      344.2  4330      9965      1880        2436         1352       2      6         3          1          1        1
      412.6  5190      6590      1945        3108         1976       2     10         4          2          1        1
      330.1  4152      7500      1890        2294         1371       2      8         4          1          1        1
      498.6  6272     13773      1957        5032         2608       1      9         5          1          1        1
      331.5  4170      5000      1910        2370         1438       2      7         3          2          0        1
      337.4  4244      5142      1950        2124         1060       1      6         3          1          0        1
      359.4  4521      5000      1954        3220         1916       2      7         3          1          1        1
      320.4  4030     10000      1950        2208         1200       1      6         3          1          0        1
      333.5  4195      6835      1958        2582         1092       1      5         3          1          0        1
      409.4  5150      5093      1900        4818         2992       2      8         4          2          0        1

TABLE 2.2 DESCRIPTION OF VARIABLES IN WEST ROXBURY (BOSTON) HOME VALUE DATASET

TOTAL.VALUE   Total assessed value for property, in thousands of USD
TAX           Tax bill amount based on total assessed value multiplied by the tax rate, in USD
LOT.SQFT      Total lot size of parcel in square feet
YR.BUILT      Year the property was built
GROSS.AREA    Gross floor area
LIVING.AREA   Total living area for residential properties, in square feet
FLOORS        Number of floors
ROOMS         Total number of rooms
BEDROOMS      Total number of bedrooms
FULL.BATH     Total number of full baths
HALF.BATH     Total number of half baths
KITCHEN       Total number of kitchens
FIREPLACE     Total number of fireplaces
REMODEL       When the house was remodeled (recent/old/none)

Once we have R and RStudio installed on our machine and the WestRoxbury.csv file saved as a .csv file, we can run the code in Table 2.3 to load the data into R. Data from a .csv file is stored in R as a data frame (e.g., housing.df). A data frame is the fundamental object almost all analyses begin with in R. A data frame has rows and columns. The rows are the observations for each case (e.g., house), and the columns are the variables of interest (e.g., TOTAL.VALUE, TAX). If our .csv file has column headers, these headers get automatically stored as the column names of our data.

TABLE 2.3 WORKING WITH FILES IN R

To start, open RStudio, go to File > New File > R Script. It opens a new tab.

Then save your Untitled1.R file into the directory where your csv is saved. Give it the name WestRoxbury.R.

From the Menu Bar, go to Session > Set Working Directory > To Source File Location; this sets the working directory as the place where both the R file and csv file are saved.

code for loading and creating subsets from the data

housing.df <- mlba::WestRoxbury  # load the data from the mlba package
str(housing.df$REMODEL)          # examine the structure of the REMODEL variable

Output

> str(housing.df$REMODEL)
 Factor w/ 3 levels "None","Old","Recent": 1 3 1 1 1 2 1 1 3 1 ...
> levels(housing.df$REMODEL)
[1] "None" "Old" "Recent"

Handling Categorical Variables Categorical variables can also be handled by most machine learning routines, but often require special handling. If the categorical variable is ordered (age group, degree of creditworthiness, etc.), we can sometimes code the categories numerically (1, 2, 3, …) and treat the variable as if it were a continuous variable. The smaller the number of categories, and the less they represent equal increments of value, the more problematic this approach becomes, but it often works well enough.
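For example, an ordered variable such as a credit rating could be coded numerically as follows (the category labels are illustrative):

ratings <- factor(c("low", "high", "medium", "low"),
                  levels=c("low", "medium", "high"), ordered=TRUE)
as.numeric(ratings)  # yields 1 3 2 1, usable as a rough numerical variable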

Nominal categorical variables, however, often cannot be used as is. In many cases, they must be decomposed into a series of binary variables, called dummy variables. For example, a single categorical variable that can have possible values of "student," "unemployed," "employed," or "retired" would be split into four separate dummy variables:

Student—1/0
Unemployed—1/0
Employed—1/0
Retired—1/0

In some algorithms (e.g., regression models), only three of the dummy variables should be used; if the values of three are known, the fourth is also known. For example, given that these four values are the only possible ones, we know that if a person is neither student, unemployed, nor employed, he or she must be retired. In fact, because of this redundant information, linear regression and logistic regression algorithms will produce incorrect results or fail if all four variables are used. In other algorithms (e.g., k-nearest neighbors), however, we must include all four dummy variables; this is called one-hot encoding. Note also that typical methods of creating dummy variables leave the original categorical variable intact; obviously, you should not use both the original variable and the dummies. In R, we can use the fastDummies package to create dummies from a categorical (factor) variable. An example using this package is shown in Table 2.6, resulting in two new dummies for REMODEL.

Variable Selection

More is not necessarily better when it comes to selecting variables for a model. Other things being equal, parsimony, or compactness, is a desirable feature in a model. For one thing, the more variables we include, the greater the number of records we will need to assess relationships among the variables. Fifteen records may suffice to give us a rough idea of the relationship between Y and a single predictor variable X. If we now want information about the relationship between Y and 15 predictor variables, 15 records will not be enough (each estimated relationship would have an average of only one record's worth of information, making the estimate very unreliable). In addition, models based on many variables are often less robust, as they require the collection of more variables in the future, are subject to more data quality and availability issues, and require more data cleaning and preprocessing. The process of variable selection is also called feature selection.

TABLE 2.6 CREATING DUMMY VARIABLES IN R

code for creating binary dummies (indicators)

library(fastDummies)
library(tidyverse)

# create dummies for REMODEL, dropping the original column and the
# first dummy (REMODEL_None)
housing.df <- dummy_cols(housing.df, select_columns="REMODEL",
                         remove_selected_columns=TRUE, remove_first_dummy=TRUE)
housing.df %>% head(2)

Output
  TOTAL.VALUE  TAX LOT.SQFT YR.BUILT GROSS.AREA LIVING.AREA FLOORS ROOMS
1       344.2 4330     9965     1880       2436        1352      2     6
2       412.6 5190     6590     1945       3108        1976      2    10
  BEDROOMS FULL.BATH HALF.BATH KITCHEN FIREPLACE REMODEL_Old REMODEL_Recent
1        3         1         1       1         0           0              0
2        4         2         1       1         0           0              1

How Many Variables and How Much Data?

Statisticians give us procedures to learn with some precision how many records we would need to achieve a given degree of reliability with a given dataset and a given model. These are called "power calculations" and are intended to assure that an average population effect will be estimated with sufficient precision from a sample. Predictive analytics needs are usually different, because the focus is not on identifying an average effect but rather on predicting individual records. This purpose typically requires larger samples than those used for statistical inference. A good rule of thumb is to have 10 records for every predictor variable. Another rule, used by Delmaster and Hancock (2001, p. 68) for classification procedures, is to have at least $6 \times m \times p$ records, where $m$ is the number of outcome classes and $p$ is the number of variables. For example, with $m = 3$ classes and $p = 7$ predictors, this rule calls for at least $6 \times 3 \times 7 = 126$ records.

In general, compactness or parsimony is a desirable feature in a machine learning model. Even when we start with a small number of variables, we often end up with many more after creating new variables (such as converting a categorical variable into a set of dummy variables). Data visualization and dimension reduction methods help reduce the number of variables so that redundancies and information overlap are reduced.

Even when we have an ample supply of data, there are good reasons to pay close attention to the variables that are included in a model. Someone with domain knowledge (i.e., knowledge of the business process and the data) should be consulted, as knowledge of what the variables represent is typically critical for building a good model and avoiding errors and legal violations. For example, suppose we're trying to predict the total purchase amount spent by customers, and we have a few predictor columns that are coded $X_1, X_2, X_3, \ldots$, where we don't know what those codes mean. We might find that $X_1$ is an excellent predictor of the total amount spent. However, if we discover that $X_1$ is the amount spent on shipping, calculated as a percentage of the purchase amount, then obviously a model that uses shipping amount cannot be used to predict purchase amount, because the shipping amount is not known until the transaction is completed. This is known as target leakage.

Another example is if we are trying to predict loan default at the time a customer applies for a loan. If our dataset includes only information on approved loan applications, we will not have information about what distinguishes defaulters from non-defaulters among denied applicants. A model based on approved loans alone can therefore not be used to predict defaulting behavior at the time of loan application, but rather only once a loan is approved. Finally, in certain applications (e.g., credit scoring), attributes such as gender and race are legally prohibited.

Outliers

The more data we are dealing with, the greater the chance of encountering erroneous values resulting from measurement error, data-entry error, or the like. If the erroneous value is in the same range as the rest of the data, it may be harmless. If it is well outside the range of the rest of the data (a misplaced decimal, for example), it may have a substantial effect on some of the machine learning procedures we plan to use. Values that lie far away from the bulk of the data are called outliers. The term far away is deliberately left vague because what is or is not called an outlier is an arbitrary decision. Analysts use rules of thumb such as "anything over three standard deviations away from the mean is an outlier," but no statistical rule can tell us whether such an outlier is the result of an error. In this statistical sense, an outlier is not necessarily an invalid data point; it is just a distant one.

The purpose of identifying outliers is usually to call attention to values that need further review. We might come up with an explanation looking at the data—in the case of a misplaced decimal, this is likely. We might have no explanation, but know that the value is wrong—a temperature of 178°F for a sick person. We might conclude that the value is within the realm of possibility and leave it alone. Or, it might be that the outliers are what we are looking for—unusual financial transactions or travel patterns. All these are judgments best made by someone with domain knowledge, knowledge of the particular application being considered (direct mail, mortgage finance, and so on), as opposed to technical knowledge of statistical or machine learning procedures. Statistical procedures can do little beyond identifying the record as something that needs review. If manual review is feasible, some outliers may be identified and corrected. In any case, if the number of records with outliers is very small, they might be treated as missing data.

How do we inspect for outliers? One technique is to sort the records by the first column (e.g., using the R function order()), review the data for very large or very small values in that column, and then repeat for each successive column. Another option is to examine the minimum and maximum values of each column using R's min() and max() functions. For a more automated approach that considers each record as a unit, rather than each column in isolation, clustering techniques (see Chapter 16) could be used to identify clusters of one or a few records that are distant from others. Those records could then be examined.
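As a minimal sketch of this column-by-column inspection (reusing the housing.df data frame from earlier; the column checked is just for illustration):

# sort by one column and inspect the extremes for suspicious values
housing.sorted <- housing.df[order(housing.df$TOTAL.VALUE), ]
head(housing.sorted)   # records with the smallest TOTAL.VALUE
tail(housing.sorted)   # records with the largest TOTAL.VALUE

# or scan the minimum and maximum of every numeric column
numeric.cols <- sapply(housing.df, is.numeric)
sapply(housing.df[, numeric.cols], min)
sapply(housing.df[, numeric.cols], max)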

Missing Values

Typically, some records will contain missing values. If the number of records with missing values is small, those records might be omitted. However, if we have a large number of variables, even a small proportion of missing values can affect a lot of records. Even with only 30 variables, if only 5% of the values are missing (spread randomly and independently among records and variables), almost 80% of the records would have to be omitted from the analysis. (The chance that a given record would escape having a missing value is $0.95^{30} = 0.215$.)

An alternative to omitting records with missing values is to replace the missing value with an imputed value, based on the other values for that variable across all records. For example, among 30 variables, if household income is missing for a particular record, we might substitute the mean household income across all records. Doing so does not, of course, add any information about how household income affects the outcome variable. It merely allows us to proceed with the analysis and not lose the information contained in this record for the other 29 variables. Note that using such a technique will understate the variability in a dataset. However, we can assess variability and the performance of our machine learning technique using the holdout data, so this need not present a major problem. One option is to replace missing values with fairly simple substitutes (e.g., mean, median). More sophisticated procedures do exist—for example, using linear regression, based on other variables, to fill in the missing values. These methods have been elaborated mainly for the analysis of medical and scientific studies, where each patient or subject record comes at great expense. In machine learning, where data are typically plentiful, simpler methods usually suffice.

Table 2.7 shows some R code illustrating the use of the median to replace missing values with the function replace_na(). Since the data are complete to begin with, the first step is an artificial one of creating some missing records for illustration purposes. The median is used for imputation, rather than the mean, to preserve the integer nature of the counts for BEDROOMS.

TABLE 2.7 IMPUTING MISSING DATA

code for imputing missing data with median

# To illustrate missing data procedures, we first convert a few entries for
# BEDROOMS to NA's. Then we impute these missing values using the median of the
# remaining values.
rows.to.missing <- sample(row.names(housing.df), 10)
housing.df[rows.to.missing,]$BEDROOMS <- NA
summary(housing.df$BEDROOMS)   # now we have 10 NA's

# replace the missing values with the median (na.rm=TRUE ignores the NA's)
housing.df <- housing.df %>%
  replace_na(list(BEDROOMS=median(housing.df$BEDROOMS, na.rm=TRUE)))
summary(housing.df$BEDROOMS)

Output
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    3.00    3.00    3.23    4.00    9.00

Some datasets contain variables that have a very large number of missing values. In other words, a measurement is missing for a large number of records. In that case, dropping records with missing values will lead to a large loss of data. Imputing the missing values might also be useless, as the imputations are based on a small number of existing records. An alternative is to examine the importance of the predictor. If it is not very crucial, it can be dropped. If it is important, perhaps a proxy variable with fewer missing values can be used instead. When such a predictor is deemed central, the best solution is to invest in obtaining the missing data. Significant time may be required to deal with missing data, as not all situations are amenable to automated solutions. In a messy dataset, for example, a "0" might mean two things: (1) the value is missing, or (2) the value is actually zero. In the credit industry, a "0" in the "past due" variable might mean a customer who is fully paid up or a customer with no credit history at all—two very different situations. Human judgment may be required for individual cases or to determine a special rule to deal with the situation.

Normalizing (Standardizing) and Rescaling Data

Some algorithms require that the data be normalized before the algorithm can be implemented effectively. To normalize a variable, we subtract the mean from each value and then divide by the standard deviation. This operation is also sometimes called standardizing or z-transformation. In R, the function scale() performs the z-transform operation. In effect, we are expressing each value as the "number of standard deviations away from the mean," also called a z-score.

Normalizing is one way to bring all variables to the same scale. Another popular approach is rescaling each variable to a [0,1] range. This is done by subtracting the minimum value and then dividing by the range. Subtracting the minimum shifts the variable origin to zero. Dividing by the range shrinks or expands the data to the range [0,1]. This is sometimes called range transformation. In R, the different types of rescaling can be done using the function rescale() in the scales package.

To consider why normalizing or scaling to [0,1] might be necessary, consider the case of clustering. Clustering typically involves calculating a distance measure that reflects how far each record is from a cluster center or from other records. With multiple variables, different units will be used: days, dollars, counts, and so on. If the dollars are in the thousands and everything else is in the tens, the dollar variable will come to dominate the distance measure. Moreover, changing units from, say, days to hours or months could alter the outcome completely. Some machine learning software provides an option to normalize the data in those algorithms where it may be required. It is an option rather than an automatic feature of such algorithms, because there are situations where we want each variable to contribute to the distance measure in proportion to its original scale.
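As a minimal sketch of both operations (scale() is base R; rescale() is from the scales package; TAX is used purely for illustration):

# z-transform: subtract the mean, divide by the standard deviation
tax.z <- scale(housing.df$TAX)

# range transformation: subtract the minimum, divide by the range
library(scales)
tax.01 <- rescale(housing.df$TAX)   # defaults to rescaling into [0,1]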

2.5 PREDICTIVE POWER AND OVERFITTING

In supervised learning, a key question presents itself: how well will our prediction or classification model perform when we apply it to new data? We are particularly interested in comparing the performance of various models so that we can choose the one we think will do the best when it is implemented in practice. A key concept is to make sure that our chosen model generalizes beyond the dataset that we have at hand. To assure generalization, we use the concept of data partitioning and try to avoid overfitting. These two important concepts are described next.

Overfitting

The more variables we include in a model, the greater the risk of overfitting the particular data used for modeling. What is overfitting? In Table 2.8, we show hypothetical data about advertising expenditures in one time period and sales in a subsequent time period. A scatter plot of the data is shown in Figure 2.2. We could connect these points with a smooth but complicated function, one that interpolates all the data points perfectly and leaves no error (residuals). This can be seen in Figure 2.3. However, we can see that such a curve is unlikely to be accurate, or even useful, in predicting future sales on the basis of advertising expenditures. For instance, it is hard to believe that increasing expenditures from $400 to $500 will actually decrease revenue.

A basic purpose of building a model is to represent relationships among variables in such a way that this representation will do a good job of predicting future outcome values on the basis of future predictor values. Of course, we want the model to do a good job of describing the data we have, but we are more interested in its performance with future data. In the hypothetical advertising example, a simple straight line might do a better job of predicting future sales than the complex function. Instead, we devised a complex function that fits the data perfectly, and in doing so, we overreached: we ended up modeling some variation in the data that is nothing more than chance variation, mistakenly treating the noise as if it were signal.

Similarly, we can add predictors to a model to sharpen its performance with the data at hand. Consider a database of 100 individuals, half of whom have contributed to a charitable cause. Information about income, family size, and zip code might do a fair job of predicting whether or not someone is a contributor. If we keep adding additional predictors, we can improve the performance of the model with the data at hand and reduce the misclassification error to a negligible level. However, this low error rate is misleading, because it probably includes spurious effects, which are specific to the 100 individuals, but not beyond that sample. For example, one of the variables might be height. We have no basis in theory to suppose that tall people might contribute more or less to charity, but if there are several tall people in our sample and they just happened to contribute heavily to charity, our model might include a term for height—the taller you are, the more you will contribute. Of course, when the model is applied to additional data, it is likely that this will not turn out to be a good predictor.

TABLE 2.8 HYPOTHETICAL DATA ON ADVERTISING EXPENDITURES AND SUBSEQUENT SALES

Advertising    Sales
        239      514
        364      789
        602      550
        644     1386
        770     1394
        789     1440
        911     1354

FIGURE 2.2 SCATTER PLOT FOR ADVERTISING AND SALES DATA

FIGURE 2.3 OVERFITTING: THIS FUNCTION FITS THE DATA WITH NO ERROR

If the dataset is not much larger than the number of predictor variables, it is very likely that a spurious relationship like this will creep into the model. Continuing with our charity example, with a small sample just a few of whom are tall, whatever the contribution level of tall people may be, the algorithm is tempted to attribute it to their being tall. If the dataset is very large relative to the number of predictors, this is less likely to occur. In such a case, each predictor must help predict the outcome for a large number of cases, so the job it does is much less dependent on just a few cases, which might be flukes. Somewhat surprisingly, even if we know for a fact that a higher-degree curve is the appropriate model, if the model-fitting dataset is not large enough, a lower-degree function (that is not as likely to fit the noise) is likely to perform better in terms of predicting new values. Overfitting can also result from the application of many different models, from which the best-performing model is selected.
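To see this concretely, here is a minimal sketch using the Table 2.8 data (an illustration, not the book's own code): a degree-6 polynomial has as many parameters as there are data points, so it interpolates all seven points with zero residuals, while a straight line leaves residuals but is less likely to model noise.

advertising <- c(239, 364, 602, 644, 770, 789, 911)
sales <- c(514, 789, 550, 1386, 1394, 1440, 1354)

fit.line <- lm(sales ~ advertising)             # simple straight line
fit.curve <- lm(sales ~ poly(advertising, 6))   # interpolates the 7 points exactly

round(residuals(fit.curve), 8)   # essentially all zero: no error on these data
residuals(fit.line)              # nonzero residuals, but likely better generalization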

Creating and Using Data Partitions

At first glance, we might think it best to choose the model that did the best job of classifying or predicting the outcome variable of interest with the data at hand. However, when we use the same data both to develop the model and to assess its performance, we introduce an "optimism" bias. This is because when we choose the model that works best with the data, this model's superior performance comes from two sources:

A superior model
Chance aspects of the data that happen to match the chosen model better than they match other models

The latter is a particularly serious problem with techniques (such as trees and neural nets) that do not impose linear or other structure on the data, and thus end up overfitting it.

To address the overfitting problem, we simply divide (partition) our data and develop our model using only one of the partitions. After we have a model, we try it out on another partition and see how it performs, which we can measure in several ways. In a classification model, we can count the proportion of held-back records that were misclassified. In a prediction model, we can measure the residuals (prediction errors) between the predicted values and the actual values. This evaluation approach in effect mimics the deployment scenario, where our model is applied to data that it hasn't "seen."

We typically deal with two or three partitions: a training set, a validation set, and sometimes an additional holdout set. Partitioning the data into training, validation, and holdout sets is done either randomly according to predetermined proportions or by specifying which records go into which partition according to some relevant variable (e.g., in time series forecasting, the data are partitioned according to their chronological order). In most cases, the partitioning should be done randomly to minimize the chance of getting a biased partition. Note the varying nomenclature—the training partition is nearly always called "training," but the names for the other partitions can vary and overlap.

Training Partition

The training partition, typically the largest partition, contains the data used to build the various models we are examining. The same training partition is generally used to develop multiple models.

Validation Partition

The validation partition (sometimes called the test partition) is used to assess the predictive performance of each model so that you can compare models and choose the best one. In some algorithms (e.g., classification and regression trees, k-nearest neighbors), the validation partition may be used in an automated fashion to tune parameters and improve the model.

Holdout Partition

The holdout partition (sometimes called the test partition or evaluation partition) is used to assess the performance of the chosen model with new data.

Why have both a validation and a holdout partition? When we use the validation data to assess multiple models and then choose the model that performs best with the validation data, we again encounter another (lesser) facet of the overfitting problem—chance aspects of the validation data that happen to match the chosen model better than they match other models. In other words, by using the validation data to choose one of several models, the performance of the chosen model on the validation data will be overly optimistic. The random features of the validation data that enhance the apparent performance of the chosen model will probably not be present in new data to which the model is applied. Therefore, we may have overestimated the accuracy of our model. The more models we test, the more likely it is that one of them will be particularly effective in modeling the noise in the validation data. Applying the model to the holdout data, which it has not seen before, will provide an unbiased estimate of how well the model will perform with new data.

Figure 2.4 shows the three data partitions and their use in the machine learning process applied to the West Roxbury dataset. When we are concerned mainly with finding the best model and less with exactly how well it will do, we might use only training and validation partitions. Table 2.9 shows R code to partition the West Roxbury data into two sets (training and holdout) or into three sets (training, validation, and holdout). This is done by first drawing a random sample of records into the training set, then assigning the remaining records to the other partition(s). In the case of three partitions, the validation records are chosen randomly from the data after excluding the records already sampled into the training set. The function createDataPartition() from the caret package samples in such a way that the distribution of the outcome variable is reproduced in the partitions.

FIGURE 2.4 THREE DATA PARTITIONS AND THEIR ROLE IN THE MACHINE LEARNING PROCESS

TABLE 2.9 DATA PARTITIONING IN R

code for partitioning the West Roxbury data into training, (validation), and holdout sets

housing.df <- read.csv("WestRoxbury.csv") %>%
  mutate(REMODEL=factor(REMODEL))

# use set.seed() to get the same partitions when re-running the R code.
set.seed(1)

## partitioning into training (60%) and holdout (40%)
# randomly sample 60% of the row IDs for training; the remaining 40% serve as holdout
train.rows <- sample(rownames(housing.df), nrow(housing.df)*0.6)
holdout.rows <- setdiff(rownames(housing.df), train.rows)
# collect all the columns of the appropriate rows into the two data frames
train.df <- housing.df[train.rows, ]
holdout.df <- housing.df[holdout.rows, ]

## partitioning into training (50%), validation (30%), and holdout (20%)
train.rows <- sample(rownames(housing.df), nrow(housing.df)*0.5)
# draw the validation records only from records not already in the training set
valid.rows <- sample(setdiff(rownames(housing.df), train.rows),
                     nrow(housing.df)*0.3)
# the remaining records serve as holdout
holdout.rows <- setdiff(rownames(housing.df), union(train.rows, valid.rows))
train.df <- housing.df[train.rows, ]
valid.df <- housing.df[valid.rows, ]
holdout.df <- housing.df[holdout.rows, ]

## alternative: createDataPartition() reproduces the outcome distribution
library(caret)
idx <- createDataPartition(housing.df$TOTAL.VALUE, p=0.6, list=FALSE)
train.df <- housing.df[idx, ]
holdout.df <- housing.df[-idx, ]

> prediction(nn)

Data Error: 0;

$rep1

Salt Fat Like Dislike

1 0.1 0.1 0.0002415536 0.99965512

2 0.4 0.2 0.0344215787 0.96556788

3 0.5 0.2 0.1248666748 0.87816828

4 0.9 0.2 0.9349452648 0.07022732

5 0.8 0.3 0.9591361793 0.04505631

6 0.5 0.4 0.8841904620 0.12672438

FIGURE 11.4 NEURAL NETWORK FOR THE TINY EXAMPLE WITH FINAL WEIGHTS FROM R OUTPUT. VALUES INSIDE THE NODES DENOTE THE BIAS

The first $3 \times 3$ table shows the weights that connect the input layer and the hidden layer. The bias nodes (first row of the weights table) are the weights $\theta_3$, $\theta_4$, and $\theta_5$. The weights in this table are used to compute the output of the hidden layer nodes. They were computed iteratively after choosing a random initial set of weights (like the set we chose in Figure 11.3). We use the weights in the way we described earlier to compute the hidden layer's output. For instance, for the first record, the output of our previous node 3 is

$$\text{output}_3 = \frac{1}{1 + e^{-(\theta_3 + w_{1,3}x_1 + w_{2,3}x_2)}},$$

with the bias $\theta_3$ and the weights $w_{1,3}$ and $w_{2,3}$ taken from the node 3 column of the weights table. Similarly, we can compute the outputs of the two other hidden nodes (nodes 4 and 5) for the same record. The second $4 \times 2$ table gives the weights connecting the hidden and output layer nodes. To compute the probability (= propensity) for the dislike output node for the first record, we plug the outputs from the hidden layer that we computed above into the logistic function with the output-layer weights,

$$P(\textit{dislike}) = \frac{1}{1 + e^{-(\theta_{\text{dislike}} + \sum_{j=3}^{5} w_{j,\text{dislike}}\,\text{output}_j)}},$$

and get 0.07. Similarly, we can compute the probability for the like output node, obtaining the value 0.93. The probabilities for the other five records are computed in the same manner, replacing the input values in the computation of the hidden layer outputs and then plugging these outputs into the computation for the output layer. The confusion matrix based on these probabilities, using a threshold of 0.5, is given in Table 11.3. We can see that the network correctly classifies all six records.
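This computation can be reproduced directly from the fitted object. A minimal sketch (assuming nn was fit with neuralnet using logistic activations and linear.output=FALSE, and that the predictor order is Salt then Fat; the record's values are taken from the prediction output above):

# nn$weights[[1]] holds the two weight matrices; the bias is in the first row
w.hidden <- nn$weights[[1]][[1]]   # 3 x 3: bias + 2 inputs -> 3 hidden nodes
w.output <- nn$weights[[1]][[2]]   # 4 x 2: bias + 3 hidden nodes -> 2 output nodes

logistic <- function(z) 1 / (1 + exp(-z))

x <- c(0.9, 0.2)                               # Salt and Fat for the first record
hidden.out <- logistic(c(1, x) %*% w.hidden)   # outputs of the three hidden nodes
logistic(c(1, hidden.out) %*% w.output)        # propensities of the two output nodes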

TABLE 11.3 CONFUSION MATRIX FOR THE TINY EXAMPLE

code for the confusion matrix

library(caret)
# predicted class: output node with the maximum propensity (0 = like, 1 = dislike)
predicted.class <- apply(predict(nn, df), 1, which.max) - 1
confusionMatrix(as.factor(ifelse(predicted.class=="1", "dislike", "like")),
                df$Acceptance)

Confusion Matrix and Statistics

          Reference
Prediction dislike like
   dislike       3    0
   like          0    3

Accuracy : 1

Example 2: Classifying Accident Severity Let's apply the network training process to some real data: US automobile accidents that have been classified by their level of severity as no injury , injury , or fatality . A firm might be interested in developing a system for quickly classifying the severity of an accident, based on initial reports and associated data in the system (some of which rely on GPS‐assisted reporting). Such a system could be used to assign emergency response team priorities. Table 11.4 shows a small extract (10 records, 4 predictor variables) from a US government database.

TABLE 11.4 SUBSET FROM THE ACCIDENTS DATA, FOR A HIGH-FATALITY REGION

Obs.  ALCHL_I  PROFIL_I_R  SUR_COND  VEH_INVL  MAX_SEV_IR
  1      1         1          1         1          1
  2      2         1          1         1          0
  3      2         1          1         1          1
  4      1         1          1         1          0
  5      2         1          1         1          2
  6      2         0          1         1          1
  7      2         0          1         3          1
  8      2         0          1         4          1
  9      2         0          1         2          0
 10      2         0          1         2          0

The explanation of the four predictor variables and outcome variable is given in Table 11.5. For the analysis, we convert ALCHL_I to a 0/1 dummy variable (1 = presence of alcohol) and create four dummies for SUR_COND. This gives us a total of seven predictors.

TABLE 11.5 DESCRIPTION OF VARIABLES FOR AUTOMOBILE ACCIDENT EXAMPLE

ALCHL_I     Presence (1) or absence (2) of alcohol
PROFIL_I_R  Profile of the roadway: level (1), other (0)
SUR_COND    Surface condition of the road: dry (1), wet (2), snow/slush (3), ice (4), unknown (9)
VEH_INVL    Number of vehicles involved
MAX_SEV_IR  Presence of injuries/fatalities: no injuries (0), injury (1), fatality (2)

With the exception of alcohol involvement and a few other variables in the larger database, most of the variables are ones that we might reasonably expect to be available at the time of the initial accident report, before accident details and severity have been determined by first responders. A machine learning model that could predict accident severity on the basis of these initial reports would have value in allocating first responder resources.

To use a neural net architecture for this classification problem, we use seven nodes in the input layer, one for each of the seven predictors, and three neurons (one for each class) in the output layer. We use a single hidden layer and experiment with the number of nodes. If we increase the number of nodes from one to five and examine the resulting confusion matrices, we find that two nodes give a good balance between improving the predictive performance on the training set and not deteriorating the performance on the holdout set. (Networks with more than two nodes in the hidden layer performed as well as the two-node network, but add undesirable complexity.) Table 11.6 shows the R code and output for the accidents data.

Our results can depend on how we set the different parameters, and there are a few pitfalls to avoid. We discuss these next.

Avoiding Overfitting

A weakness of the neural network is that it can easily overfit the data, causing the error rate on holdout data (and most important, on new data) to be too large. It is therefore important to limit the number of training iterations and not to overtrain on the data (e.g., in R's neuralnet() function you can control the number of iterations using the argument stepmax). As in classification and regression trees, overfitting can be detected by examining the performance on the validation set, or better, on a cross-validation set, and seeing when it starts deteriorating while the training set performance is still improving. This approach is used in some algorithms to limit the number of training iterations: the validation error decreases in the early iterations of the training, but after a while, it begins to increase. The point of minimum validation error is a good indicator of the best number of iterations for training, and the weights at that stage are likely to provide the best error rate in new data.
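For instance, a minimal sketch of capping the training effort (assuming the tiny example's df; the formula, hidden-layer size, and cap of 5000 steps are illustrative choices, not prescribed values):

library(neuralnet)
# stop training after at most 5000 steps, even if the error-gradient
# threshold has not yet been reached
nn <- neuralnet(Like + Dislike ~ Salt + Fat, data=df,
                hidden=3, linear.output=FALSE, stepmax=5000)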

Using the Output for Prediction and Classification

When the neural network is used for predicting a numerical outcome variable, the resulting output needs to be scaled back to the original units of that outcome variable. Recall that numerical variables (both predictor and outcome variables) are usually rescaled to a [0,1] interval before being used by the network. The output will therefore also be on a [0,1] scale. To transform the prediction back to the original units, which were in the range $[a, b]$, we multiply the network output by $b - a$ and add $a$.

In the case of a binary outcome (two classes), we use two output nodes with a threshold value to map a predicted probability to one of the two classes. Although we typically use a threshold of 0.5 with other classifiers, in neural networks there is a tendency for values to cluster around 0.5 (from above and below). An alternative is to use the validation set to determine a threshold that produces reasonable predictive performance.
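As a one-line sketch of this back-transformation (hypothetical names: pred.01 holds the network's [0,1]-scale predictions, and a and b are the minimum and maximum of the original training outcome):

a <- min(train.df$TOTAL.VALUE)           # original outcome range, for illustration
b <- max(train.df$TOTAL.VALUE)
pred.original <- pred.01 * (b - a) + a   # back to the original units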

TABLE 11.6 A NEURAL NETWORK WITH TWO NODES IN THE HIDDEN LAYER (ACCIDENTS DATA)

code for running and evaluating a neural net on the accidents data

library(tidyverse)
library(fastDummies)
library(caret)
library(neuralnet)

# convert ALCHL_I and SUR_COND to dummy variables (remove the first dummy)
# convert outcome MAX_SEV_IR to dummy variables keeping all dummies
accidents.df <- accidents.df %>%
  dummy_cols(select_columns=c("ALCHL_I", "SUR_COND"),
             remove_selected_columns=TRUE, remove_first_dummy=TRUE) %>%
  dummy_cols(select_columns=c("MAX_SEV_IR"),
             remove_selected_columns=TRUE)

# partition the data
set.seed(1)
idx <- createDataPartition(accidents.df$MAX_SEV_IR_1, p=0.6, list=FALSE)
train.df <- accidents.df[idx, ]
holdout.df <- accidents.df[-idx, ]

# train a neural net with two nodes in the single hidden layer
nn <- neuralnet(MAX_SEV_IR_0 + MAX_SEV_IR_1 + MAX_SEV_IR_2 ~ .,
                data=train.df, hidden=2, linear.output=FALSE)

# classify each training record using the output node with the maximum score
train.prediction <- apply(predict(nn, train.df), 1, which.max) - 1
train.actual <- apply(train.df %>% select(starts_with("MAX_SEV_IR")),
                      1, which.max) - 1
confusionMatrix(factor(train.prediction), factor(train.actual))