Table of contents:
Acknowledgments
Chapter 1: Data Mining and Business Data Mining Algorithms and Activities Data is the New Oil Data-Driven Decision-Making Business Analytics and Business Intelligence Algorithmic Technologies Associated with Data Mining Data Mining and Data Warehousing Case Study 1.1: Business Applications of Data Mining Case A – Classification Case B – Regression Case C – Anomaly Detection Case D – Time Series Case E – Clustering Reference
Chapter 2: The Data Mining Process Data Mining as a Process Exploration Analysis Interpretation Exploitation Selecting a Data Mining Process The CRISP-DM Process Model Business Understanding Data Understanding Data Preparation Modeling Evaluation Deployment Selecting Data Analytics Languages The Choices for Languages References
Chapter 3: Framing Analytical Questions How Does CRISP-DM Define the Business and Data Understanding Step? The World of the Business Data Analyst How Does Data Analysis Relate to Business Decision-Making? How Do We Frame Analytical Questions? What Are the Characteristics of Well-framed Analytical Questions? Exercise 3.1 – Framed Questions About the Titanic Disaster Case Study 3.1 – The San Francisco Airport Survey Case Study 3.2 – Small Business Administration Loans References
Chapter 4: Data Preparation How Does CRISP-DM Define Data Preparation? Steps in Preparing the Data Set for Analysis Data Sources and Formats What is Data Shaping?
The Flat-File Format Application of Tools for Data Acquisition and Preparation Exercise 4.1 – Shaping the Data File Exercise 4.2 – Cleaning the Data File Ensuring the Right Variables are Included Using SQL to Extract the Right Data Set from Data Warehouses Case Study 4.1: Cleaning and Shaping the SFO Survey Data Set Case Study 4.2: Shaping the SBA Loans Data Set Case Study 4.3: Additional SQL Queries Reference
Chapter 5: Descriptive Analysis Getting a Sense of the Data Set Describe the Data Set Explore the Data Set Verify the Quality of the Data Set Analysis Techniques to Describe the Variables Exercise 5.1 – Descriptive Statistics Distributions of Numeric Variables Correlation Exercise 5.2 – Descriptive Analysis of the Titanic Disaster Data Case Study 5.1: Describing the SFO Survey Data Set Solution Using R Solution Using Python Case Study 5.2: Describing the SBA Loans Data Set Solution Using R Solution Using Python Reference
Chapter 6: Modeling What is a Model? How Does CRISP-DM Define Modeling? Selecting the Modeling Technique Modeling Assumptions Generate Test Design Design of Model Testing Build the Model Parameter Setting Models Model Assessment Where Do Models Reside in a Computer? The Data Mining Engine The Model Data Sources and Outputs Traditional Data Sources Static Data Sources Real-Time Data Sources Analytic Outputs Model Building Step 1: Framing Questions Step 2: Selecting the Machine Step 3: Selecting Known Data Step 4: Training the Machine Step 5: Testing the Model Step 6: Deploying the Model Step 7: Collecting New Data Step 8: Updating the Model Step 9: Learning – Repeat Steps 7 and 8 Step 10: Recommending Answers to the User Reference
Chapter 7: Predictive Analytics with Regression Models What is Supervised Learning?
Regression to the Mean Linear Regression Simple Linear Regression The R-squared Coefficient The Use of the p-value of the Coefficients Strength of the Correlation Between Two Variables Exercise 7.1 – Using SLR Analysis to Understand Franchise Advertising Multivariate Linear Regression Preparing to Build the Multivariate Model Exercise 7.2 – Using Multivariate Linear Regression to Model Franchise Sales Logistic Regression What is Logistic Regression? Exercise 7.3 – PassClass Case Study Multivariate Logistic Regression Exercise 7.4 – MLR Used to Analyze the Results of a Database Marketing Initiative Where is Logistic Regression Used? Comparing Linear and Logistic Regressions for Binary Outcomes Case Study 7.1: Linear Regression Using the SFO Survey Data Set Solution in R Solution in Python Case Study 7.2: Linear Regression Using the SBA Loans Data Set Solution in R Solution in Python Case Study 7.3: Logistic Regression Using the SFO Survey Data Set Solution in R Solution in Python Case Study 7.4: Logistic Regression Using the SBA Loans Data Set Solution in R Solution in Python
Chapter 8: Classification Classification with Decision Trees Building a Decision Tree Exercise 8.1 – The Iris Data Set The Problem with Decision Trees Classification with Random Forest Using a Random Forest Model Exercise 8.2 – The Iris Data Set Classification with Naïve Bayes Exercise 8.3 – The HIKING Data Set Computing the Conditional Probabilities Case Study 8.1: Classification with the SFO Survey Data Set Solution in R Solution in Python Case Study 8.2: Classification with the SBA Loans Data Set Solution in R Solution in Python Case Study 8.3: Classification with the Florence Nightingale Data Set Solution in Python Reference
Chapter 9: Clustering What is Unsupervised Machine Learning? What is Clustering Analysis?
Applying Clustering to Old Faithful Eruptions Examples of Applications of Clustering Analysis A Simple Clustering Example Using Regression Hierarchical Clustering Applying Hierarchical Clustering to Old Faithful Eruptions Exercise 9.1 – Hierarchical Clustering and the Iris Data Set K-Means Clustering How Does the K-Means Algorithm Compute Cluster Centroids? Applying K-Means Clustering to Old Faithful Eruptions Exercise 9.2 – K-Means Clustering and the Iris Data Set Hierarchical vs. K-Means Clustering Case Study 9.1: Clustering with the SFO Survey Data Set Solution in R Solution in Python Case Study 9.2: Clustering with the SBA Loans Data Set Solution in R Solution in Python
Chapter 10: Time Series Forecasting What is a Time Series? Time Series Analysis Types of Time Series Analysis What is Forecasting? Exercise 10.1 – Analysis of the US and China GDP Data Set Case Studies Case Study 10.1: Time Series Analysis of the SFO Survey Data Set Solution in Excel Case Study 10.2: Time Series Analysis of the SBA Loans Data Set Solution in R Solution in Python Case Study 10.3: Time Series Analysis of a Nest Data Set Solution in Python Reference
Chapter 11: Feature Selection Using the Covariance Matrix Factor Analysis When to Use Factor Analysis First Step in FA – Correlation FA for Exploratory Analysis Selecting the Number of Factors – The Scree Plot Example 11.1: Restaurant Feedback Factor Interpretation Summary Activities to Perform a Factor Analysis Case Study 11.1: Variable Reduction with the SFO Survey Data Set Solution in R Solution in Python Case Study 11.2: Hunting Diamonds Solution in R Solution in Python
Chapter 12: Anomaly Detection What is an Anomaly? What is an Outlier?
The Case Studies for the Exercises in Anomaly Detection Anomaly Detection by Standardization – A Single Numerical Variable Exercise 12.1 – Outliers in the Airline Delays Data Set – Z-Score Anomaly Detection by Quartiles – Tukey Fences – With a Single Variable Comparing Z-scores and Tukey Fences Exercise 12.2 – Outliers in the Airline Delays Data Set – Tukey Fences Anomaly Detection by Category – A Single Variable Exercise 12.3 – Outliers in the Airline Delays Data Set – Categorical Anomaly Detection by Clustering – Multiple Variables Exercise 12.4 – Outliers in the Airline Delays Data Set – Clustering Anomaly Detection Using Linear Regression by Residuals – Multiple Variables Exercise 12.5 – Outliers in the Airline Delays Data Set – Residuals Case Study 12.1: Outliers in the SFO Survey Data Set Solution in R Solution in Python Case Study 12.2: Outliers in the SBA Loans Data Set Solution in R Solution in Python References
Chapter 13: Text Data Mining What is Text Data Mining? What are Some Examples of Text-Based Analytical Questions? Tools for Text Data Mining Sources and Formats of Text Data Term Frequency Analysis How Does It Apply to Text Business Data Analysis? Exercise 13.1 – Case Study Using a Training Survey Data Set Word Frequency Analysis Using R Keyword Analysis Exercise 13.2 – Case Study Using Data Set D: Résumé and Job Description Keyword Word Analysis in Voyant Term Frequency Analysis in R Visualizing Text Data Exercise 13.3 – Case Study Using the Training Survey Data Set Visualizing the Text Using Excel Visualizing the Text Using Voyant Visualizing the Text Using R Text Similarity Scoring What is Text Similarity Scoring?
Exercise 13.4 – Case Study Using the Occupation Description Data Set Analysis Using an Online Text Similarity Scoring Tool Similarity Scoring Analysis Using R Exercise 13.5 – Résumé and Job Descriptions Similarly Scoring Using R Case Study 13.1 – Term Frequency Analysis of Product Reviews Term Frequency Analysis Using Voyant Term Frequency Analysis Using R References
Chapter 14: Working with Large Data Sets Using Sampling to Work with Large Data Files Exercise 14.1 – Big Data Analysis Case Study 14.1 Using the BankComplaints Big Data File