English. 395 pages [381]. 2023.
Walter R. Paczkowski
Predictive and Simulation Analytics Deeper Insights for Better Business Decisions
Walter R. Paczkowski Data Analytics Corp. Plainsboro, NJ, USA
ISBN 978-3-031-31886-3    ISBN 978-3-031-31887-0 (eBook)
https://doi.org/10.1007/978-3-031-31887-0
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface
My writing and research agenda focuses on data science methods for providing business decision makers with Rich Information for their data-driven decisions. This book continues and expands that agenda. Its topic is the melding of two seemingly disparate methods, Predictive Analytics and Simulation Analytics, to capture a synergistic effect (i.e., increasing returns) that yields even richer Rich Information.

My previous book, Business Analytics (Paczkowski, 2022b), provides the foundations for "Deep Data Analytics" to extract Rich Information from data. A theme of that book is that information is needed for business decisions, but it is buried inside data: data are not information, but contain information. A second theme is that statistics, econometrics, and machine learning methods are the tools for extracting information from data. Predictive Analytics is a sub-theme: these methods enable predictions of the likely outcomes of business decisions. The predictions are part of the Rich Information.

This book takes the next step and penetrates further into "Deep Data Analytics." In my previous writings, I never discussed how data are generated, only mentioning that they are found in data warehouses, data stores, data lakes, or data marts. These are data locations, not the data generation process (DGP). A first theme of this book is that the DGP is a system, either simple or complex. Simply stated, a system is a series of interconnected and interrelated parts that function as a synergistic whole: the parts produce more together than they do separately. Something new emerges from the system. This is important because if one part falters, the system slows or fails, with wide-ranging impacts. An action on one part of the system has implications and ramifications beyond that part, and this will be reflected in the data generated by that system.
As an example, manufacturing a product, the classic widget for instance, involves many operations: customer demand generation and recording, to specify how much to produce; raw material purchasing, to enable the production; operational steps to transform the raw material into the final product; head-count sizing and scheduling to meet production; shipment of the final product; and billing and accounting operations to collect and record the revenue from the sale, to mention a few. Each of these, except customer demand per se, has a cost effect. Data are generated by each stage and by the system as a whole. The optimal business decision must incorporate all these data.

The interconnections and interrelations are the second theme of this book. Regardless of a system's complexity, a managerial decision based on a prediction about one aspect of a system has implications and ramifications for its other parts, some of which may be undesirable. The only way to grasp the full impact and scope of a decision based on Predictive Analytics of one component of the system is by simulating the entire system. This is the third theme of this book.

Consider the manufacturing process again. Customer demand is one component of this system. A managerial decision regarding the price of the product will affect this demand. A 1% price cut, for example, may be predicted to increase sales 2%. This is the Predictive Analytics implication for sales and revenue. In many instances, the analytics stops here. The ramification, however, is the effect on the supply chain for raw materials, the manufacturing throughput process, the personnel requirements, and the delivery queues. Each of these has cost implications. Simply predicting a 2% increase in sales ignores the full impact on the entire business system. Only Simulation Analytics, coupled with the Predictive Analytics of customer demand that drove the pricing decision, will allow business decision makers to see the entire "picture." The whole picture is the Rich Information. Rich Information based only on predictions about sales and revenue is partial; business decision makers need its entire scope.

Simulation Analytics has been treated and viewed as separate and distinct from Predictive Analytics; that is, the two are considered substitutes for each other. They are, instead, complements that, when used together, provide the entire Rich Information business decision makers need.
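The pricing example above is, on the prediction side, just a constant-elasticity calculation. A minimal sketch in Python makes the arithmetic explicit (the own-price elasticity of -2 and all figures are illustrative assumptions, not results from the book):

```python
# Illustrative sketch of the preface's pricing example: a 1% price cut
# predicted to lift unit sales by 2%, i.e., an assumed own-price
# elasticity of demand of -2. This shows only the Predictive Analytics
# slice of the picture, not the system-wide cost ramifications.

def predicted_sales_change(price_change_pct: float, elasticity: float) -> float:
    """Percent change in unit sales implied by a constant elasticity."""
    return elasticity * price_change_pct

price_change = -1.0   # a 1% price cut
elasticity = -2.0     # assumed own-price elasticity of demand
sales_change = predicted_sales_change(price_change, elasticity)

# Approximate revenue effect: revenue index = (1 + dp)(1 + dq),
# so a 1% cut with a 2% sales lift raises revenue about 1%.
revenue_index = (1 + price_change / 100) * (1 + sales_change / 100)

print(f"Predicted sales change: {sales_change:+.1f}%")      # +2.0%
print(f"Revenue relative to baseline: {revenue_index:.4f}") # 1.0098
```

This is where the analytics usually stops; the supply-chain, throughput, staffing, and delivery-queue consequences of that 2% lift are exactly what the simulation side of the book is meant to capture.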
The Deep Data Analytics I discussed in my Business Analytics book thus consists of two tightly interwoven parts: Predictive Analytics and Simulation Analytics. The focus of this book, as a continuation of my previous one, is the melding of these two complementary Deep Data Analytics methodologies to provide the entire scope of Rich Information to business decision makers.
The Target Audience

This book targets you if you are a business data analyst, data scientist, or market research professional, or aspire to be any of these, in the private sector. You would be involved in, or be responsible for, a myriad of quantitative analyses for business problems such as, but not limited to:

• Demand measurement and forecasting
• Predictive modeling
• Pricing analytics, including elasticity estimation
• Customer satisfaction assessment
• Market and advertisement research
• New product development and research
You are also someone who needs to know basic data analytical methods and some advanced methods, including data handling and management. This book will provide you with this needed background by:

• Explaining the intuition underlying analytic concepts
• Developing the mathematical and statistical analytic concepts
• Demonstrating analytical concepts using a programming language in Jupyter notebooks
• Illustrating analytical concepts with case studies

This book is also suitable for colleges and universities offering courses and certifications in business data analytics, data science, and market research. It could be used as a primary or supplemental textbook. Although business applications are emphasized, public policy audiences are not overlooked. Decision makers in the public domain must also use Predictive Analytics as the basis for their decisions, and they are certainly responsible for large, complex systems such as the entire economy. So the concepts I will discuss are also applicable to them.

Since the target audience consists of current or aspiring business data analysts and data scientists, I assume you have, or are developing, a basic understanding of fundamental statistics at the "Stat 101" level: descriptive statistics, hypothesis testing, and regression analysis. While not required, knowledge of econometric and market research principles would be beneficial. In addition, a level of comfort with calculus is recommended, but not required. Appendices will provide background as needed. My previous book (Paczkowski, 2022b) also provides the needed background.
The Book's Competitive Comparison

Many books on the market discuss the two sub-themes of this book, Predictive Analytics and Simulation Analytics, but they treat them separately rather than as a synergistic, analytic whole. In this book, I present the three themes together so you can more easily master what is needed for your work.
The Book's Structure

I have divided this book into four parts:

Part I: The Analytics Quest: The Drive for Rich Information
Part II: Predictive Analytics: Background
Part III: Simulation Analytics: Background
Part IV: Melding the Two Analytics
A synopsis of each part follows.

Part I: The Analytics Quest: The Drive for Rich Information. This first part reviews and summarizes the distinction between Poor and Rich Information and the need for the latter for business decisions. It introduces Predictive Analytics and Simulation Analytics. The latter is usually referred to simply as "simulations," a term that does not fully capture the complexity and depth of what is involved in simulating something. The "something" is a system of interconnected parts, which can be simply interwoven or combined in a very complex manner. More is involved in grasping a system, hence my use of "Simulation Analytics." This first part introduces the two analytical paradigms and the notion of systems, and shows how predictions and simulations work together in a synergistic fashion to extract the richest Rich Information about the system.

A major concept that emerged for me while writing this part is the scale-view of the decision maker. I did not originally plan on this; it just appeared. The scale-view refers to the scope of the environment the decision maker sees, functions in, and makes decisions about. Prediction methods, as typically presented, have no regard for scale-view: the methods themselves are the focus, not their real application, as if one size fits all. This needs to be corrected, because the application depends on the scale-view. I carry this scale-view concept throughout the book.

Part II: Predictive Analytics: Background. This second part discusses the development of Predictive Analytics methodologies. After reading it, you should be able to conduct most prediction tasks typical in a business context. Those who have my previous book, Paczkowski (2022b), or who have econometrics training will be familiar with this material. The methods are divided into three chapters: basic time series predictions, advanced time series predictions, and non-time series predictions. Each chapter is only meant to introduce concepts; each could fill its own book.

Part III: Simulation Analytics: Background. This third part sets the stage for Simulation Analytics. The background needed for simulations is most likely new to most readers of this book, so this part is devoted to developing that background material. Chapters cover the design and analysis of simulations; random number generation, the backbone of simulations; and examples with Monte Carlo simulations. At the end of this part, you will have the essential knowledge to understand and begin using simulations in your work.
Part IV: Melding the Two Analytics. This last part is the heart of the book. It brings the two approaches together and provides examples, organized by the scale-view of the decision maker.

Plainsboro, NJ, USA
Walter R. Paczkowski
Acknowledgments
As with all my previous books, I'm grateful for the support and encouragement I received from my wonderful wife, Gail, and my two daughters, Kristin and Melissa. As always, Gail encouraged me to sit down and just write, especially when I did not want to. She was always extremely patient with me, especially in the last few days before I sent the final manuscript to my editor. And my daughters provided the extra set of eyes I needed to make this book perfect. They provided the same support and encouragement for this book, so I owe them a lot, both then and now. I would also like to say something about my two grandsons who, now at 7 and 12, obviously did not contribute to this book but who, I hope, will look at this one (and the others I've written and will write) in their adult years and say, "Yup, grandpa wrote this one, too."
Contents
Part I: The Analytics Quest: The Drive for Rich Information

1 Decisions, Information, and Data
  1.1 Decisions and Uncertainty
    1.1.1 What Is Uncertainty?
    1.1.2 The Cost of Uncertainty
    1.1.3 Reducing Uncertainty
    1.1.4 The Scale-View of Decision Makers
    1.1.5 Rich Information Requirements
  1.2 A Data and Information Framework
  1.3 Rich Information Predictive Extraction Methods
    1.3.1 Informal Analytical Components
    1.3.2 Formal Analytical Components
  1.4 A Systems Perspective
  1.5 This Book's Focus

2 A Systems Perspective
  2.1 Introduction to Complex Systems
  2.2 Types of Systems: Examples
    2.2.1 Economic Complex Systems
    2.2.2 Business Complex Systems
    2.2.3 Other Types of Complex Systems
  2.3 Predictions, Forecasts, and Business Complex Systems
  2.4 System Complexity and Scale-View
  2.5 Simulations and Scale-View

Part II: Predictive Analytics: Background

3 Information Extraction: Basic Time Series Methods
  3.1 Overview of Extraction Methods
  3.2 Predictions as Time Series
  3.3 Time Series and Forecasting Notation
  3.4 The Backshift Operator: An Overview
  3.5 Naive Forecasting Models
  3.6 Constant Mean Model
    3.6.1 Properties of a Variance
    3.6.2 h-Step Ahead Forecasts
  3.7 Random Walk Model
    3.7.1 Basic Random Walk Model
    3.7.2 Random Walk with Drift
  3.8 Simple Moving Averages Model
    3.8.1 Weighted Moving Average Model
    3.8.2 Exponential Averaging
  3.9 Linear Trend Models
    3.9.1 Linear Trend Model Estimation
    3.9.2 Linear Trend Extension
    3.9.3 Linear Trend Prediction
  3.10 Appendix
    3.10.1 Reproductive Property of Normals
    3.10.2 Proof of MSE = V(θ̂) + Bias²
    3.10.3 Backshift Operator Result
    3.10.4 Variance of h-Step Ahead Random Walk Forecast
    3.10.5 Exponential Moving Average Weights
    3.10.6 Flat Exponential Averaging Forecast
    3.10.7 Variance of a Random Variable
    3.10.8 Background on the Exponential Growth Model

4 Information Extraction: Advanced Time Series Methods
  4.1 The Breadth of Time Series Data
  4.2 Introduction to Linear Predictive Models
    4.2.1 Feature Specification
  4.3 Data Preprocessing
  4.4 Model Fit vs. Predictability
  4.5 Case Study: Predicting Total Vehicle Sales
    4.5.1 Modeling Data: Overview
    4.5.2 Modeling Data: Some Analysis
    4.5.3 Linear Model for New Car Sales
  4.6 Stochastic (Box-Jenkins) Time Series Models
    4.6.1 Model Identification
    4.6.2 Brief Introduction to Stationarity
    4.6.3 Correcting for Non-stationarity
    4.6.4 Predicting with the AR(1) Model
  4.7 Advanced Time Series Models
  4.8 Autoregressive Distributed Lag Models
    4.8.1 Short-Run and Long-Run Effects
  4.9 Appendix
    4.9.1 Chow Test Functions

5 Information Extraction: Non-Time Series Methods
  5.1 Types of Surveys
  5.2 Discrete Choice Analysis
    5.2.1 Discrete Choice Model Extensions
    5.2.2 Types of Discrete Choice Studies
    5.2.3 Discrete Choice Experimental Designs
    5.2.4 Discrete Choice Estimation
    5.2.5 Discrete Choice Example
  5.3 Purchase Intent Analysis
    5.3.1 Purchase Intent Survey Question
    5.3.2 The Logistic Regression
    5.3.3 Purchase Intent Study Design
    5.3.4 Purchase Intent Estimation
    5.3.5 Purchase Intent Example
  5.4 Choice Predictions
  5.5 Other Non-Time Extraction Methods
    5.5.1 Decision Trees
    5.5.2 Artificial Neural Networks
  5.6 Appendix
    5.6.1 Sum of Probabilities

6 Useful Life of a Predictive Model
  6.1 Predictive Modeling Infrastructure
  6.2 Examples of Useful Life of a Predictive Model
    6.2.1 Five-Year Business Plans
    6.2.2 New Product Development
    6.2.3 Daily Operations
  6.3 Real-Time Prediction: Modifying the Infrastructure
    6.3.1 A Real-Time Predictive Structure
    6.3.2 Example: Real-Time Predictive Analytics
  6.4 Summary of Timing of Predictive Models

Part III: Simulation Analytics: Background

7 Introduction to Simulations
  7.1 Overview of Science-Technology Revolutions
  7.2 Simulation Analytics
    7.2.1 What Is a Simulation?
    7.2.2 Simulations and Decision-Making
    7.2.3 Simulations and Virtual Realities
    7.2.4 Simulations and Uncertainty Reduction
    7.2.5 Simulation Applications: General Focus
    7.2.6 Simulation Applications: Business Focus
  7.3 Types of Simulations
  7.4 A Family of Stochastic Simulators
    7.4.1 System Dynamics Simulation
    7.4.2 Monte Carlo Simulation
    7.4.3 Discrete-Event Simulation
    7.4.4 Agent-Based Simulations
    7.4.5 Continuous Simulations
    7.4.6 Hybrid Simulations
  7.5 Simulations vs. Experiments
  7.6 Simulation Applications

8 Designing and Analyzing a Simulation
  8.1 Simulator Design Concerns
    8.1.1 Level of Simulator Complexity
    8.1.2 Validity and Verification
    8.1.3 Simulator Calibration
  8.2 Designing a Simulator: More Details
    8.2.1 Process Understanding
    8.2.2 Scenario Development
    8.2.3 Experimental Designs
    8.2.4 Simulation Termination
    8.2.5 Steady-State Attainment
    8.2.6 Number of Replications of Each Scenario
  8.3 Simulator Results Analysis for Rich Information
    8.3.1 Simulation Data Arrangement

9 Random Numbers: The Backbone of Stochastic Simulations
  9.1 What Is a Random Number?
  9.2 Random Number Generation
    9.2.1 Digression: Desirable Generator Properties
    9.2.2 Natural Generators
    9.2.3 Digital Generators
    9.2.4 Algorithmic Generators
  9.3 Generating Random Variates for a Distribution
    9.3.1 The Uniformly Distributed Random Numbers
    9.3.2 Generating Distributions from the Uniform
    9.3.3 Random Variates for the Normal Distribution
  9.4 Using the random Package
    9.4.1 Random Seed
    9.4.2 Random Integers
    9.4.3 Random Choice from a List
    9.4.4 Random Sampling
    9.4.5 Gaussian (Normal) Random Variate
  9.5 Using the Numpy and SciPy Packages
    9.5.1 Random Seed
    9.5.2 Random Integers
    9.5.3 Random Choice from a List
    9.5.4 Gaussian (Normal) Random Variate
    9.5.5 Distributions
    9.5.6 Comparing Numpy and Random for the Exponential Distribution
    9.5.7 Features of Distributions
    9.5.8 Summary of Seed Specification
  9.6 Applications of Random Numbers
  9.7 Appendix
    9.7.1 Review of Modulo Arithmetic
    9.7.2 Mersenne Prime Numbers

10 Examples of Stochastic Simulations: Monte Carlo Simulations
  10.1 A Brief History of Monte Carlo Simulations
  10.2 Structure of a Monte Carlo Experiment
    10.2.1 Designing a Monte Carlo Simulation
  10.3 Monte Carlo Examples
    10.3.1 Use-Case 1: Coin Toss
    10.3.2 Use-Case 2: Random Walk
    10.3.3 Use-Case 3: Estimators
    10.3.4 Use-Case 4: Central Limit Theorem
    10.3.5 Use-Case 5: Integration
  10.4 Appendix
    10.4.1 Using Symbolic Math

Part IV: Melding the Two Analytics

11 Melding Predictive and Simulation Analytics
  11.1 A Framework
  11.2 Melding Process Overview
  11.3 Three Scale Views
    11.3.1 Operational Scale View
    11.3.2 Tactical Scale View
    11.3.3 Strategic Scale View

12 Applications: Operational Scale View
  12.1 Application I: A Queueing Problem
    12.1.1 Anatomy of a Queue
    12.1.2 A Note on the Exponential Distribution in Python
    12.1.3 Queue Performance Measures
    12.1.4 Restriction on Arrival and Service Rates
    12.1.5 Queueing Theory Notation
    12.1.6 Illustrative Queueing Process
    12.1.7 Determining the Interarrival and Interservice Rates
    12.1.8 Queueing Example
    12.1.9 A Critical Assumption
    12.1.10 Queueing Simulation
  12.2 Application II: Linear Programming Problem
  12.3 Appendix
    12.3.1 Poisson and Exponential Distribution Relationship
    12.3.2 Maximum Likelihood Estimator of λ
    12.3.3 Mean and Variance of the Exponential Distribution
    12.3.4 Using Simpy

13 Applications: Tactical and Strategic Scale Views
  13.1 Tactical Scale View Applications
    13.1.1 Tactical Application I: Pricing
    13.1.2 Tactical Application II: Reducing Churn
  13.2 Strategic Scale View Applications
339 339 339 346 351
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
List of Figures
Fig. 1.1   Illustration of Kant's Knowledge Foundation ... 6
Fig. 1.2   Illustration of the Data—Uncertainty Connection ... 6
Fig. 1.3   Information/Uncertainty Relationships ... 7
Fig. 1.4   Components of Rich Information ... 8
Fig. 1.5   The Basic Components of a System ... 11
Fig. 1.6   Example Noise ... 16
Fig. 1.7   Normal Distribution Curve ... 17
Fig. 1.8   A Data Paradigm Grid ... 18
Fig. 1.9   Information Extraction Process ... 19
Fig. 1.10  Information Extraction Methods ... 20
Fig. 2.1   Information Extraction Framework ... 28
Fig. 2.2   Feedback Loop ... 31
Fig. 2.3   Network Chart of Two Systems ... 34
Fig. 2.4   Circular Flow in Economics ... 36
Fig. 2.5   Simple Feedback Loop System ... 38
Fig. 2.6   Simple Feedback Loop System with Prediction Input ... 39
Fig. 2.7   Time Adjustment Patterns for System State Variables ... 40
Fig. 2.8   Simple Feedback Loop System with Prediction and Forecast Input ... 41
Fig. 2.9   The Inverse Complexity-Scale Relationship ... 42
Fig. 2.10  Simple Feedback Loop System with Simulations ... 46
Fig. 3.1   h-Steps Ahead Naive Forecasts ... 55
Fig. 3.2   The Constant Mean Model ... 56
Fig. 3.3   Moving Average Example ... 67
Fig. 3.4   Exponential Averaging Weights ... 71
Fig. 3.5   Variance of One-Step Ahead Forecast ... 73
Fig. 3.6   Exponential Smoothing Fit Example ... 74
Fig. 3.7   Exponential Smoothing Forecast Example ... 75
Fig. 3.8   Example of Prediction Calculations ... 76
Fig. 3.9   Example Non-Linear Data ... 78
Fig. 3.10  Example of Linear Trend Data ... 80
Fig. 3.11  Example Estimation of Linear Trend Data ... 81
Fig. 3.12  Example ANOVA Table Linear Trend Model ... 82
Fig. 3.13  Example Prediction of Linear Trend Data ... 85
Fig. 3.14  Prediction Intervals Summary Table ... 86
Fig. 3.15  Prediction Interval Calculations ... 87
Fig. 3.16  Constant Mean Regression Model ... 89
Fig. 3.17  Constant Mean Model Prediction ... 90
Fig. 4.1   Residual Autocorrelation Plots: Time Plots ... 103
Fig. 4.2   Residual Autocorrelation Plots: Lagged Plot ... 104
Fig. 4.3   Real Interest Rate on New Car Loans ... 112
Fig. 4.4   Smoothed Real Interest Rate on New Car Loans ... 112
Fig. 4.5   Real Price of New Cars ... 113
Fig. 4.6   Real Disposable Personal Income ... 114
Fig. 4.7   New Car Inventory ... 115
Fig. 4.8   Real Price of New and Used Cars ... 115
Fig. 4.9   New Car Sales Structural Break ... 116
Fig. 4.10  Example of Structural Break in a Time Series ... 117
Fig. 4.11  Chow Test Results ... 118
Fig. 4.12  Code for Dummy Variables ... 118
Fig. 4.13  Regression Results for Domestic New Car Sales ... 119
Fig. 4.14  AutoRegression Correction for Domestic New Car Sales ... 119
Fig. 4.15  Regression Results for Domestic New Car Sales: Lagged Dependent Variable ... 120
Fig. 4.16  Durbin's h-Test ... 121
Fig. 4.17  Example ACF and PACF Functions ... 123
Fig. 4.18  Example Time Series Patterns ... 124
Fig. 4.19  First Differencing of Time Series Data ... 127
Fig. 4.20  First Differenced Series ... 129
Fig. 4.21  ARDL Setup in Statsmodels ... 131
Fig. 4.22  ARDL Regression Results ... 132
Fig. 4.23  Chow Test Functions ... 135
Fig. 5.1   Illustrative Market Research Interconnections: Narrow Scale-View ... 140
Fig. 5.2   Illustrative Market Research Interconnections: Wide Scale-View ... 140
Fig. 5.3   Nested Choice Problem: Case I ... 145
Fig. 5.4   Nested Choice Problem: Case II ... 146
Fig. 5.5   R Script for Discrete Example ... 151
Fig. 5.6   Comparison of Data Layouts ... 151
Fig. 5.7   Python Script to Manage Discrete Choice Data ... 153
Fig. 5.8   Discrete Choice Data ... 154
Fig. 5.9   Estimated Discrete Choice Model ... 155
Fig. 5.10  Purchase Intent Data Generator ... 159
Fig. 5.11  Purchase Intent Logit Estimation Set-up ... 160
Fig. 5.12  Purchase Intent Logit Estimation Results ... 161
Fig. 5.13  Example Decision Tree Structure ... 166
Fig. 6.1   Predictive Modeling Infrastructure ... 170
Fig. 6.2   A Real-time Prediction Configuration ... 171
Fig. 6.3   The Business Planning Cycle ... 172
Fig. 6.4   A Real-time Prediction Configuration ... 177
Fig. 6.5   Schematic of a WSN ... 179
Fig. 6.6   Stylized Data Cube ... 180
Fig. 6.7   Summary of Predictive Model Updates ... 182
Fig. 7.1   Science-Technological Revolutions ... 186
Fig. 7.2   Uncertainty Quantification and Simulator Complexity Trade-off ... 191
Fig. 7.3   Predictive Accuracy and Simulator Complexity ... 191
Fig. 7.4   Classifications of Simulations ... 194
Fig. 7.5   Simulator Family Tree ... 197
Fig. 7.6   Simulation Uses ... 203
Fig. 8.1   Feedback Loop ... 212
Fig. 8.2   Time Adjustment Patterns to a Steady-State ... 218
Fig. 8.3   Example Simulation Data Cube ... 221
Fig. 8.4   Simulation Replication Data Array ... 223
Fig. 9.1   Categorization of Random Number Generators ... 226
Fig. 9.2   Different Random Number Generators ... 229
Fig. 9.3   von Neumann Squaring Method Set-up ... 232
Fig. 9.4   Comparison of von Neumann Squaring Method Random Numbers ... 232
Fig. 9.5   Fibonacci Numbers ... 234
Fig. 9.6   Factors of the Fibonacci Numbers ... 235
Fig. 9.7   Fibonacci Random Number Generator ... 236
Fig. 9.8   The Linear Congruential Generator ... 237
Fig. 9.9   Number of Seconds One Day from the Epoch ... 238
Fig. 9.10  Number of Seconds from the Epoch ... 238
Fig. 9.11  Simplified Linear Congruential Generator with Small Periodicity ... 239
Fig. 9.12  Flow Chart of Random Variate Generation ... 241
Fig. 9.13  The Uniform Distribution ... 242
Fig. 9.14  Cumulative Uniform Distribution ... 243
Fig. 9.15  Standardized Normal Density ... 246
Fig. 9.16  Cumulative Normal Distribution ... 247
Fig. 9.17  Finding a Quantile Using the Inverse of the Cumulative Normal Distribution ... 248
Fig. 9.18  Standardized Normal Table ... 249
Fig. 9.19  Generating Random Integers ... 250
Fig. 9.20  Generating Random Choice ... 251
Fig. 9.21  Generating Random Samples ... 252
Fig. 9.22  Generating Gaussian Random Variate ... 253
Fig. 9.23  Comparing Exponential Distribution Functions ... 256
Fig. 9.24  Examples of SciPy Probability-based Functions ... 257
Fig. 9.25  Standard Normal Distribution with Quantiles ... 258
Fig. 9.26  Prime Numbers ... 261
Fig. 10.1  Monte Carlo Design Phases ... 266
Fig. 10.2  Monte Carlo Flowchart ... 266
Fig. 10.3  Monte Carlo Experiment Coding Steps ... 270
Fig. 10.4  Coin Toss Monte Carlo Simulation: Three Tosses ... 271
Fig. 10.5  Coin Toss Monte Carlo Simulation: 50 Tosses ... 272
Fig. 10.6  Coin Toss Monte Carlo Simulation: Means for 50 Tosses ... 272
Fig. 10.7  Coin Toss Monte Carlo Simulation: Trace Graph ... 273
Fig. 10.8  Random Walk Variance at Each Step ... 275
Fig. 10.9  Random Walk Simulation Function ... 276
Fig. 10.10 Random Walk Without Drift Simulation Results ... 277
Fig. 10.11 Random Walk Without Drift Regression Results ... 278
Fig. 10.12 Random Walk Without Drift Regression Plot ... 279
Fig. 10.13 Random Walk With Drift Simulation Results ... 280
Fig. 10.14 Monte Carlo Simulation: Expected Value of Sample Mean ... 282
Fig. 10.15 Jarque-Bera Test for Normality of Monte Carlo Simulation ... 283
Fig. 10.16 Chi-Square Distribution ... 285
Fig. 10.17 Monte Carlo Simulation of the Central Limit Theorem ... 286
Fig. 10.18 Jarque-Bera Test for Normality of Monte Carlo CLT Simulation ... 287
Fig. 10.19 Graph of Example Integration Function ... 288
Fig. 10.20 Approximating Area Under a Curve ... 289
Fig. 10.21 Monte Carlo Integration Example ... 290
Fig. 10.22 Differentiation Using sympy ... 291
Fig. 10.23 Indefinite Integration Using sympy ... 292
Fig. 10.24 Definite Integration Using sympy ... 292
Fig. 11.1  A Melding Framework ... 296
Fig. 11.2  Summary of Scale-View Focuses ... 303
Fig. 11.3  An Operational Scale-View Problem ... 304
Fig. 11.4  An Expanded Operational Scale-View Problem ... 305
Fig. 11.5  A Tactical Scale-View Problem ... 308
Fig. 11.6  A More Complex Tactical Scale-View Problem ... 309
Fig. 11.7  A Strategic Scale-View Problem ... 310
Fig. 12.1  Queueing Schematic ... 314
Fig. 12.2  The Poisson and Exponential Distributions: Code ... 315
Fig. 12.3  The Poisson and Exponential Distributions: Graphs ... 316
Fig. 12.4  An M/M/1 Queue ... 319
Fig. 12.5  Artificial Arrival Generating Script ... 321
Fig. 12.6  Queueing Example Script ... 322
Fig. 12.7  Queueing Example Output ... 323
Fig. 12.8  Simulation Script for Queueing Example ... 325
Fig. 12.9  Simulation Output for Queueing Example ... 326
Fig. 12.10 Simulation Queueing Regression ... 327
Fig. 12.11 Illustration of Constrained LP Solution ... 329
Fig. 12.12 Linear Programming Example: Set-up ... 331
Fig. 12.13 Linear Programming Example: Solution ... 331
Fig. 12.14 The Log-Normal Distribution ... 334
Fig. 12.15 Linear Programming Simulation Set-up ... 335
Fig. 12.16 Linear Programming Simulation Output ... 336
Fig. 12.17 Poisson-Exponential Comparison ... 337
Fig. 13.1  Shifts in the Poisson and Exponential Distributions ... 344
Fig. 13.2  A Tactical Queueing Simulator ... 345
Fig. 13.3  Script to Import Churn Data ... 348
Fig. 13.4  Script to Randomly Sample the Churn Data ... 349
Fig. 13.5  Script to Split the Churn Data ... 349
Fig. 13.6  Churn Logit Model ... 350
Fig. 13.7  Churn Prediction ... 351
Fig. 13.8  Churn Prediction ... 352
Fig. 13.9  Churn Prediction ... 353
List of Tables
Table 1.1   Business Interlocking Parts ... 24
Table 2.1   Example List of Business System Parts ... 33
Table 4.1   Durbin-Watson Test Statistic Ranges ... 105
Table 4.2   Car Sales Data Dictionary ... 110
Table 4.3   Time Series AR(p) Signatures ... 123
Table 5.1   Example Contingency Table ... 163
Table 7.1   Loan Amortization Schedule ... 195
Table 8.1   Simulation Data Array ... 221
Table 9.1   List of Distributions in Numpy and SciPy ... 255
Table 9.2   Probability-based Functions in SciPy ... 256
Table 9.3   Random Seed Specification by Python Package ... 259
Table 12.1  Illustrative Event Times for Queueing Example ... 320
Table 12.2  Pricing Parameters for LP Problem ... 328
Table 13.1  Churn Data Dictionary ... 347
Part I
The Analytics Quest: The Drive for Rich Information
This first part reviews and summarizes the distinction between Poor and Rich Information and the need for the latter for business decisions. It introduces Predictive Analytics and Simulation Analytics. The latter is usually referred to simply as "simulations," a term that does not fully capture the complexity and depth of what is involved in simulating something. The "something" is a system of interconnected parts, which can be simply interwoven or combined in a very complex manner. More is involved in grasping a system, hence my use of "Simulation Analytics." This first part of the book not only introduces the two analytical paradigms but also the notion of systems and how predictions and simulations work together in a synergistic fashion to extract the richest Rich Information about the system.
Chapter 1
Decisions, Information, and Data
"Know your audience" is well-known advice, often quoted in public speaking, effective presentation, or creative writing courses. You are then taught to target your audience's interests and concerns, which are sometimes called pain points.

Your employer or client, perhaps a CEO, is your audience. What are her pain points? What keeps her up at night? How does she address and manage them? Basically, how does she manage her business? Her pain points encompass the massive decisions she must make, from the mundane to the strategically imperative, that could make or break the business. Compounding the gravity of these decisions is the welfare of the employees and their families. Unfortunately, she will not know the outcome of her decisions until a later date. She must agonize over them until the results are known. This agony is the pain, the pain of not knowing if a decision is the right one. She has to make that painful decision under uncertainty.

But it is not only the broad, enterprise-wide decisions made by her that matter. Lower-level managers similarly make daily decisions that profoundly affect the business. They too will not know the effects of their decisions until a later date. These decisions, made at the daily operational and longer-term tactical levels of the business, are pain points equally as onerous as those made by the CEO. These decisions are also made under uncertainty.

The CEO and all her subordinate managers do not make these decisions by themselves for all aspects of the business. The CEO, for example, has a team with different backgrounds and skill sets to help her. This team provides her with information, part of which is the implications and ramifications of her decisions. Some of her team includes those who can work with data, the data scientists, because data are the foundation for that information.

Do you know why she hired you (or will hire you) as a data scientist team member? Is it simply to run regressions, create pie charts, or calculate means?
Highly unlikely. A deeper, more profound reason is to reduce the uncertainty associated with her decisions. You provide her the information she needs; your job is information provisioning. But not
any information. She needs information on the entire system that is the business. You, as the data scientist member of the team, also provide all the subordinate managers with information for their decisions. The fundamental problem is the same regardless of the management level: needing information to make decisions that have wide-ranging effects which will not be known until a later date.
1.1 Decisions and Uncertainty

The CEO's main, almost sole, task every day is to make enterprise-wide decisions. The same holds for any member of her C-Level Team. This includes, but is not limited to, the Chief Operating Officer (COO), Chief Financial Officer (CFO), Chief Technology Officer (CTO), Chief Legal Officer (CLO), Chief Marketing Officer (CMO), and so on. Even lower-level plant and line managers are involved and are part of her team. The team's decisions are about the business, all aspects of it from production scheduling, pricing, approving new products, expanding or shrinking head count, advertising, and so on.

A main characteristic of these decisions is that they are all made today, but their impact will not be known until tomorrow, however tomorrow is defined. Tomorrow could be tomorrow on the calendar, or next week, next month, next quarter, or next year. Regardless of how you define it, its main characteristic is that you will never know what will happen until tomorrow occurs. All the decisions made today are under a Cloud of Uncertainty.

Uncertainty could be high enough to prevent the CEO from making any decision, stymieing the entire organization. It could lead to confusion and send mixed signals throughout the organization, or lead to wrong decisions such as not developing a new product the market overwhelmingly wants, raising a price point when competition is fierce and extensive, or ending a popular promotional offer. Inaction due to uncertainty could lower morale and decrease confidence in what the organization can do to succeed in the market. In short, uncertainty, as a root cause of inaction, is costly.

The management team needs to reduce that uncertainty, not eliminate it, but just reduce it. It cannot be eliminated simply because tomorrow is never known. See Paczkowski (2022b, Chapter 1) for a detailed discussion of decisions and the costs of those decisions. You might view this as obvious, if not trite and simplistic.
It does, however, beg three questions:

1. "What is uncertainty?"
2. "What is the cost associated with uncertainty?"
3. "How can uncertainty be reduced?"
1.1.1 What Is Uncertainty?

Simply stated, uncertainty is the opposite of certainty. This does not help because it only begs the obvious question: "What is certainty?" This seemingly simple question is difficult to answer because there are three types of definitions of certainty:

1. colloquial;
2. philosophical; and
3. probabilistic.

A colloquial definition is what is commonly understood. Such a definition is ultimately unclear because everyone has their own definition of certainty, rendering a colloquial one useless. You could, however, safely state (with certainty) that certainty is knowing something perfectly, without reservation, whatever the "something" is, such as an event, an idea, an amount, a location, or a concept. People will assert that they have complete certainty (a redundant statement) that the sun will rise tomorrow. They do not doubt it. Their knowledge level, whatever its source, is sufficient to allow them to make this assertion. That knowledge could be based on experience, education, or what they were told or read outside of formal education. Their level of information allows them to make this assertion.

The second definition, the philosophical one, deals with our information, or what is commonly called knowledge. This is within the realm of epistemology, the philosophical analysis of knowledge focused on "the study of knowledge and justified belief." See Steup and Neta (2020) for a discussion. This is a very complex topic with a long, complicated, and rich history dating back to the ancient Greek philosophers. It is also divided into several subcategories. The Internet Encyclopedia of Philosophy,1 for example, lists 29 subtopics dealing, directly or indirectly, with epistemology. Suffice it to say that, as the study of knowledge, epistemology gives certainty a high rank and priority. See Reed (2022) for a philosophical discussion of certainty.

A major issue in epistemology is the source of our knowledge. How do we obtain the knowledge we have?
This is tantamount to asking: where does the information we call knowledge come from? There are two sources: our senses and our experiences. Immanuel Kant is the major philosophical proponent of the notion that all we know comes from our five senses: hearing, sight, smell, taste, and touch. He states that "All our knowledge starts with the senses, proceeds from thence to understanding, and ends with reason. . . ." See Kant (1961, p. 300). The "understanding" is the human intellect or mind that connects all the sense experiences. See Yovel (2018, p. 2). I summarize this in Fig. 1.1. I modify this paradigm for this book to "all our decisions begin with data, proceed then to information extraction, and end with reduced uncertainty." I show this modification in Fig. 1.2.
1 See https://iep.utm.edu/epistemo/.
1 Decisions, Information, and Data
Fig. 1.1 (diagram: Senses → Understanding → Reason) This summarizes the Kantian connections between the senses, understanding, and reason. I modify this in Fig. 1.2 for my purposes in this book
Fig. 1.2 (diagram: Data → Information Extraction → Reduced Uncertainty) This summarizes my connections between data, information extraction, and reduced uncertainty. This is a modification of Kant's paradigm in Fig. 1.1
The third definition, the probabilistic one, assigns a probability to an event. The higher the probability, the more certain someone is that the event will occur. This is not to say it will occur since believing does not make it so. It is just believed that it will occur. Conversely, the lower the probability, the less we believe, and the less certain we are that the event will occur. A consequence of a high or low probability, one greater or less than a threshold probability, respectively, is that someone will take an action even though the probability is not 100%. For example, suppose 80% is the threshold people have in mind for accepting or believing a weather alert such as a tornado alert. An alert rating greater than 80% is sufficient for them to take it seriously and take precautions. If an alert indicates a 95% chance of a local EF4 or EF5 tornado within the next 30 min, then almost everyone will take it seriously and seek shelter.2 If there is a 10% chance, however, then most likely no one will take cover.

The probabilistic interpretation of uncertainty raises a question about the relationship between uncertainty and risk. To most people, there is no distinction between the two. This amounts to an extension of the colloquial definition of uncertainty. There is, of course, a formal distinction based on knowledge of the probability distribution for an event. Knight (1921) defined a situation as uncertain if we do not know the probability distribution for an event, while we have risk if we do have enough knowledge to say something about the distribution. This is another epistemological issue that only leads to further questions about the source of our knowledge. See Aven (2004) for an interesting discussion about risk and uncertainty in decision making. Economics and financial analysis mostly focus on risk rather than uncertainty, in part because subjective probabilities are used to calculate risk.
See Hodgson (2011) for a discussion of the relative uses of risk and uncertainty in these areas.
2 See the National Weather Service website at https://www.weather.gov/mkx/taw-tornado_classification_safety for tornado classifications. The EF4 and EF5 tornadoes have wind speeds of 166 to 200 MPH or more.
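The threshold logic in the tornado example can be written as a one-line decision rule. This is a toy sketch: the 80% threshold and the alert probabilities are the assumed values from the example above, not a real alerting standard.

```python
def act_on_alert(prob: float, threshold: float = 0.80) -> bool:
    """Return True if the alert probability clears the action threshold.

    The threshold is subjective: it encodes how certain a person must be
    before acting, even though the probability is never 100%.
    """
    return prob >= threshold

# A 95% chance of a local EF4/EF5 tornado clears the 80% threshold: seek shelter.
print(act_on_alert(0.95))  # True
# A 10% chance does not: most likely no one takes cover.
print(act_on_alert(0.10))  # False
```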
Fig. 1.3 This illustrates the relationship between the uncertainty concepts, information, subjective probabilities, and Knight’s concepts of risk and uncertainty. Based on concepts in Fox and Ulkumen (2011)
Uncertainty consists of two parts:

Aleatoric Uncertainty This is concerned with random events or outcomes such as the result of the toss of a fair die. This uncertainty cannot be eliminated, although it could be minimized because it is inversely related to the amount of information we have.3

Epistemic Uncertainty This is strictly information-based and is inversely related to the amount of information we have.4

See Fig. 1.3 for reference. Both are tied to subjective probabilities and information. See Fox and Ulkumen (2011) for a very readable analysis of these two forms of uncertainty.

I have mentioned "information" several times. What is information and where does it come from? This is a difficult concept to define. There is certainly the colloquial definition: something we know or are told. A slightly more sophisticated view is that information is a surprise, something not expected, something totally new. A clever descriptive word is surprisal.5 Regarding sources, there are probably too many to list. For my purpose, the main source is data, but information is buried inside the data and must be extracted using statistics, econometrics, and machine learning. Some of the extractions result in Poor Information and some result in Rich Information. Poor Information only skims the surface of what is possible to know; it is mildly surprising. Means, proportions, and very simple graphs such as pie charts are examples. Rich Information is insightful, useful, and actionable. Regression analysis, machine learning, and complex data visualization with Gestalt Graphicacy Principles are examples of ways to extract Rich Information. See Paczkowski (2022c) on the Gestalt Principles.

Fig. 1.4 This illustrates the three components of Rich Information and their relationship to data. Rich Information is the intersection of all three bubbles; it is the Actionable Information

Rich Information must be insightful, useful, and actionable. My Venn diagram in Fig. 1.4 shows the universe of data (the square area) divided into multiple subareas, the division based on three "bubbles" representing the insightful, useful, and actionable components. Any area inside a bubble but not in the intersection of all three is Poor to mild (or middling) Information. The intersection of all three is Rich Information. Anything not in a bubble is Null Information, which is just the data themselves; there is nothing gained from just the data. Insightful Information tells you about the current or potential environment, whether there is a problem and what can be done about it. This is Shannon's surprisal.6 See Shannon (1948) for the information content of a message. Anything that is not insightful is not useful or actionable. As an example, suppose an annual customer satisfaction survey shows that customers are top-two box (T2B) satisfied with customer support. See Paczkowski (2022c) for ways to analyze survey data. Suppose you had the same result for the past 5 years. This year's survey is not insightful: it is nothing new. It is also not Useful Information because it is already incorporated in, say, ads. And it is not actionable because any action based on customer satisfaction would have already been taken. Consequently, this information is Null Information: it is just data and nothing else. Actionable Information tells you what you can and should do. The specific action you should take may not be clear; it may still have to be discussed, developed, honed, or articulated. But action will be forthcoming, nonetheless. Useful Information guides you in the action you (eventually) take. There may be a host of actions.

3 Aleatory is from a Latin word for a dice throw.
4 Epistemic is from the Greek word for knowledge. It is the basis for the philosophical area of epistemology.
5 Coined by C. Shannon of Bell Labs. See Shannon (1948).
Useful Information tells you or helps you decide which one. It suggests a course of action. It is one part of a larger whole for the decision. It is also a potential: maybe you cannot use it immediately but after being combined with other information, you
6 See https://en.wikipedia.org/wiki/Information_content. Last accessed September 22, 2022.
can then make a decision. Useful Information is magnitude, direction, and I&R (implications and ramifications). The I&R is important because it frames the decision.

As an example, suppose a customer survey reveals that nine out of ten of your customers prefer your product. This is insightful if it was not known before. It is not useful if management and marketing have no plans for it, say an ad campaign. How can this fact be used? And it is not actionable; what action can you take based on this information? If the survey shows that nine out of ten say the price is too high, this may be insightful because a price issue may not have been previously recognized. It is useful because it suggests the price should be reduced, or it can be used as a directive for further investigation of elasticities and price changes. But it is not actionable: how much to reduce?

Now consider a Business Analytics study to estimate a price elasticity. The elasticity may be insightful: your product may be more elastic than expected. The elasticity is useful: it can be used in a simulator7 to test different price points. But the elasticity per se is not actionable because it does not tell you the price change, only the result of a change. The elasticity is just a number such as -2.0. Suppose you build a simulator using the elasticity and use it to test a series of candidate price changes (i.e., price scenarios). The results may be insightful: more revenue and profits than expected. They are useful because the specific numbers tell management the magnitude of what it can do and what it has to decide. They are also actionable because a specific recommendation can be made for a price change.

Poor Information is the absolute minimum you can work with. It provides some insight but nothing useful and actionable. It consists of, but is not limited to, means, proportions, and even hypothesis tests.
For example, suppose you compare your product against two competitor products on a common attribute (e.g., usability) using survey data. The data are measures on a five-point Likert scale with 5 representing Very Usable. The comparisons are based on the sample mean for this attribute for each product. Since there are three products, you would use a multiple comparison test of the means of the three products. See Paczkowski (2022c) for ways to analyze survey data and the use of multiple comparison tests. Suppose you find that the three products are rated the same: the mean rating is the same for all three products. This may be information (i.e., a surprise) if you believed your product is superior. It is, therefore, insightful because it tells you that the products are the same, no differentiation. So, one of the three properties of information is met. Is this useful or actionable? No. What do you do with this (Poor) information? Perhaps you could engineer a superior product, but how? What product component would you improve? Or maybe you could change the marketing program. How? The results do not tell you what to do or how you can use it—this is unactionable information. Is it useful? Also no. Can you use this in an advertising campaign? Hardly, since you would have to say that you are the same as the others, there is no advertising advantage. So, this is Poor Information.
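Returning to the elasticity example above, the kind of simulator it describes can be sketched in a few lines. This is a hypothetical illustration under a constant-elasticity approximation; the -2.0 elasticity comes from the text, while the base price, base quantity, and candidate price changes are made-up values.

```python
def simulate_price_change(base_price, base_qty, elasticity, pct_change):
    """Constant-elasticity approximation: %change in Q = elasticity * %change in P.

    Returns the new price, new quantity, and new revenue for one price scenario.
    """
    new_price = base_price * (1 + pct_change)
    new_qty = base_qty * (1 + elasticity * pct_change)
    return new_price, new_qty, new_price * new_qty

# Candidate price changes (i.e., price scenarios): -5%, 0%, +5%
for pct in (-0.05, 0.0, 0.05):
    p, q, rev = simulate_price_change(base_price=10.0, base_qty=1000,
                                      elasticity=-2.0, pct_change=pct)
    print(f"{pct:+.0%}: price={p:.2f}, qty={q:.0f}, revenue={rev:,.2f}")
```

With an elastic demand (|elasticity| > 1), the sketch shows revenue rising when the price is cut, which is exactly the kind of actionable result a simulator produces: a specific price change can be recommended, not just an elasticity reported.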
7 I discuss simulation later in this book.
Information is used to form subjective probabilities of future events or outcomes, which are future states of the world (SOW). The SOW are possible outcomes. Unfortunately, these are unknown at the time of the decision. This is the pain point. You could, however, associate probabilities with each possible state. There are two issues with this approach:

1. You have to specify the SOW, and
2. You have to specify the probabilities.

The specification of SOWs is tantamount to defining scenarios, which are small snapshots or "scenes" of possible future worlds. I will discuss scenarios in Chap. 8. The specification of probabilities for each SOW (i.e., scenario) is purely subjective, perhaps based on Bayesian concepts. See Paczkowski (2022c) for a discussion of Bayesian approaches.
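The SOW idea can be sketched numerically. The three scenarios, their subjective probabilities, and the payoffs below are all hypothetical; the only structural requirements are that the probabilities sum to 1.0 and that the expected payoff is the probability-weighted sum over the states.

```python
# Hypothetical states of the world (scenarios) with subjective probabilities
sow = {
    "recession":  {"prob": 0.2, "payoff": -1.0},  # payoff of a decision in $M
    "status_quo": {"prob": 0.5, "payoff": 2.0},
    "expansion":  {"prob": 0.3, "payoff": 5.0},
}

# Subjective probabilities over the SOW must sum to 1.0
assert abs(sum(s["prob"] for s in sow.values()) - 1.0) < 1e-9

# Probability-weighted payoff over the states of the world
expected_payoff = sum(s["prob"] * s["payoff"] for s in sow.values())
print(f"Expected payoff: {expected_payoff:.2f}")  # 0.2*(-1) + 0.5*2 + 0.3*5 = 2.30
```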
1.1.2 The Cost of Uncertainty

Uncertainty, however defined, leads to an unseen cost. Most people are familiar with the concept of a cost, but in terms of dollars and cents. You buy a new copier, laptop computer, or printer and you incur a cost. This is recorded in an accounting record as a dollar amount that either was paid or will be paid to the vendor who sold you the product.8 There are thus two aspects to this cost:

1. The dollar amount paid or owed; and
2. Who was or is to be paid.

This is well understood by all who work in business. There is a subtle third component to this cost description: the accounting system itself. The operative word is "system," a series of records of assets and liabilities (i.e., the balance sheet), transactions (i.e., the income statement), ledgers, journals, and more. Each has its role to play in tracking costs, yet they are not separate and independent. Each contains a type of data such as the actual dollar cost incurred (and the amount paid), the name and address of the vendor, the nature of the cost, and so on. The collection is a database, one that can become very large depending on the size, nature, complexity, and legal requirements of the business.

A system has inputs and outputs. The input to the accounting system is the invoice and the output is the payment to the vendor. Inside the system are mechanisms, processes, and procedures to accept the input, record it, validate it, process it, and then produce an output (i.e., a check). All systems have these components:
8 The accounting record could be a business’s bookkeeping system or a personal checking account. Either one is a record-keeping system.
Fig. 1.5 (diagram: Inputs → Processes → Output) This summarizes the three basic components of any system
1. Inputs;
2. Processes; and
3. Output

where the flow is naturally from the first to the last component as I illustrate in Fig. 1.5. I will discuss systems in detail in Chap. 2.

The first two aspects of a cost (i.e., the amount and the vendor) are most likely well-known to you, but the intricacies of your business's accounting system (the middle block of Processes) may be less known. You probably view it as a "black box" where you merely submit an invoice and then never have contact with it again.

There is more to the cost of doing business, however. The concept goes beyond the dollars and cents to something with wider impacts on the enterprise. Economists call this wider concept an opportunity cost. This is usually introduced to college students in the first week of an introductory course, the infamous Economics 101. The concept is very simple. It rests on the principle that you cannot have all you want of everything. You often have to trade off some of one item to get one more unit of another. An opportunity cost is simply whatever you give up. There may be several alternatives, each with a different value to you, but each would be given up. In simple terms, a cost is "the highest-valued rejected option." See Alchian and Allen (1972, p. 36). Clearly, a dollar amount represents or measures a cost because those dollars could have been used to buy something else. So, this definition is in accord with what most people intuitively say is a cost. An opportunity cost, however, is not, and cannot be, represented in the accounting system, yet it is a cost nonetheless that must be considered.

A decision maker such as the CEO wrestles with these opportunity costs all the time. If she has two decision opportunities, A and B, and she chooses decision A, she gives up the opportunity to do B. This is a cost no different than one on an invoice. But she has no way of knowing the return on that decision, if it was worth the cost, until some future time.
This is, again, the uncertainty she faces, the pain point. But this is her uncertainty because of her decision. This is an internal decision.

Uncertainty affects a business in other ways. One is associated with product demand. Demand is uncertain when you do not know when a customer will place an order, how much will be ordered, and even if it will be placed. These are three decisions a customer, not the CEO, makes about a product:

1. To buy or not to buy;
2. If buy, when; and
3. If buy, how much.
This uncertainty is external to the firm since it is determined by the customers themselves; it is based on external decisions. The C-Level executives impact these customer decisions through their business decisions regarding product promotions, price points, product attributes, product availability, and product delivery time. There is a connection between internal business decisions and external market decisions.

As an example, assume customers randomly arrive at your business to buy or place an order. There is a mean arrival rate, λ, per unit time. For example, five customers per hour may arrive, on average. This mean rate is a function of the product's price so it is written as λ(p) with dλ/dp < 0. This has the effect of shifting the demand curve. When customers arrive, they may find that they have to wait to be served by the order fulfillment system. So, they join a fulfillment queue. If it is too long, the customer could either balk (i.e., not join the queue) or renege (i.e., cancel an order) and go to a competitor. In either case, a customer is lost, revenue is lost, and potential market share is lost. These are all opportunity costs of a pricing decision made before customers arrive. If you raise the price, fewer orders are placed and the fulfillment queue is shorter. There are fewer unhappy customers, but there are also fewer customers. See DeVany (1976) for a discussion of queues and this pricing function. I will return to this example in Chap. 12 when I discuss queueing systems and simulations.

The fulfillment queue is the result of a shock to the ordering-fulfillment system of your business due to a pricing decision. The system consists of an order-taking process, a fulfillment process, an invoice and accounting process, and an inventory tracking and replenishment process. The shock is internal to your business since the CEO made the decision. A disruptive shock to the supply chain of some inputs for the product could have a similar effect.
But in this case, production may have to be temporarily shifted to other products with adequate supplies. Again, there are opportunity costs involved. In both cases, your management does not know ahead of time what will happen either because of their decision or inaction. There is uncertainty.
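The price-dependent arrival rate λ(p) with dλ/dp < 0 can be sketched with a simple functional form. The linear form and its parameters below are hypothetical illustrations for this chapter's example, not the specification in DeVany (1976); the simulation just shows that a higher price produces fewer mean arrivals per hour.

```python
import math
import random

def arrival_rate(price, base_rate=6.0, slope=0.4):
    """Hypothetical linear lambda(p): mean arrivals per hour fall as price rises."""
    return max(base_rate - slope * price, 0.0)

def mean_arrivals(price, hours, seed=42):
    """Simulate hourly Poisson arrival counts at rate lambda(p); return the mean.

    Poisson draws use the classic Knuth method so only the standard library
    is needed.
    """
    rng = random.Random(seed)
    lam = arrival_rate(price)
    total = 0
    for _ in range(hours):
        threshold, k, p = math.exp(-lam), 0, 1.0
        while p > threshold:
            k += 1
            p *= rng.random()
        total += k - 1
    return total / hours

# Raising the price lowers the arrival rate: a shorter fulfillment queue,
# but also fewer customers -- the opportunity cost of the pricing decision.
print(mean_arrivals(price=2.0, hours=500))   # near lambda = 6.0 - 0.4*2 = 5.2
print(mean_arrivals(price=10.0, hours=500))  # near lambda = 6.0 - 0.4*10 = 2.0
```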
1.1.3 Reducing Uncertainty

How are uncertainty and, therefore, cost reduced? By developing and using the best information possible about decision impacts, market structure, and customer needs, to list a few. This is Rich Information that leads to decisions with a high degree of confidence that they are the right decisions. Poor Information does not significantly reduce uncertainty or instill confidence.

Where does Rich Information come from? There is only one source: data. Data per se are not information. They are a storehouse of information, but one that hides information. Data are a veil covering and clouding the information within. Somehow, information must be extracted from data using tools, techniques, and methodologies that pull away that veil. This is where you, as the data scientist, come
in. This is why you were hired. Your function is to pull aside this veil to extract Rich Information from data so that your CEO and her team across the enterprise can make the best data-driven decisions at the lowest possible opportunity cost of uncertainty, thus minimizing their pain points. Of the two components of uncertainty, epistemic and aleatoric, only the former can be reduced to zero. I indicated this in Fig. 1.3. This is because epistemic uncertainty is defined as knowledge-based uncertainty. Aleatoric uncertainty can also be reduced as you gain more information, but unfortunately, it cannot be reduced to zero as can epistemic uncertainty. Aleatoric uncertainty is a function of purely random events and, by the nature of randomness, you have no control, no influence, and, in fact, no knowledge of the causes of that randomness.9 Aleatoric uncertainty has a lower limit, a limit that is always present regardless of how much Rich Information you have. Consequently, the opportunity cost associated with uncertainty asymptotically approaches a positive, nonzero lower limit and does not go to zero. See Paczkowski (2022b, Chapter 1) for a discussion of the cost curve.
1.1.4 The Scale-View of Decision Makers

There is a slew of decisions a data scientist supports. Some may have larger impacts than others, but none are trivial. The magnitude and importance of a decision depend on the level of the decision maker. There are two levels: primary and secondary. Primary decision makers are responsible for all the decisions that affect the very existence of the business. These are strategically, existentially focused decisions answering questions such as:

• "Where should the business be focused?"
• "Where should the business go next?"
• "What do we have to do to get there?"
• "Should we merge with another company or divest part of our company?"
as just a few examples. These decision makers are all in the C-Level echelon of the business. Secondary decision makers are more operational and tactical. They answer questions such as:

• "Should I assign production work to machine A or B for this order?" (operational)
• "Should I buy this material input or that one?" (operational)
• "Should I schedule overtime work this week?" (operational)
• "Should I raise the price 1% for a product in my product line?" (tactical)
• "Should I hire this spokesperson for my ad campaign?" (tactical)

9 For a deep philosophical discussion of randomness, see the article "Chance versus Randomness" at the Stanford Encyclopedia of Philosophy: https://plato.stanford.edu/entries/chance-randomness/. Last accessed September 22, 2022.
and so on. These people are in the middle and entry levels of the business hierarchy. In short, everyone is responsible for some type of decision and all decisions have an impact on the business.

The level of the decision maker determines their scale-view, the perspective they have of the business. Some decisions are very broad and all-encompassing; others are narrow and specific. The CEO's scale-view is that of the entire enterprise, while the scale-view of a production line manager is for the production line only. The CEO makes decisions that impact the entire enterprise with far-reaching implications and ramifications that will last for several years. In most instances, these decisions have major financial implications as well as human resource and stockholder-equity implications. The line manager's decisions only impact the production of that particular line, and most likely only for a day, that moment in time when the decision is made. The CEO's scale-view is broad; the line manager's scale-view is narrow.

How the business is perceived differs by scale-view. Those who have a broad scale-view perceive the business as being very complex with many interacting parts. Their major concern is those interacting parts, in particular ensuring that the parts work smoothly together with a tight symbiotic relationship. Expanding (e.g., through mergers) or contracting (e.g., through divestitures) the enterprise is their focus. This perceived complexity is long term. Those with a narrow scale-view see a less complex overall operation. Their concern is only with their immediate functional area (e.g., production of a single product in a job shop) and perhaps the one or two other functional areas they interact with daily. I will have more to say about scale-views, complexity, and systems in Chap. 2.
1.1.5 Rich Information Requirements

Regardless of the level and scale-view of a decision maker, Rich Information is still needed for an effective decision. The Rich Information requirement for those with a broad scale-view differs from those with a narrow one. Those with a broad scale-view do not need great detail on all operations, such as daily schedule runs, personnel tardy and absentee records, and which departments and groups are behind schedule for a report hand-off. They do need Rich Information, however, on returns on investment (ROI), stock price movements, competitor strengths and weaknesses, and so on. They need higher-level Rich Information. Those with a narrow scale-view need more detail because they see a different operation. They need focused Rich Information on daily and tactical functions.

This level of Rich Information—higher and broader or narrower and focused—is one dimension of Rich Information. Another encompasses the type of Rich Information. The first is current and historical and the second is predictive. They answer questions such as:
• "Where is the business currently?"
• "Where is the business going?"

The answer to the first question is provided by Business Intelligence and the answer to the second by Business Analytics. These are two distinct disciplines, but with a common base. They are distinct in their focus. They have data as their common base. The data scientist provides both types of Rich Information to decision makers; whether they are primary or secondary decision makers is of no consequence. They need Rich Information that must be extracted from data.
1.2 A Data and Information Framework

I said that information must be extracted from data. You may have challenged this because most data scientists view data and information as equivalent, two words used interchangeably that mean the same thing. Most business managers also believe in their equivalence. It is not uncommon, for example, to hear a CEO say she has a large data warehouse full of information, all at her disposal. Unfortunately, this perspective is accurate, but only to a point. She has no idea what the information is. The larger and more complex the data warehouse, the more information it contains. See Paczkowski (2022b). All data consist of two parts: information and noise that distorts the information base. The paradigm is

Data = Information + Noise.    (1.1)
This simple equation is a major theme of this book. Noise is a random variation due to unknown, unknowable, and uncontrollable factors that cause the data measures to differ from what you expect. This is the minimum aleatoric uncertainty.

As a simple example, look at the two graphs in Fig. 1.6. The left-hand panel shows what you might expect to be the relationship between sales and price. This is the demand curve drawn in a basic economics course. Visually, the data points lie perfectly on a straight line. You can refer to this graph as a stylized relationship. The right-hand panel shows the same data but with random noise added. This is more realistic; it is non-stylized. Notice that the points are not on a straight line but form a cloud around what you would perceive as a trend line. To help you visualize this trend, I added a regression best-fit line that you can interpret as the trend line. I will discuss this line in Chap. 3.

The points in the right-hand panel of Fig. 1.6 are off the trend line for reasons you cannot know; you can certainly list many potential reasons, but you will never be certain of them all. This has grave implications. You cannot take action to control or manipulate what you do not know. The only reason you can state for being "off the line" is that noise exists in the data, which thus produces a cloud around the line, as you can see in the right panel.
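The decomposition Data = Information + Noise can be illustrated by generating data like those in Fig. 1.6 yourself. This is a sketch with made-up demand parameters (intercept 100, slope -2) and a made-up noise standard deviation; the trend line is an ordinary least squares fit computed by hand.

```python
import random

random.seed(0)
prices = [p / 2 for p in range(2, 22)]              # price points 1.0, 1.5, ..., 10.5
stylized = [100 - 2 * p for p in prices]            # Data = Information (no noise)
noisy = [y + random.gauss(0, 3) for y in stylized]  # Data = Information + Noise

# OLS slope and intercept for the best-fit (trend) line
n = len(prices)
mx = sum(prices) / n
my = sum(noisy) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(prices, noisy))
         / sum((x - mx) ** 2 for x in prices))
intercept = my - slope * mx
print(f"fitted: sales ~ {intercept:.1f} {slope:+.1f} * price")  # close to 100 - 2 * price
```

The fitted line recovers the information component (the stylized relationship) despite the noise cloud, which is exactly the extraction task described in the text.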
Fig. 1.6 This chart shows an example of two graphs, one without and one with random noise
Noise is important to recognize and acknowledge because it distorts your ability to see relationships, patterns, and trends. These are useful for predicting the impact of a decision, regardless of the level of the decision and regardless of the scale-view. The noise is the veil I referred to above; it is the cloud in the right panel. Without noise, Data = Information. Despite not knowing what causes noise, you still have to make some statement about its origin. Where does it come from? The best you can do is make some assumptions. The major one is that noise is a random draw from a probability distribution. This accounts for its random nature. The usual distribution is the normal or Gaussian distribution, the infamous "bell-shaped curve." The reason for the last name is the familiar visual image of this distribution that looks like a bell. Unfortunately, this is not a good descriptor because there are other bell-shaped distributions, the t-distribution being an example. Nonetheless, the name is conventional, though it should be avoided. The normal distribution curve is more formally called a probability density curve, and the formula for it is the probability density function (pdf). I show a normal distribution pdf in Fig. 1.7. The normal pdf is

pdf = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (1.2)

Fig. 1.7 This is the standard normal distribution. Its mean is zero and standard deviation is 1.0

I will say more about draws from a probability distribution in Chap. 9. The normal distribution is characterized or defined by its mean, μ, and variance, σ², summarized as N(μ, σ²). If μ = 0 and σ² = 1, the normal is said to be standardized. The pdf in Fig. 1.7 is a standardized normal pdf. There are several further assumptions about noise. Its mean is assumed to be zero and its variance is σ². We assume a zero mean because we do not want the noise, in the long run and on average, to have any impact or influence. The collective assumption is written as noise ∼ N(0, σ²). In addition, the covariance between any two noise elements must also be zero so that there is no relationship between them. This is written as Cov(noiseᵢ, noiseⱼ) = 0 for all i ≠ j. With these assumptions, the noise is called white noise. I will say more about white noise in Chap. 3.
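These white noise assumptions can be checked empirically on simulated draws. A minimal sketch, assuming σ = 1 and 10,000 draws: the sample mean should be near zero, the sample variance near σ², and the lag-1 autocovariance near zero.

```python
import random

random.seed(1)
n = 10_000
noise = [random.gauss(0.0, 1.0) for _ in range(n)]  # noise ~ N(0, sigma^2), sigma = 1

mean = sum(noise) / n
var = sum((e - mean) ** 2 for e in noise) / n

# Lag-1 sample autocovariance: near zero when Cov(noise_i, noise_j) = 0 for i != j
autocov = sum((noise[i] - mean) * (noise[i + 1] - mean) for i in range(n - 1)) / n

print(round(mean, 2), round(var, 2), round(autocov, 2))  # approximately 0.0, 1.0, 0.0
```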
Fig. 1.8 This is an example of what the Data Paradigm looks like in a grid fashion. This grid is a modification of one in a blog article by Mike Wolfe at https://nolongerset.com/signal-vs-noise/. Permission to use granted by Mike Wolfe
The Information in (1.1) is the foundational information at the core of Poor and Rich Information. It is the expected value of the data, assuming the noise itself has an expected value of zero. This foundational piece is what a CEO is referring to when she says I have all this data, all this information. The noise is ignored or not even acknowledged. She assumes, like most people, that E(noise) = 0. The only issue is the packaging of the data into Poor or Rich Information components. Once noise is acknowledged, however, the issue extends beyond mere packaging. It becomes one of extraction. How the informational base is extracted, displayed, summarized, and reported determines its Poor and Rich Information status. What is extracted is only the beginning.
You can portray data defined by the paradigm in (1.1) using a 2 × 2 grid such as the one I show in Fig. 1.8. I show both components of data as being artificially split into low and high designations. Notice the solid lines in the graphs in each quadrant. These are the expected relationships given the information content. A steep line indicates high information content because a one-unit change in the X-axis variable results in a large change in the Y-axis variable. Compare this to a relatively flat line. The points around the lines are due to the noise. The spread of the cloud-like formation of noise points indicates the degree of noise: small amount of noise, small cloud; large amount of noise, large cloud.
Fig. 1.9 This illustrates the extraction of information from the information-base component of data
You can see relationships, trends, and patterns when you have high information content and low noise. You have clarity and insight so that any information you extract is useful and insightful. If you have high noise, however, you have distractions and distortions in the information caused by that noise. You have to work harder to get that information. With low information content and high noise, you have obscurity because you cannot tell what kind of information you have. Finally, with low information content and low noise, you can see immediately that you have nothing, so the data can just be discarded. These data are useless. Information is that piece of data that has value and must be extracted. Noise is just a nuisance, but one that has to be dealt with. I will discuss how it is dealt with in Chap. 3. I illustrate the information and noise possibilities in Fig. 1.9. I am not concerned with the extraction of Poor Information because this information is applicable for Business Intelligence; I am concerned with Business Analytics and, in particular, with the predictive aspect of Business Analytics. It is the extraction methods for Rich Information for Predictive Analytics that are my focus.
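The four quadrants of such a grid can be mimicked with simulated data. In this minimal sketch, a steep slope stands in for high information content and a large noise standard deviation for high noise; the slopes, sigmas, and quadrant labels are my own illustrative choices, not values from the text:

```python
import random

random.seed(0)

def generate(slope, sigma, n=50):
    """Simulate data = information + noise: y = slope * x + N(0, sigma^2).
    A steep slope mimics high information content; a large sigma, high noise."""
    xs = [i / n for i in range(n)]
    return [(x, slope * x + random.gauss(0.0, sigma)) for x in xs]

# The four quadrants of the 2 x 2 grid (labels are mine):
quadrants = {
    "high info / low noise":  generate(slope=5.0, sigma=0.1),
    "high info / high noise": generate(slope=5.0, sigma=2.0),
    "low info / low noise":   generate(slope=0.2, sigma=0.1),
    "low info / high noise":  generate(slope=0.2, sigma=2.0),
}
```

Plotting each quadrant's points reproduces the tight and diffuse clouds around steep and flat lines: in the high-information, low-noise case the underlying slope is easy to recover; in the low-information, high-noise case it is buried.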
1.3 Rich Information Predictive Extraction Methods

There is a bewildering array of Rich Information extraction methods that can be condensed into two broad categories:

Informal Analytics: Focus on past behaviors of internal operations and markets via summaries;
Formal Analytics: Focus on forward-looking behaviors of operations, external markets, and environments via sophisticated methods.

Informal Analytics is retrospective, while Formal Analytics is prospective. Each is composed of parts which I illustrate in Fig. 1.10.
Fig. 1.10 This chart shows the general information extraction methods
1.3.1 Informal Analytical Components

Reporting and Dashboards are what most people think about when Business Intelligence (BI) is mentioned. In the early period of Decision Support Systems (DSS), from which BI evolved, this was the case. Modern reporting and dashboards have gone well beyond the early versions. Dashboards are still important and informative, but they are now more detailed, elaborate, complex, and, most importantly, interactive with drill-downs and queries. They are generally subdivided into two categories: operational and markets/environments. Operational dashboards focus on how the business performed or is performing concerning production quotas, supply chain management, financial goals, revenue streams, and sales goals. Markets/environments dashboards cover stock prices and economic activity (e.g., interest rates, inflation, and real GDP growth).
Drill-downs involve digging into the messages derived from the reports and dashboards to address questions such as "Why is this happening?" A dashboard may indicate that sales are lower in one marketing region than another. Drill-downs help management understand why so they can take corrective action.
Querying is the process of answering ad hoc questions (i.e., queries) posed by decision makers. This is a subset of the drill-down function because the process of drilling down into the data is driven by questions such as "Why are sales lower in the Southern marketing region?" The query function differs, however, in that the questions can be general, not related to a concern raised by a dashboard, and, again, ad hoc. They are frequently referred to as ad hoc queries. Examples are: How many employees have advanced degrees and are in management? How many customers have orders exceeding 1000 units per month on average? The SQL programming language and its derivatives are heavily relied on for these queries.
Complex queries are frequently referred to the data science team, which has the expertise in SQL to answer them.
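An ad hoc query like the customer example above maps directly to SQL. Here is a minimal sketch using Python's built-in sqlite3 module; the orders table and its values are invented purely for illustration:

```python
import sqlite3

# A toy orders table (schema and values invented for illustration).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, month TEXT, units INTEGER)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("Acme", "2023-01", 1500), ("Acme", "2023-02", 900),
     ("Beta", "2023-01", 400), ("Beta", "2023-02", 300)],
)

# Ad hoc query: which customers average more than 1000 units per month?
rows = con.execute(
    """
    SELECT customer, AVG(units) AS avg_units
    FROM orders
    GROUP BY customer
    HAVING AVG(units) > 1000
    """
).fetchall()
print(rows)  # → [('Acme', 1200.0)]
```

The GROUP BY/HAVING pattern is the workhorse for this kind of question: aggregate per customer, then filter on the aggregate.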
1.3.2 Formal Analytical Components

Statistics and econometrics are concerned with building formal models and testing hypotheses about the business, its markets, its customers, and its operating environment. Key Driver Analysis (KDA) is an important application, for example, for determining the main determinants of customer satisfaction, credit worthiness, and churn. Price elasticity estimation is another important example application, used in pricing strategy development. A/B chi-square testing is yet another example, used in digital advertising to compare two landing-page options for websites. See Paczkowski (2018, 2022b) for insight into pricing analytics and general Business Analytics, respectively.
Data and Text Mining are both concerned with finding messages hidden inside large data sets (i.e., Big Data). Data mining, per se, is concerned with structured numerical data, while text mining is concerned with unstructured textual data. Structured data are well formatted and organized, the type the early computer centers and DSS dealt with, while unstructured data are chaotic and disorganized. Both data and text mining deal with finding "messages." Both rely heavily on statistical methods, supervised and unsupervised learning, machine learning, and now AI because of the sheer volume of data and the complexity of modern operations.
Predictive modeling and forecasting are similar yet different, just as drill-downs and queries are similar yet different. Forecasting, in fact, is a subset of predictive modeling. Both are concerned with filling in a "hole" in our knowledge so complex business decisions can be made. The difference is that forecasting is temporally oriented. Predictive modeling per se deals more with what-if or scenario analysis. You forecast sales for 2025, but you predict whether a customer will default on an invoice given past credit scores. I will use the two terms interchangeably in this book.
You can divide Predictive Analytics into two categories depending on the type of your data. If you have time series data, then you have forecasts; otherwise, you predict the likely outcome of an action. Both are ultimately used for scenario or what-if analysis. This type of analysis relies on conditions (i.e., the scenario) specified by the team responsible for the analyses (e.g., the data scientists), by upper management, or by a client. The scope of a scenario depends on the scale-view of the client. Scenarios for the CEO, who has a wide strategic scale-view, cover the entire enterprise and involve questions such as:
• "What would be our ROI over the next 5 years if I enter into a new product category?"
• "What would happen to our market share if I have to divest a business unit?"
• "What would happen to our net earnings by quarter for the next 3 years if there is another pandemic?"
Scenarios for a product manager with a narrower tactical scale-view involve only that product line. The scenarios might involve questions such as:
• "What would be our increase in net sales over the next 5 years if I lower our price by 1%?"
22
1 Decisions, Information, and Data
• "What would happen to our production quotas if I reduce the head count by 5%?"
• "What would happen to our contribution margin by quarter for the next 3 years if raw material prices rise by 2% per quarter?"
Scenarios for a line manager with a narrow daily scale-view might involve questions such as:
• "What would happen if two production robots suddenly went down for maintenance?"
• "How should I reallocate production if a special promotion for product X causes a sudden increase in its demand?"
• "What would be the effect of a drop in a raw material inventory for 5 days?"
I will discuss scenarios in more detail in Chap. 8 as part of the design of a simulator. A simulator is used to simulate the different conditions of a scenario. A simulation is in the domain of Predictive Analytics methods since it will produce a prediction, only in a wider system context. This wider context includes all the relevant components of a system. But a simulation need not be just "another method" in the family of Predictive Analytics methods. It can be coupled with the other methods to create a more robust prediction of likely outcomes.
For example, consider a proposed price decrease for one product in a multiproduct enterprise. This product is under the control of one product manager who makes the pricing decision. A stand-alone perspective, a narrow scale-view, of the price cut would focus only on the sales impact for that product. How much will sales increase and, therefore, how much will revenue change? This is "stand-alone" in the sense that the question is narrowly focused based on the scale-view of the product manager. He is only interested in his product. His data scientist could estimate a price elasticity using any one of several different methods and with different types of data. I describe some methods for doing this in Paczkowski (2018). The elasticity, extracted from data, would show the impact on sales of the price change for that product.
If the elasticity is η_P^Q, then it is easy to show that the percentage change in revenue for the product is calculated as (1 + η_P^Q) times the percentage change in price. For one extracted factor, the price elasticity would provide at least two insightful and useful pieces of information to the product manager. See Paczkowski (2018, Chapter 2) for this revenue result.10 A simulation could help the product manager decide on the best price change since the price elasticity will only tell him the impact on sales of a specific price change. For example, suppose η_P^Q = −2.0. Then a 1% cut in price will increase sales by 2%.11 A simulator would allow him to evaluate different price points to see a range of effects on sales and revenue. The simulator plus the insightful and useful elasticity measure provides Actionable Information, so the product manager now has Rich Information.
10 The price elasticity is the percent change in sales divided by the percent change in price: η_P^Q = (dQ/Q)/(dP/P).
11 If η_P^Q = (dQ/Q)/(dP/P), then (dQ/Q) = η_P^Q × (dP/P). Therefore, 2% = (−2) × (−1%).
This may not seem needed for this simple example, but it should be clear that a simulator has a definite role to play in more complicated pricing situations.
Now consider the CEO. She has a wider scale-view, so she will want to know the full impact on the enterprise, not just on the one product. To her, the enterprise is a complex system that encompasses more than one product. If this is a multiproduct business, then changing the price of one product could have implications and ramifications for others in a line and across lines in the product portfolio. She also needs to know the larger financial impact, such as the effect on stakeholder wealth (i.e., the company's stock price). At this point, a simulation of the system would provide her with this Rich Information. The simulation enhances the predictive capability of whatever methods the data scientist used to develop the price elasticity.
Although there are several parts in the Business Analytics domain, my focus is only on Predictive Analytics. Predictions are extremely important because decision makers make decisions not for the past, but for the future. This is so obvious that you could be excused for not even mentioning it. Nonetheless, it is true. Predictive Analytics is concerned with what will happen under different conditions. The conditions could be initiated by the decision makers directly or in response to market changes (e.g., demographic changes; lifestyle changes; or sudden shifts in tastes and preferences, i.e., "fads") or competitive actions (e.g., a price change; a more aggressive promotional campaign; or an unexpected new product introduction). These decisions must be informed, not seat-of-the-pants, knee-jerk responses, because there is too much at stake: the future of the business, not to overlook or subordinate the welfare of the employees and suppliers. This last point is very important.
It is worth repeating that decisions regarding any aspect of the business, decisions informed by Predictive Analytics, have wide-ranging implications and ramifications (I&R) on all parts of the business. Except for the simplest of businesses (i.e., a proverbial, almost mythical "mom 'n pop"), most modern businesses are complex operations with many interacting parts. They are complex systems. There is no uniformly accepted definition of a complex system, but most seem to "loosely" agree on this: A Complex System is . . . composed of many interacting parts, such that the collective behavior of those parts together is more than the sum of their individual behaviors.12
Businesses as complex systems have many interlocking parts such as the ones I list in Table 1.1. Many of these interconnections and dependencies should be obvious. Some, maybe less so. For example, the forecasting function in a business may have responsibility for all levels of forecast development such as sales forecasts. There is a positive feed-forward loop from the forecasting team to the sales team. The sales forecast is input into the sales force’s compensation and sales quota for the next year. The sales executives will, of course, push back on the forecast if it implies negative effects on the sales team. There might be a negative feedback loop from
12 Source: Complex Systems: A Survey. M. E. J. Newman. Am. J. Phys. (2011, V. 79, pp. 800–810).
Table 1.1 These are just a few interlocking parts of a business as a complex system
• Finance • Marketing
• Pricing • Human resources
• R&D • Sales
• Manufacturing • Support
• Data science • Forecasting
the sales force back to the forecasting team. At the same time, the forecast is input into the manufacturing and order fulfillment divisions that will also have a negative feedback loop to the forecasting team and, probably, the sales team. These loops are part of the make-up of a complex business system. In a complex system, if a decision is made about one part (i.e., a narrow impact), it will have impacts, the I&R, for other parts (i.e., broad impacts)—desirable and otherwise! When a part of a system is studied, it is usually assumed that the rest of the system “is essentially uniform and that local details do not matter for the behavior of a system on a larger scale. These assumptions are not generally valid for larger system.” See Bar-yam (1999, p. 9). What might be considered a simple change with immediate benefits might have damaging, expensive repercussions elsewhere. You have to assume there are impacts on the larger system.
1.4 A Systems Perspective

Regardless of the complexity of a business, a decision based on a prediction about one aspect of it has implications and ramifications for its other parts, some of which may be undesirable. These interconnections are another theme of this book. The only way to grasp the full impact and scope of a decision based on the predicted impact on one component of the system is by taking a systems approach to the entire business. And the only way to determine the future full impact of a decision on the entire business system is by Simulation Analytics coupled with Predictive Analytics. This will allow the business decision makers to see the entire "picture." The entire picture is Rich Information.
Simulation Analytics, however, is only one part of a more general approach to understanding the I&R of a predicted impact of a decision. There are so many impacts that decision makers need even more information to help them decide what they should do. Business Analytics is now divided into two parts: Predictive Analytics and Prescriptive Analytics. The former tells them the likely outcome of a decision, albeit with a narrow focus. For example, a 1% decrease in price will result in a 2% increase in orders. This is a prediction with a narrow focus. Prescriptive Analytics helps them understand the larger I&R on the system so that they know what they should do. If the 2% increase in orders results in fulfillment issues, supply-chain management issues, personnel problems, increased inventory, and increased plant and equipment expenditures to handle the new orders, then is the 2% increase in orders worth it to the business?
Prescriptive Analytics has been treated and viewed as separate and distinct from Predictive Analytics; that is, they are considered to be substitutes. They are, instead, complements that, when used together, provide the Rich Information business decision makers need. The Deep Data Analytics I discussed in Paczkowski (2022b) thus consists of two tightly interwoven parts: Predictive Analytics and Simulation Analytics. The focus of this book, as a continuation of my previous one, is the melding of these two complementary Deep Data Analytics methodologies to provide Rich Information to business decision makers.
1.5 This Book's Focus

This book is focused on four themes that I developed in this chapter:

1. Information is embedded in data: data are not information per se.
2. Information must be extracted from data. I cover some extraction methods in Chaps. 3 and 4.
3. Decisions have system-wide effects; they are not localized to one product, product line, business unit, or geography. These larger impacts must be accounted for. I introduce complex systems in Chap. 2.
4. Simulations are (a) another form of Predictive Analytics; (b) they are another form of a data generating process (DGP); (c) they can be combined with prediction methods; and (d) they can guide what should be done, which is Prescriptive Analytics. Chapter 5 and onward focus on the simulations.
Chapter 2
A Systems Perspective
One of the four themes of this book is that information is buried inside data: data per se are not information, but contain information that must be extracted. There are many extraction methods that can be grouped into classes, as I have shown in Fig. 2.1. The two umbrella classes are Business Intelligence and Business Analytics. The former tells a decision maker what did happen or is currently happening. The latter tells him/her what will happen and, possibly, what should be done. These two classes are umbrella classes because each has more specific subclasses.
A subclass of Business Intelligence is Descriptive Analytics, which is the heart of Business Intelligence. This merely provides summary measures of business activity such as units shipped, orders fulfilled, or attrition rates. Some details may be included, such as trends and deviations from trends. Overall, however, from a decision-making perspective, this is Poor Information because it does not tell a decision maker what will happen as a result of a decision, or even whether the decision should be made.
Classification Analytics is part of Business Analytics. This also provides Poor Information. What do you do with a classification? It may be insightful, but not necessarily useful or actionable. For example, I could classify people as likely to default on a loan or not. This classification per se does not tell anyone what they should do. See Paczkowski (2022b) for a discussion of classification methods.
Another subclass of the Business Analytics umbrella is Predictive Analytics, which produces predictions or forecasts of the future result of a decision made today. More than the other information extraction methods, Predictive Analytics provides Rich Information. This class is bifurcated into simulations and predictions as specific methods.
The former provides a means of testing different scenarios of possible situations or states of the world (SOWs), while the latter tells a decision maker what will most likely happen in a particular SOW. The operative phrase is "most likely." No one can ever state with 100% certainty what will happen because of a decision. There is always a probability of an outcome. As a class standing on its own, predictions include statements about a future outcome and a prediction interval
Fig. 2.1 This illustrates a framework for information extraction methods. There is a hierarchical structure of the major extraction classes: Information Extraction Methods divide into Business Intelligence and Business Analytics; Business Intelligence contains Descriptive Analytics; Business Analytics contains Classification Analytics and Predictive Analytics; and Predictive Analytics splits into Simulations and Predictions, which together feed Prescriptive Analytics
for a range of possibilities around that outcome. However, when combined with simulations, the interval becomes even more forceful in guiding decisions. Taken together, they comprise the class of Prescriptive Analytics that tells a decision maker what action they should take in light of the scenarios. My focus in this book is on the predictions and simulations. Prescriptions are outside the scope of what I want to accomplish.
A decision based on a prediction has wide-ranging implications and ramifications, some of which may be undesirable, in a complex system. A business is such a system. A complex system contains interconnected and interacting parts, so a decision about one part affects the entire system, the only issue being the degree or magnitude of the effect. These interconnections and interactions are another theme of this book. The only way to grasp the full impact and scope of a decision based on the predicted impact on one component of the system is by taking a systems approach to the entire business. But once a systems approach is taken, rather than a narrow one, decisions take on a new dimension. I am concerned in this chapter with the nature of a system, and of a complex one in particular.
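The idea that a prediction comes with an interval of possibilities can be made concrete. The following is a minimal, hedged sketch: the sales numbers are invented, and 1.96 is the normal 95% quantile (with only ten observations a t-quantile would be more accurate), so treat it purely as an illustration:

```python
import statistics

# Illustrative monthly sales figures (invented numbers).
sales = [102, 98, 105, 110, 95, 101, 99, 104, 107, 100]

mean = statistics.mean(sales)
sd = statistics.stdev(sales)
n = len(sales)

# Approximate 95% prediction interval for the NEXT observation, assuming
# roughly normal data: mean +/- 1.96 * sd * sqrt(1 + 1/n).
half_width = 1.96 * sd * (1 + 1 / n) ** 0.5
lower, upper = mean - half_width, mean + half_width
print(f"point prediction: {mean:.1f}")
print(f"95% prediction interval: ({lower:.1f}, {upper:.1f})")
```

The point prediction is the single "most likely" statement; the interval is the honest admission that there is always a probability attached to any outcome.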
2.1 Introduction to Complex Systems

There are as many definitions of a system as there are those who research systems. One possible definition is that a system is a group of interacting or interrelated elements that act according to a set of rules to form a unified whole. A system, surrounded and influenced by its environment, is described by its boundaries, structure and purpose and expressed in its functioning. Systems are the subjects of study of systems theory.1
This is acceptable for any type of system, but what about a complex one? The adjective "complex" is very important because it implies a continuum of systems of different sizes, functionalities, and compositions. The adjective has a popular, everyday connotation of something that lies between order and chaos, as noted by Ladyman et al. (2013). They note that for many people, complexity means chaos, so they view the two as synonyms. As in most, if not all, technical and scientific areas, popular views of a concept are insufficient for the advancement of that area as well as of knowledge in general. Slisko and Dykstra (1997), for example, note the negative learning impacts on students of confusing and imprecise definitions of fundamental physics terms such as "temperature," "heat," "energy," and "relativistic mass." The notion that there is not a well-defined language and accepted set of definitions of basic terms is consistent with a more general view of language, especially English. McWhorter (2016), for example, argues that language is always changing, so words commonly used and accepted today to mean something specific did not have the same meaning when they were originally coined. A good example is the word "literally," which is commonly used to mean its opposite, "figuratively." But this is just a natural evolution of a language. As McWhorter (2016, p. 3) notes, "One of the hardest notions for a human being to shake is that a language is something that is, when it is actually something always becoming."2
Ladyman et al. (2013) list nine definitions or attempted definitions of complex systems. Some of these appear to refer to complexity per se and complexity theory in addition to complex systems. Is "complex" the adjective or the noun? This just adds to the confusion. In an attempt to narrow the concept, they list seven features of complexity and a complex system, which they then criticize one by one. These features are:

1. Nonlinearity;
2. Feedback;
3. Spontaneous order;
4. Lack of central control;
5. Emergence;
6. Hierarchical organization; and
7. Numerosity.
1 Source: https://en.wikipedia.org/wiki/System. Last accessed October 5, 2021.
2 Emphasis in original.
Nonlinearity refers to a situation in which “a change in the size of the input does not produce a proportional change in the size of the output.”3 In other words, if you double an input, you more than double the output. This is sometimes expressed as the whole exceeds the sum of its parts. In a linear system, the change in the output is proportional to the change in the input. A common measure in economics to capture this concept is elasticity. See Paczkowski (2018) for a detailed exposition of elasticities. Other related concepts in economics are increasing and decreasing returns to scale. Ladyman et al. (2013) agree that nonlinearity is a feature of complex systems, but they also believe that it is not a necessary condition. The same also holds, incidentally, for linearity. Feedback is a feature cited by Forrester (1968) in his systems textbook. Also, see Ladyman et al. (2013) on feedback loops. Feedback refers to a flow from one part of a system to another and then back again to the first. The loop runs forward from an origin and returns to that origin. One part of the system is the source, or parent, of an action that flows to another part, or child. The child responds or reacts and sends a response via another action back to the parent. I illustrate this in Fig. 2.2. In this simple feedback loop, an action is initiated by a decision maker based on the current information about the state of the system (SOS). The state is the current operation of the system. The action may be a price decrease and the information is the price elasticity. The decided action is sent as an Action Directive to an Action Processor that implements the action. In the case of a price change, the Action Processor could be the marketing or pricing department which has responsibility for managing prices. Once the action is implemented, the system state changes. 
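The proportionality test for nonlinearity can be shown in a couple of lines. The two functions below are my own illustrations, not anything from the text:

```python
def linear(x):
    return 3 * x          # output is proportional to input

def nonlinear(x):
    return x ** 2         # output is NOT proportional to input

# Doubling the input exactly doubles a linear output...
print(linear(4) / linear(2))        # → 2.0

# ...but quadruples this nonlinear one: the response is more than
# proportional to the change in the input.
print(nonlinear(4) / nonlinear(2))  # → 4.0
```

An elasticity is precisely the economist's summary of this ratio: the percentage response of output per percentage change in input.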
For the price change, this may be changes to the sales volume via an ordering process, as well as the manufacturing, fulfillment, invoicing, and accounting processes. Each sends information to an Information Processor responsible for collecting, organizing, and presenting the information about the SOS. This information could be in the form of a dashboard sent to the system decision maker to assess. This is a Business Intelligence function. The dashboard is Poor Information since it only provides a synopsis of the current SOS. This is important and necessary to have, but it is not sufficient. Then the loop is reignited and the whole process repeats. This is an example of a positive feedback loop because information is fed back to the decision maker. The loop in Fig. 2.2 is a closed, positive feedback loop because all actions that initiate or order a change are born inside and stay inside the system; the order is endogenous to the system. Only actions initiated by its past and not by outside influencers drive the system. In the case of a price change, there may be outside information about a competitive price action, but this is just information and not a direct action that impinges on the system. If, however, a government agency, such as a court, ordered or mandated a price change, then that order is an outside factor. It is exogenous to the system. The feedback is positive because, assuming the decision is a correct one, the action will allow the system to grow; a positive action will beget
3 Source:
https://en.wikipedia.org/wiki/Complex_system. Last accessed October 12, 2021.
Fig. 2.2 This illustrates a positive feedback loop for a simple system
a positive response and another positive action. If the price change produces an increase in sales and revenue, then this is positive and may result in further actions of a similar nature in the future. If the action has a detrimental effect causing the system to decrease or, in the extreme, stop working altogether, then the feedback is negative and you have a negative feedback loop. If the price change results in a decline in sales and revenue, perhaps because the elasticities were incorrectly estimated or interpreted, then the business could suffer. Unchecked positive feedback could lead to the demise of the system because it could be explosive. An action that produces a positive reaction could incent another positive action which leads to another positive reaction, and so on. The system could become so large that it eventually dies under its own weight. Referring to the price decrease example, if sales and revenue increase due to a price decrease, another price decrease could be ordered leading to more sales and revenue. However, in the limit, the price could be reduced to zero (which, hopefully, no one would do) so that products are given away for free. Revenue would drop to zero and the business would cease. Negative feedback could act as a balancing force to stabilize the system. The spontaneous order of the system refers to its ability to materialize without direction to fulfill a function or need. The invisible hand concept in economics is the best example of a spontaneous order system. Economies come about without any outside guidance or influence. An economic order arises, just as other social institutions have arisen such as language, by people gathering together and just doing something. The something could be exchanging in the case of economics, or developing a language in the case of communications. The spontaneous order, incidentally, does not have to be a desirable one. 
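The unchecked price-cutting loop just described can be mimicked with a toy simulation. This sketch assumes a linear demand curve Q = a − bP with parameters of my own choosing, not anything from the text: each period the "decision maker" cuts price again regardless of the outcome, so revenue first rises and then collapses along with the price:

```python
def revenue_path(a=200.0, b=10.0, p0=15.0, cut=0.10, periods=40):
    """Linear demand Q = a - b*P. Start at price p0 and cut the price
    10% every period -- an unchecked positive feedback loop."""
    revenues = []
    p = p0
    for _ in range(periods):
        q = max(a - b * p, 0.0)   # units demanded at the current price
        revenues.append(p * q)    # revenue this period
        p *= (1 - cut)            # the feedback: cut again, no matter what
    return revenues

rev = revenue_path()
print(round(rev[0]), round(max(rev)), round(rev[-1]))
```

Early cuts raise revenue, which is what invites the next cut; but as the price is driven toward zero, revenue collapses toward zero as well. A balancing (negative) feedback rule, stop cutting once revenue begins to fall, would stabilize the system.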
We often associate spontaneous with desirable because we, as a society, are developing something based on our community values, that is to say, based on endogenous values. This is opposed to the values, and therefore the order, imposed by some outside force or power’s
value, that is, exogenous values. See Mittermaier (2020, Chapter 5) for a detailed discussion of spontaneous order in economics.
The lack of central control is questionable in some systems. For most systems studied in complex system science, this is the case. An economic system is a good example. Ever since Adam Smith, economists have puzzled over, theorized about, and studied how an economy functions unaided by any outside controlling force. According to Smith, or at least the modern interpretation of Smith,4 if consumers and businesses are left to pursue their own self-interests, they will "as if led by an invisible hand" make all the correct decisions that will benefit themselves and, more importantly, benefit society as a whole. This is a powerful concept because, in the almost 250 years since he proposed it, it has been the mantra of many political movements and much philosophical discussion in economics and elsewhere.
Emergence is perhaps the most important factor of a complex system. To understand emergence, consider an important piece of the definition of a complex system: it has interconnected and interacting parts. They are not independent of each other, so they do not perform their functions in isolation from the other parts. The interconnections and dependencies should be obvious for the departments I list in Table 2.1, which is a repeat, for convenience, of Table 1.1. As an example, the forecasting department is responsible for forecasting all aspects of a business. Let us focus on a sales forecast for the next several quarters. Once the forecast is produced, it has to be used somewhere in the business; otherwise, there is no point in producing it! In and of itself, this forecasting function is useless. Now consider sales. The sales force, without any guidance or incentive, would not sell products in an amount needed to maximize profits. They would sell for their commission. The sales force needs a target, which comes from the forecast. The two work together.
The sales forecast sets the sales force's quotas and compensation. The sales executive either accepts or rejects the forecast. If it is accepted but is exceptionally high, then the sales force may be overworked and stressed to the point that sales personnel resign. In addition, compensation may not be set high enough to satisfy the sales force for their efforts, so that they will, again, leave. The sales executive may then be forced to hire and train new salespeople to backfill those who left, which will increase the sales budget requirements. This hiring and training will become the responsibility of the human resources (HR) department, but it will also involve the finance department, which monitors budgets. The interconnections and interactions in the system, and the complexity of the system, should be obvious from this simple description. All the units act in concert, and that concert is the business. Standing alone, they produce nothing; together, they produce revenue and shareholder value. This is emergence: the system produces something new that the stand-alone parts would not produce on their own.
4 See the Wikipedia article on the invisible hand at https://en.wikipedia.org/wiki/Invisible_hand#Economist's_interpretation. Last accessed October 18, 2021. Also see Vaughn (1989).
2.1 Introduction to Complex Systems
Table 2.1 This is a partial list of key parts of most businesses. While it is just illustrative, it does, nonetheless, show that there are many parts. These parts comprise a system that we call a business. This is a repeat of Table 1.1 for convenience
• Finance
• Marketing
• Pricing
• Human resources
• R&D
• Sales
• Manufacturing
• Support
• Data science
• Forecasting
This discussion of emergence is well summarized by: emergence occurs when an entity is observed to have properties its parts do not have on their own, properties or behaviors which emerge only when the parts interact in a wider whole.5
Bar-yam (1999, p. 10) notes that interpreting emergence this way leads to a serious misunderstanding because the "collective behavior [in the system] is not readily understood from the behavior of the parts. The collective behavior, however, is contained in the parts if they are studied in the context in which they are found." The behaviors are there; we just do not see them. There are three behaviors: random, coherent, and correlated. Random behavior has no fixed pattern, as you might expect from an intuitive understanding of the word "random." An example is a crowd of people: their movement from one moment to the next displays no rhyme or reason. Coherent behavior is "logically or aesthetically ordered."6 An example is people at an airport TSA security check-in station. They move single-file through a security queue until finally cleared to enter the main airport concourse and eventually a gate. This differs from random crowd behavior. Correlated behavior lies between these two: the behaviors are associated but not dependent. This is the same notion as the statistical concept of correlation: an association but not a cause-and-effect relationship. The ordering and fulfillment systems have correlated behaviors. The more orders taken, the faster they have to be fulfilled and shipped; but the speed of fulfillment also impacts orders: if fulfillment is slow, then customers will have to wait longer to receive their orders and are more likely to balk and not place an order to begin with. I discuss this balking in Chap. 12.

These behaviors can be observed at two levels, local and system, so a system is hierarchical. See Bar-yam (1999, p. 10). A local behavior is at a specific part of the system, while a system behavior is at the complex system level. A product manager and his staff for one product in a product line are at the local level and the CEO and
5 Wikipedia article on Emergence at https://en.wikipedia.org/wiki/Emergence. Last accessed May 19, 2021.
6 See the definition of coherent at https://www.merriam-webster.com/dictionary/coherent. Last accessed June 28, 2022.
Fig. 2.3 This illustrates network charts for two systems. The one on the left is a very simple system with only two parts. The one on the right is a more complex system. The numerosity feature of a system is evident in both: the numerosity is higher on the right
her executive staff are at the system level. A price change for a product managed by a product manager is at the local level; a merger is at the system level. It is important to emphasize that a hierarchical system consists of many parts, each at the local level with specific functionality to perform and fulfill in the system. In an enterprise, these parts are often called departments, organizations, or business units. Each has its own structure, usually hierarchical, with an executive vice president at the top and many functional areas beneath this executive level. Each of these functional areas is itself part of the whole department, so it is itself a system.

Finally, numerosity refers to the number of interacting parts in a system. An enterprise with only one part is not a system; two interacting parts constitute a minor or simple system; a very large number of interacting parts constitutes a complex system. As noted by Ladyman et al. (2013, p. 42), "many more than a handful of individual elements need to interact in order to generate complex systems." You can view the parts' interconnections as a network and use a network diagram, such as the one in Fig. 2.3, to illustrate the system. The simple system on the left in Fig. 2.3 may represent a mom 'n pop store with just an owner and two staff helpers. The one on the right may represent a larger manufacturing concern with a CEO in the center and three executive officers branching out from the CEO. The network on the right is more complex. The larger the network, the larger the degree of complexity.
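The numerosity idea can be made concrete with a small sketch. Here is a hedged illustration in Python: the department names and links are invented for the example and do not come from the book's figures.

```python
# A system's interconnections as an adjacency list (undirected edges,
# each edge stored once in each endpoint's neighbor set).
simple_system = {
    "owner": {"helper_1", "helper_2"},
    "helper_1": {"owner"},
    "helper_2": {"owner"},
}

complex_system = {
    "CEO": {"CFO", "CMO", "COO"},
    "CFO": {"CEO", "finance"},
    "CMO": {"CEO", "marketing", "sales"},
    "COO": {"CEO", "manufacturing", "HR"},
    "finance": {"CFO"},
    "marketing": {"CMO"},
    "sales": {"CMO"},
    "manufacturing": {"COO"},
    "HR": {"COO"},
}

def numerosity(graph):
    """Number of interacting parts (nodes) in the system."""
    return len(graph)

def n_edges(graph):
    """Count undirected interconnections (each edge appears twice)."""
    return sum(len(nbrs) for nbrs in graph.values()) // 2

print(numerosity(simple_system), n_edges(simple_system))    # 3 2
print(numerosity(complex_system), n_edges(complex_system))  # 9 8
```

Counting nodes and edges this way is only one crude proxy for complexity; a fuller network analysis would also consider paths, clustering, and hierarchy.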
2.2 Types of Systems: Examples

Systems have been studied and categorized in numerous ways. There are economic systems, business systems, robotic systems, and the list goes on. In this section, I want to highlight just a few, especially those in the business sphere or related to business decision making.
2.2.1 Economic Complex Systems

Complex systems have been studied in economics for centuries, at least since Adam Smith published his magnum opus, The Wealth of Nations.7 The two types of economies that are often discussed regarding national economies and international trade are:
1. Closed economic systems; and
2. Open economic systems.
A closed economic system is self-sufficient, with no interaction with its environment, the environment being other economies. This is traditionally used in a Principles of Economics course when international trade is introduced to highlight the deficiencies of a closed system and the advantages of trade. It is also a benchmark for an open economic system, which allows for trade. An open economic system, in contrast to a closed one, interacts with its environment, the environment being other nations and the interaction being trade. This leads to the important economic concept of comparative advantage.

An economic example of a system is the circular flow of income model. This is an old paradigm used in elementary economics textbooks to illustrate the interconnection of households and firms. Households provide labor services to firms that produce products sold back to the households. At the same time, money flows from the households in the form of payments for those products, while firms send that money back to the households in the form of wages and salaries. Without any outside intervention (e.g., another country), this circular flow represents a closed economic system. I illustrate one possible configuration of a closed circular flow in Fig. 2.4. See McClure and Thomas (2019) for some interesting discussion about the history of this paradigm and the distinction between closed and open economic systems.
2.2.2 Business Complex Systems

Businesses are another type of complex system, but the complexity depends on how you view the business. First, there are small, mid-sized, and large businesses. There are no official cutoff points to designate size, but, as a rule of thumb, the number of employees can be used. Gartner, for example, defines a business as small if it has fewer than 100 employees; mid-sized if it has from 100 to 999 employees; and large if it has 1000 or more employees.8 Regardless of size, they all (except
7 The full title of the book is An Inquiry into the Nature and Causes of the Wealth of Nations. It was published in 1776.
8 See "Small and Midsize Business (SMB)" at https://www.gartner.com/en/information-technology/glossary/smbs-small-and-midsize-businesses. Last accessed January 26, 2023.
Fig. 2.4 This illustrates a typical circular flow in income in a closed economic system. There are two flows: the goods/labor flow in the outer loop and the monetary flow in the inner loop
sole proprietor businesses) have interlocking parts. I am more concerned with mid- to large-size businesses, which I will often refer to as "enterprises." Enterprises, especially very large, multinational ones, not only have many interconnected and interrelated departments, but they also have many interlocking business units. A business unit, also called a strategic business unit (SBU), is a separate entity under the umbrella of a larger parent. An SBU has its own organizational structure, budget, marketing and production branches, and so on; it is its own business.9 Even within an SBU, a subsection or department could contain complex systems. A good example is the manufacturing department of an enterprise. This department could have an array of robotic stations that mechanically, but, of course, very precisely, produce the products. The array could be in a serial organization: each robot produces a component used as an input for the next robot until the last one completes the production. This, of course, mimics the old-style assembly line but with robots rather than people. Humans are still needed for maintenance, but the "hard work" is done by robots. The array of robots is a system.

For robotic manufacturing, not only is the array a system, but each robot is itself a complex system, albeit a mechanical one. There are computers associated with each one, circuits, sensors, and so on that synergistically operate together so that
9 See "Strategic Business Unit" at https://en.wikipedia.org/wiki/Strategic_business_unit. Last accessed January 26, 2023.
one robotic unit can produce what it is supposed to produce. If one part of a single robot’s system fails, then the whole robot fails and the entire array fails. I will return to the monitoring of such systems using sensors in Chap. 6.
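The fragility of a serial array can be quantified with the standard series-reliability calculation: the array works only if every robot works. This is a sketch under the assumption of independent failures, and the uptime figures are invented for illustration.

```python
from math import prod

def array_uptime(robot_reliabilities):
    """Reliability of a serial array: the product of the individual
    robot reliabilities, assuming robots fail independently."""
    return prod(robot_reliabilities)

# Hypothetical: ten robots in series, each with 99% uptime.
print(round(array_uptime([0.99] * 10), 3))  # the array is far less
# reliable than any single robot

# A single dead robot (reliability 0) brings the whole array down.
print(array_uptime([0.99] * 9 + [0.0]))  # 0.0
```

This is why the sensor-based monitoring discussed in Chap. 6 matters: in a serial system, overall reliability degrades multiplicatively with every part added.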
2.2.3 Other Types of Complex Systems

The economy is just one example of a complex system. Other examples are:
• ants;
• climate;10
• nervous systems;
• cells and living things;
• modern energy and telecommunication infrastructures;
• matter; and
• the cosmos
to list just a few.11 For my purpose, another example of a complex system, one frequently overlooked even though it is a piece of the larger complex economic system, is a business. My discussion above about the interconnections among the forecasting, sales, human resources, and finance departments exemplifies a business's complexity.
2.3 Predictions, Forecasts, and Business Complex Systems

A complex system is difficult, if not impossible, to visualize. To grasp the nature of a system, not necessarily a complex one, it is best to use a simplified representation. I show one in Fig. 2.5, which is a duplicate of Fig. 2.2 but with some minor changes for my current purpose. Assume that this system is a business with one product. It consists of a decision maker (e.g., the CEO) who assesses the current state of the business based on information provided through a dashboard. The dashboard is a graphical, interactive management tool that displays key performance metrics (KPM) for the business's operations. The information in the dashboard is gathered by the IT department and perhaps the team of data scientists who process and provide commentary on the data.

10 The 2021 physics Nobel Prize was awarded to three climate scientists, Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi, for their work on the Earth's climate complex system. See the 2021 Nobel Prize Physics announcement at https://www.nobelprize.org/prizes/physics/2021/summary/. Last accessed October 18, 2021.
11 This list was partly gleaned from the Complex System Society: https://cssociety.org/about-us/what-are-cs. Last accessed October 18, 2021.
Fig. 2.5 This illustrates a simple system with a feedback loop. This system could be a small business with one product. Note the similarity to Fig. 2.4
This dashboard presentation describes the SOS at this point in time. The information is Poor Information because it contains little actionable insight. The measures in the dashboard might be revenue, unit sales, contribution margins, rates of return, employee absences, stock prices, and so forth. Some of these may be compared to targets. For instance, the company's target unit sales may be 10,000 units on average per month, but the dashboard data indicate that for the past 6 months, only 5000 units have been sold on average. After reviewing the dashboard, the decision maker may decide that one KPM, unit sales, is below target and action is warranted to improve performance. The best and most direct way to quickly see an improvement in sales is by changing the product's price. The data scientists estimated the elasticity as −2.0. This Rich Information implies that a 1% price decrease will result in an expansion of sales by 2%. It also implies that revenue will increase by approximately 1%. An Action Directive is forwarded to the Action Activator who, in this example, may be the Chief Marketing Officer (CMO). This Action Activator lowers the price by 1% and sales increase. The SOS changes because the sales have changed in response to the price change. The new system state data are compiled by the IT staff, processed by the data scientists, and then incorporated into the next version of the dashboard. The decision maker then examines the dashboard as before and makes a new decision. And the process repeats.

Recall that the data scientists estimated that the product's demand is price elastic, implying that a 1% price decrease will stimulate unit sales by 2%. This is a prediction, which is another piece of information for the decision maker. The prediction is based on two sources of data:
1. the dashboard data plus other company data not part of the dashboard; and
2. external data on the competition, market, and economy.
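The elasticity arithmetic can be sketched in a few lines of Python. This is an illustration of the textbook percentage-change approximations, not code from the book, and the function name is mine:

```python
def responses(elasticity, pct_price_change):
    """Textbook approximations for small percentage changes:
    %dQ = elasticity * %dP  and  %dR = %dP + %dQ."""
    pct_qty = elasticity * pct_price_change
    pct_revenue_approx = pct_price_change + pct_qty
    # Exact revenue change, for comparison with the approximation.
    pct_revenue_exact = (1 + pct_price_change) * (1 + pct_qty) - 1
    return pct_qty, pct_revenue_approx, pct_revenue_exact

# Elasticity of -2.0 and a 1% price decrease, as in the text.
qty, rev_approx, rev_exact = responses(elasticity=-2.0, pct_price_change=-0.01)
print(qty)                   # quantity expands by 2%
print(round(rev_approx, 4))  # revenue up roughly 1%
print(round(rev_exact, 4))   # exact change is slightly smaller (0.98%)
```

The small gap between the approximate and exact revenue changes is one reason elasticity statements are usually reserved for small price moves.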
Fig. 2.6 This illustrates a simple system with a feedback loop that has a prediction input
These data are used in sophisticated econometric models to estimate the elasticity. See Paczkowski (2018) for a detailed discussion of elasticity models. The decision maker now has two sources of information:
1. the dashboard with system state information; and
2. the prediction from the data scientists.
One is based on system data and is thus endogenous to the system, while the other is exogenous to the system and originates with the data scientists. Both must be considered. The second, the predictive information, must be added to the system diagram, Fig. 2.5, which I do in Fig. 2.6.

The diagrams I have shown so far are incomplete because they do not show how the SOS changes because of the Action Directive. The only statement I have made is that sales have increased consistent with the prediction. The time path of the change is just as important as the change itself. There is a time dimension to the changes in system state variables. Nothing takes place outside the time dimension. The adjustment of sales, the pattern followed, is just as important to the decision maker and the system itself as is the system state at any one point in time. I illustrate some possible adjustment patterns in Fig. 2.7. Both panels of Fig. 2.7 have an equilibrium target value for the system state variable, which I show on the Y-axis. This variable could be the unit sales in my example, and the equilibrium value could be these sales adjusted by the 2% target increase from the data scientists. For example, if sales were previously 5000 units on average per month, the equilibrium value would be 5100 (= 5000 × 1.02). This could be interpreted as a long-run target.
Fig. 2.7 This illustrates possible system state variable change patterns over time
If the price decrease occurred at time t₀, the target could initially be overshot, perhaps because customers are enthusiastic about the price decrease. I show this in the left panel of Fig. 2.7. Customers, however, adjust to the new lower price, so their purchases are eventually reduced to yield the target sales number. The competition, of course, could also reduce its price to shift some demand to itself, further reducing the initial large sales increase. Regardless, the system state variable would decline exponentially to the long-run target. The exponential decline is reflected in a distributed lag model. The "lag" is a delay in a response, and the "distributed" refers to how that response is spread over successive periods. I discuss this type of model in Chap. 4.

I show oscillatory changes in the right panel of Fig. 2.7. The oscillations could be increasing so that sales eventually "explode" beyond a point the company could handle. Or they could go to zero, perhaps because customers come to believe that the lower price signals a deterioration of quality. See Monroe (1990) and Nagle and Holden (2002) for some comments about price and quality perceptions. Also, see Coad (2009) for a statistical analysis of price distributions for different product qualities.

The time pattern, and when the predicted long-run equilibrium will be reached, are important for the decision maker, not to mention the viability of the business. The business may not be able to survive if the long-run equilibrium value is not attained for several years or if the oscillating pattern produces unmanageable or zero sales. The decision maker needs a time-based prediction, that is, a forecast, of sales under the scenario of a 1% price decrease. The dashboard will not provide this. The prediction I described above will not provide this. Only a time-based forecast will. I describe some basic time-based forecasting methods in Chap. 3 and advanced methods in Chap. 4. Another input, a time-based forecast, must be added to the feedback loop.
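The adjustment patterns of Fig. 2.7 can be mimicked with a simple partial-adjustment recursion. This sketch is illustrative only: the adjustment-speed parameter and starting values are invented, and a real distributed lag model (Chap. 4) would be estimated from data.

```python
def adjustment_path(start, target, speed, periods):
    """Partial-adjustment recursion: S_t = S_{t-1} + speed * (target - S_{t-1}).

    0 < speed < 1: smooth exponential approach to the target;
    1 < speed < 2: damped oscillation around the target;
    speed > 2:     explosive oscillation.
    """
    path = [start]
    for _ in range(periods):
        path.append(path[-1] + speed * (target - path[-1]))
    return path

# Hypothetical numbers: sales overshoot to 5400 units, long-run target 5100.
smooth = adjustment_path(start=5400, target=5100, speed=0.5, periods=8)
wobble = adjustment_path(start=5400, target=5100, speed=1.5, periods=8)
print([round(s) for s in smooth])  # monotone exponential decline to 5100
print([round(s) for s in wobble])  # alternates above and below 5100
```

The speed parameter plays the role of the lag structure: it governs both the shape of the path (monotone vs. oscillatory) and how long the system takes to reach its long-run equilibrium.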
I show this addition in Fig. 2.8. The forecast would provide input to the dashboard, so it is additional information for the decision maker. The combination of prediction, forecast, and system state variables via the dashboard is the Rich Information I discussed in Chap. 1 and in Paczkowski (2022b).
Fig. 2.8 This illustrates a simple system with a feedback loop that has prediction and forecast inputs. Notice that the forecast would provide input to the dashboard
2.4 System Complexity and Scale-View

I described the system I illustrated in Fig. 2.5 as "simple." This is just one end of a continuum of business system configurations ranging from simple to complex. The whole array of possible systems is called complex business systems. But the basic concept of Fig. 2.5 is unchanged whether you call the system simple or complex, business or otherwise. There is a decision maker who directs an action that changes the system state. Many nonbusiness complex systems, such as an economy, do not have a decision maker. This should not stop you from imagining that one exists. Deacon (2013, pp. 49–56) uses the term homunculus, "any tiny or cryptic humanlike creature, ... as the enigmatic symbol of the most troubling and recalcitrant challenge to science: the marker for the indubitable existence of ententional phenomena in all their various forms, from the simplest organic functions to the most subtle subjective assessments of value." In short, the homunculus is what we invoke to explain what we otherwise could not explain. These have been "gods, demigods, elves, fairies, demons, and gremlins" in ancient times. See Deacon (2013, p. 53). In modern game theory, we normally assume that only people or firms play games. But you could play a game against Nature or the market, each being a homunculus. In the case of a complex system such as the economy or climate, the homunculus would initiate an action.

As the system grows more complex, the questions become the focus of the decision maker and his or her ability to process all the information, Rich and otherwise, to make credible decisions about the whole system. The more detail the decision maker focuses on, the narrower the scope or scale of viewing the system, the scale-view, a term I introduced above; the less detail, the broader the scope.

Fig. 2.9 This figure illustrates the inverse relationship between the complexity of the system and the scale of viewing it. The smaller the scale-view, the more detail is revealed, and the more complex the interactions in the system that have to be managed

The scope of the scale-view determines the nature and type of decisions that are the Action Directives. The complexity of the system, the level of detail that has to be managed, varies inversely with the scale-view. Generally, a small scale-view relates to a complex view of the system, and a large scale-view relates to a less complex view of the system. For example, a large multiproduct, multiplant international business, when viewed at the scale of individual operations, a small scale-view, would appear to be very complex. There are many different departments and personnel that interact. When viewed from the scale of the entire enterprise, the level of complexity is reduced for both decision making and management. The decision maker of Fig. 2.5 has a different perspective of the business and a different set of issues to manage for a large scale-view than for a small one. I illustrate this inverse relationship in Fig. 2.9.

The lower left panel of Fig. 2.9 shows the relationship between the scale-view and the behaviors observed to take place in a system. These behaviors could be clerical, supervisory, managerial, analytical, and R&D activities, to mention a few. The more detail observed, that is, the smaller the scale-view, the more behavioral activities are observed. The less detail, that is, the larger the scale-view, the fewer behaviors are observed. The CEO observing his or her executive staff sees only their activities and their behaviors. The CEO's scale-view is large. A manufacturing plant manager for the same company has a much smaller scale-view of the company's operations, the operations in that plant. This manager would see many different types of activities and behaviors, ranging from plant line workers to loading dock workers, inventory
management personnel, clerical support staff, maintenance personnel, and so on. The CEO does not see these people and is probably even unaware of them personally and of what they do. This is not to disparage or demean the CEO or to dismiss what they do as trivial or unimportant; the scale-view, the perspective, is just different. This also does not mean, of course, that those behaviors at the lower scale-view have disappeared when the larger scale-view is considered. On the contrary, they are still real, and they are still there, just not observed by the one viewing the system from a large-scale perspective.

There is a negative relationship between behaviors observed and scale-view. This relationship is not linear, but negative exponential. To be linear, there must be a scale level at which no behaviors are observed. But this is unrealistic. Even at the CEO level, there are still some behaviors, the executive staff I previously mentioned at a minimum. This may be a small and lean staff, but it exists and performs tasks such as preparing reports and presentations, providing legal counsel, and so forth. The curve should asymptotically approach a lower limit for behavioral activity, but not zero. At the same time, the curve should not terminate at an upper limit of behaviors because everyone typically performs multiple tasks and has multiple assignments and duties. So, as the scale approaches zero, the number of observable behaviors asymptotically approaches the vertical axis. I show this in the lower left panel of Fig. 2.9.

The upper right panel of Fig. 2.9 shows the relationship between these behaviors and the complexity of the system. The more behaviors observed, the more complexity is perceived. At the manufacturing plant level, many behaviors can be observed, whereas the executive suite has few staff members and fewer behaviors, as I have mentioned several times. The plant is more complex because of all the behaviors; the executive suite is less complex.
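The negative-exponential behavior curve just described can be written down directly. The functional form and all parameter values below are my own illustrative choices, not estimates from the book:

```python
from math import exp

def observed_behaviors(scale_view, ceiling=200.0, floor=5.0, rate=1.0):
    """Negative-exponential curve: observed behaviors decline as the
    scale-view widens but asymptote to a positive floor (even the CEO
    still observes *some* behaviors). Parameter values are illustrative."""
    return floor + (ceiling - floor) * exp(-rate * scale_view)

# Behaviors shrink as the scale-view widens, but never reach zero.
for scale in [0.0, 1.0, 2.0, 4.0, 8.0]:
    print(scale, round(observed_behaviors(scale), 1))
```

The floor parameter captures the lower asymptote in the text (the lean executive staff), and the absence of an upper terminating point is reflected in the curve rising toward the vertical axis as the scale approaches zero.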
This is not to imply that the management of the executive suite is itself not a complex operation, only that it has fewer behaviors to manage, which implies a lower level of complexity. Managing 1000 people at the plant level, along with shipments, inventory, schedules, maintenance issues, and so forth, is a level of complexity higher than at the executive suite with a staff of five people.12 Similar to the panel in the lower left of Fig. 2.9, I expect the curve to be exponential rather than linear but with a positive slope. Linearity implies terminating points that seem unrealistic.

The panel in the lower right of Fig. 2.9 is a simple 45° line that connects the lower left and upper right panels. This diagrammatic tool works by connecting points on the lower left and upper right panel curves via this 45° line. The result of "connecting the dots" across the three panels is the final panel in the upper left of Fig. 2.9. This is the relationship between scale-view and complexity. The larger the scale-view of the system, the lower the level of complexity observed. The larger the scale-view, the less detail is evident. The smaller the scale-view, the more detail is evident, and thus the more complex the system appears to be. It is
12 My personal experience at AT&T was that whenever I went to the executive floor, I was always struck by how calm everyone appeared and how quiet the floor was. The lower floors, however, were marked by a lot of hustle and bustle and noise.
important to note that the scale and detail are a point of view, while the complexity is a perception of what the system is doing and how it operates. The system itself does not change because the scale at which we view the system changes; our perception of the system changes. It is still the same system.13

The inverse relationship, a complexity profile, in the upper left panel of Fig. 2.9 is only one of many possible configurations. Siegenfeld and Bar-Yam (2020) present and discuss other patterns and how they could develop in applications. Also see Bar-yam (1997). The major point, however, is the inverse relationship.

How the decision maker views the complex system, and the level of detail that is managed, determines how that system has to be analyzed. From a predicting point of view, a different set of predictive tools is needed. The simple system of Fig. 2.5 may require simple tools, such as the ones I will discuss in Chap. 3, or maybe advanced ones such as those in Chaps. 4 and 5. But when the complexity rises, something else is needed: a tool for predicting the operations of the complex system. This is where simulations are introduced. Simulations are a predictive tool for a complex system, or at least for a portion of that complex system. The arsenal of predictive tools includes the ones I will discuss in Chaps. 3–5 plus simulations, which I introduce beginning in Chap. 7. Which is used depends on the scale-view of the decision maker, not the data scientist creating the prediction. For a very large scale-view (e.g., the whole enterprise), the tools of Chaps. 3 and 5 are still appropriate. Even for a very small segment of the enterprise with little complexity (e.g., the demand for a single product), they are appropriate. But when a more complex view is required, then simulations are required.
2.5 Simulations and Scale-View

Simulations, as a form of Predictive Analytics, are concerned with the entire system (or at least a reasonably large, significant part of it). Since a system is composed of many interconnected and interacting parts, a predictive model or modeling framework would be overwhelmed by the intricacies of the system. However, predictive models can still be used depending on the scale-view. For a small scale-view, the complexity of the system would make predictive modeling almost impossible. For a large scale-view, the melding would be more tractable and yield greater insight, albeit at a high level. Simulations, as a predictive tool, would be too complicated at the large scale-view but very powerful and beneficial at the small scale-view.
13 This is comparable to Einstein’s famous thought experiment of a stone dropped on a train platform. The perceptions of the stone’s movement and timing differ depending on the observer’s point of view, but the stone still fell regardless of the point of view. See Weinert (2005, p. 64) for an example. Especially see Lake (2018) for interesting and useful charts illustrating the concept. Finally, see Einstein (1961) for his own account of this thought experiment.
There is a caveat, however. The scale-view can be defined at any level of an organization; it is not a fixed concept, immutable by the decision makers, but definable and malleable by them for their purposes. The CEO could request an analysis of supply-chain flows across multiple business units, which would be a scale-view below what she normally considers. Simulations might then be appropriate. Conversely, a plant manager could require an analysis of production processing and fulfillment for one segment of the manufacturing facility, encompassing material handoffs from one production group to another downstream group. This is a smaller scale-view than what he normally manages, yet complicated enough that the usual predictive methods are insufficient. The use of simulations vs. predictive methods depends on requirements as well as the scale of the analysis.

The issue of the scale-view, while simple to explain in a book such as this, is more difficult to work with from a practical perspective in modern, real-world contexts. As noted by Brailsford et al. (2010, p. 2293), communications speed and global organization complexities have blurred the distinction between strategic, tactical, and operational decisions. A tactical decision (e.g., a price change) could have major strategic implications (e.g., a change in business direction). Where does one end and the other begin? The implication is that a simulation may have to be broad and narrow at the same time: broad enough for a strategic view, yet narrow enough for a tactical or operational view. A hybrid approach may be required.

A simulation, as a predictive tool, is analogous to a predictive model. Both require inputs and produce output. You will see this for the predictive methods I will discuss in Chaps. 3–5. For a simulation, the major input is knowledge of the system itself and the processes that have to be modeled.
Both are also based on parameters, or fixed values, that represent standard features of a part of the system. A predictive model has a few, although that "few" could still be dozens. This means that examining only a few parameters, that is, a small part of a system, overlooks the whole system's I&R. If you use traditional predictive analytical methods, you will miss the larger picture. As noted by Bar-yam (1999, p. 9), a simulation is different in that it has far more parameters precisely because it deals with the whole system, while a predictive model deals with only a part of the system; computer simulations must "keep track of many parameters." At the same time, a simulation may need input from a prediction model or even provide input to a prediction model. Regardless, a simulation provides input to the decision maker no different from what predictive methods provide. I illustrate this in Fig. 2.10.
Fig. 2.10 This illustrates a simple system with a feedback loop that has a simulation component. The simulation takes input from the system itself in the form of a description of the system processes. It then provides direct input to the decision maker no different than the predictions. The dashed lines indicate that the flows are tentative and depend on requirements
Part II
Predictive Analytics: Background
This second part of the book begins the development of Predictive Analytics methodologies. After reading this part of the book, you should be able to conduct most prediction tasks typical in a business context. If you have my previous book, Paczkowski (2022b), or have some econometrics training, you will be familiar with this material. Do not worry if neither applies to you since I will present and develop the necessary background material you will need.
Chapter 3
Information Extraction: Basic Time Series Methods
All data share a common structure, and all models are meant to reflect that structure. The commonality is expressed as

Data = Information + Noise.    (3.1)
The information is the main message or factor of importance contained in the data. This is sometimes called the signal, but I prefer information. Equation (3.1) may seem simplistic, but it is not. The goal of data analytics is the extraction of information from data for use in decision making. This extraction requires a toolkit, but the toolkit has a huge number of extraction methods, each developed to handle a different class of problem. The large number of available extraction methods, which can be classified by data type and extraction objectives, presents a problem. The data types are time series and cross-sectional data.¹ Time series data are for one analytical unit measured across time, for example, monthly same-store sales for a national retailer. Cross-sectional data are measures on several analytical units at one point in time, for example, average grocery store dollar-sales volume by state in 2022. The extraction objective can be divided into description or application. The former is just basic descriptive statistics such as means and proportions. To risk being repetitive, this is Poor Information that may be just insightful. The latter can be classification (e.g., assigning customers as shoppers or non-shoppers, credit applicants as risky or not, and products as defect-free or not), explanation (e.g., what determines the number of orders received per month), or prediction (e.g., how much will be sold in the next four quarters). This is Rich Information. My focus is the prediction application. In this chapter, I will begin to focus on predictive methods because a decision maker should know the effect of any and all decisions regardless of the perceived importance of the decision. I will cover basic time series forecasting methods in this chapter, more advanced ones in Chap. 4, and non-time series ones in Chap. 5. I give time series methods more emphasis because most decisions are in time or have a time-associated question attached to them. For example, a product manager may want to know the time pattern of retail sales by month for the next year. Or, the product manager could ask: "How long will it take for the full effect of a price decrease to be realized?" This chapter will be somewhat theoretical. It is important to know the theoretical foundations of the tools in your Predictive Analytics toolkit so that you know the correct one to use for a particular task. Not all analytical methods are appropriate for all situations. The methods mostly differ in what they do and how they perform for different problems, but they also differ, sometimes only slightly, in their output. These slight differences could greatly affect interpretations and recommendations to decision makers.

¹ There is also a combination of the two. This is called panel data or longitudinal data. I will not discuss this type.
3.1 Overview of Extraction Methods

There are many information extraction methods. Lepeniotia et al. (2020) list several possibilities divided into three categories with subdivisions:

Probabilistic Methods These are concerned with the probability of an event happening. Other methods also deal with probabilities (e.g., logistic regression, which I develop in Chap. 5), but this class differs in that the probabilities are based on causal relationships among events. Historical data are downplayed but still used, just not as extensively as with other methods. The main methods are Bayesian networks, Markov chain Monte Carlo (MCMC), and the hidden Markov model.

Machine Learning/Data Mining Methods These heavily rely on historical data. The data are usually split into training and testing data sets. Models are then trained with the training set and tested with the testing set. In some aspects of machine learning (e.g., clustering), algorithms are the main tool. These are unsupervised learning methods. Supervised learning methods rely on a target variable; unsupervised methods do not. The subdivisions are extensive and include decision trees and their variations, pattern recognition, support vector machines (SVM), artificial neural networks (ANN), and clustering-based heuristics, to mention a few. See Lepeniotia et al. (2020) for more examples and discussion. I comment on decision trees and ANNs in Chap. 5.

Statistical Methods These frequently overlap with many machine learning methods. They involve the detailed use of statistical and econometric methods to estimate unknown parameters in parametrically specified models. The main exemplar is the regression family, which is statistically based but is also used in machine learning, hence the overlap. The family members are purely data-driven using historical, survey, or experimental data. The regression family is large with members linked together by link functions. I will explain a link function below. See Alpaydin (2014, Chapter 4) on regression in machine learning and also Paczkowski (2022b). The problem with the statistical methods is that all too often only simple statistics (e.g., means, proportions, standard errors) are calculated and displayed as simple charts (e.g., pie and bar). This is Shallow Analysis resulting in Poor Information. More sophisticated Deep Analysis, yielding Rich Information, is done using regression approaches such as multiple linear regression, logistic regression, multinomial logistic regression, support vector regression, autoregressive integrated moving average methods, and density estimation, to mention a few. See Lepeniotia et al. (2020) for more examples and discussion. Also see Paczkowski (2022b).
3.2 Predictions as Time Series

The majority of business decisions are time-oriented. Hence, the predictions (i.e., forecasts) needed, especially simulator-based predictions which I will discuss later, are time-oriented. The data are time series reflecting a dynamic process with changes or developments taking place over time. For example, if a decision is made to reduce a price, it is useful to know how much sales will increase as a result. This is given by an elasticity. This is Poor Information. But it is more useful, and more informative, to know when sales will begin to increase, when (and if) they will plateau, and at what amount they will plateau (i.e., if and when they reach a steady-state level). In economic terminology: when will a new equilibrium be reached? This is Rich Information. I will revisit the notion of a steady-state time series in Chaps. 7 and 8 where I discuss the time series results of a simulation. Data scientists could and do use time series in their work, for example, sales, prices, revenue, contribution margins, hiring, attrition, inventory levels, and product returns, and the list goes on and on. Time series analysis methods can be applied to any time series, regardless of source (i.e., databases or simulations). The source is irrelevant. My focus is on time series predictions (i.e., forecasts) coupled with simulations.
3.3 Time Series and Forecasting Notation Let .Y0 , Y2 , . . . , Yt−1 , Yt , Yt+1 , . . . , YT be a times series from period 0 to period T where 0 denotes the epoch (i.e., the beginning of time; the first period available; the beginning of time for data storage) and T is the most recent or last period available. You can interpret your data set’s epoch as the first observation. Many computer systems and software, statistical and otherwise, use the Unix epoch which is one second past midnight on January 1, 1970. Time is recorded as the number of seconds
52
3 Information Extraction: Basic Time Series Methods
after the epoch. After that, the periods may be daily, weekly, monthly, quarterly, or annually. This may be inconsequential except when you have to do calendrical calculations. See Dershowitz and Reingold (2008) for a comprehensive treatment of calendars and calendrical calculations. The series is a history of an actual process such as weekly sales, length of time each month to fulfill an order, and so on. I refer to this history as the actuals. A prediction or forecast made at .t = T for the first period outside the history of actual observations, that is, for period .t = T + 1, is denoted as .YT (1). This is a onestep ahead forecast for .t = T + 1 made in period T . The h-step ahead forecast, .YT (h), is the forecast for .t = T + h made in period T . The set of forecast values .{YT (1), YT (2), YT (3), . . . , YT (k)} is a forecast profile where k is the maximum forecast horizon needed for any rational business decision. For example, assuming annual data, you would use .k = 5 years for a five-year business plan and .k = 30 years for a major capital investment such as a new production facility or relocation of corporate headquarters from one state to another. Clearly, .1 ≤ h ≤ k.
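To make the epoch concrete, here is a minimal Python sketch (my own illustration; the dates are arbitrary, not from the text) showing how Unix time counts seconds after the epoch:

```python
from datetime import datetime, timezone

# The Unix epoch: 1970-01-01 00:00:00 UTC. Unix time is the number of
# seconds elapsed since this instant.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# An arbitrary, illustrative timestamp
t = datetime(2022, 1, 1, tzinfo=timezone.utc)
seconds_after_epoch = (t - epoch).total_seconds()
```

For calendrical work (weeks, months, quarters) you would convert these second counts to calendar periods, which is where the complications discussed above arise.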
3.4 The Backshift Operator: An Overview

A major property of a time series is that a value of the series at one point in time is related to previous values in the same series. Each value does not appear anew each period but is the result of a data generating process (DGP) that involves values from previous periods. These are lagged values. This is the lag structure I mentioned in Chap. 2. So, if Y_t is the value in period t, then it is generated by, or related to, the previous value, Y_{t-1}. By the same logic, this lagged value is itself related to the one before it, or Y_{t-2}, which means that Y_t is also related to Y_{t-2}. This lagging relationship can be continued backward in time. Trying to express all the relationships through time can become cumbersome. In addition, trying to understand the theoretical and statistical properties of a time series is complicated by this lagging structure. The backshift operator is an efficient way to handle both the expression and analysis problems. The backshift operator is represented by B and, as an operator, it operates on or is applied to a time series in a simple manner.² It simply shifts the time index back one period:

BY_t = Y_{t-1}.    (3.2)
Notice that if Y_t = c is a constant, then Bc = c, so effectively B = 1 in this case. The operator can be applied recursively with the number of recursions represented as the exponent on B. For example,

B^2 Y_t = B(BY_t)    (3.3)
        = BY_{t-1}    (3.4)
        = Y_{t-2}.    (3.5)

B^2 does not mean that B is squared. It means that B is applied twice. Therefore, B^m Y_t = Y_{t-m}. If c is a constant, then B^m c = B^0 c = c. You can use some basic algebra with B. For example, you can show that

(1 - φB)^{-1} = 1 + φB + φ^2 B^2 + φ^3 B^3 + ...    (3.6)

² Some authors use L and refer to it as a lag operator.
This is a geometric series. I provide a simple demonstration of this expansion and similar useful ones in the Appendix. There are other manipulations of the backshift operator which I will not use here. See Paczkowski (2022b) for some examples. Also, see Dhrymes (1971, Chapter 2) for a highly technical discussion of this operator.
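The backshift operator has a direct computational analogue: lagging a series. As a sketch (my own illustration, not from the text), pandas' shift() plays the role of B:

```python
import pandas as pd

# A short illustrative series Y_1, ..., Y_4
y = pd.Series([10, 12, 11, 13], index=[1, 2, 3, 4], name="Y")

b1 = y.shift(1)  # B Y_t   = Y_{t-1}
b2 = y.shift(2)  # B^2 Y_t = Y_{t-2}; shift(2) applies B twice
```

The first one (for b1) or two (for b2) entries are missing (NaN) because no observations exist before the series' epoch.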
3.5 Naive Forecasting Models

It may be surprising, but many forecasters use simple techniques, especially naive methods and smoothers. The most naive prediction technique uses the current period's actual value as the best forecast for the next period. The prediction is the last actual value. A smoothing predictive technique uses weighted sums of data to create a prediction or to smooth irregularities in data. I will discuss both in this section. A naive forecast model uses the current period's actual value for the one-step ahead forecast. The current period is the last one in the series, or Y_T. The information in the data series is assumed to be all in this last observation, Y_T, so only that information is used for a forecast. You can write the forecast as

Y_T(1) = Y_T.    (3.7)
This is sometimes called the Naive Forecast 1 (NF1). See Levenbach and Cleary (1981) and Levenbach and Cleary (2006). The simplistic assumption is that what will happen tomorrow is the same as what already happened today, hence the naivete. This is reasonable for operations with a small, steady demand so that inventory fluctuations are small. So, it may not be that naive after all. Since Y_T is the best forecast of Y_{T+1}, the best forecast of Y_{T+2} is the one for Y_{T+1}, which is simply Y_T. This can be expanded as Y_T(2) = Y_T(1) = Y_T; Y_T(3) = Y_T(2) = Y_T(1) = Y_T; and so on. The forecast profile is just a series of Y_T values. A naive h-step ahead forecast is then a repetition of the one-step ahead forecast since nothing else is known beyond period T, so

Y_T(h) = Y_T, ∀h ≥ 1.    (3.8)
The technique has four obvious flaws. The first is that each time a new actual becomes available, a new forecast must be generated. The forecast then becomes a moving target, constantly changing. If the differences in the actuals from one period to the next are small so that you could consider them to be zero, then this technique will suffice. Otherwise, other techniques should be used. The second flaw, associated with the idea that a new actual will change the forecasts, perhaps dramatically, is that the process that led to the actuals is not considered. How did you get to that last actual value, Y_T? If there was a decrease (i.e., Y_T - Y_{T-1} < 0) that brought you to the last actual, then it might be reasonable to believe that the decrease will repeat. Similarly for an increase (i.e., Y_T - Y_{T-1} > 0). A modification to the NF1 in (3.7) to account for these changes is

Y_T(1) = Y_T + p(Y_T - Y_{T-1})    (3.9)

where p is the proportion of the change from T - 1 to T you wish to include. What is the basis for p? Pick a number; there is no guiding rule. You can calculate a proportion from the last two observations, but it will always change. This is Naive Forecast 2 (NF2). I show an example of NF1 and NF2 in Fig. 3.1. Despite these issues, the naive forecast is sometimes used as a benchmark, a "know nothing else" forecast. The third flaw of this method is that it does not allow for any random variation or noise in the data. Recall that Data = Information + Noise. For NF1 and NF2, Data = Information, which is difficult to accept. Finally, the fourth flaw is that the entire time series is ignored; only Y_T is used (and maybe Y_{T-1} for NF2). This is a tremendous loss of data and whatever information is contained in the entire data set since the information is naively assumed to be encapsulated in Y_T. Something better is needed to handle this last flaw.
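The two naive forecasts are trivial to compute. The following sketch reproduces the Fig. 3.1 arithmetic with p = 0.5, where the last two actuals are 1035 and 940; the earlier values are hypothetical fill-ins:

```python
def nf1(actuals):
    # NF1 (Eqs. 3.7-3.8): the h-step ahead forecast repeats the last actual
    return actuals[-1]

def nf2(actuals, p):
    # NF2 (Eq. 3.9): last actual plus proportion p of the most recent change
    return actuals[-1] + p * (actuals[-1] - actuals[-2])

daily_sales = [990, 1010, 1035, 940]  # days 1-4; first two are hypothetical
f_nf1 = nf1(daily_sales)              # 940
f_nf2 = nf2(daily_sales, p=0.5)       # 940 + 0.5*(940 - 1035) = 892.5
```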
3.6 Constant Mean Model

The constant mean model addresses the basic statement that Data = Information + Noise, although it is just as naive as NF1 and NF2. It does serve, however, as the foundation for advanced models that are not naive and are used, as I will mention in Chap. 12, in many practical applications. You can view this method as a base model from which many others are specified. It is

Y_t = μ + ε_t,    (3.10)
t = 1, 2, ...    (3.11)
ε_t ~ N(0, σ_ε^2) ∀ t    (3.12)
Fig. 3.1 These are two-step ahead naive forecasts for daily sales of a product with the last actual data point on day 4 of a week. Both the NF1 and NF2 forecasts are shown. For this example, I used p = 0.5 for the NF2 method so that 892.5 = 940 + 0.5 × (940 - 1035)
Cov(ε_t, ε_{t+s}) = 0, s = ±1, ±2, ...    (3.13)

where Cov(·) is the covariance. A covariance is always between two, and only two, random variables. See the Appendix for some information about a covariance. I display this model in Fig. 3.2. See Gilchrist (1976). This model is a combination of a constant part, μ, and a random or stochastic part, ε_t. The μ is the mean or expected value of the actual time series. That is, E(Y) = μ. It is the information; the noise is the ε_t, which is assumed to be normally distributed with mean 0 and variance σ_ε^2 and zero covariance between any pair of noise elements.³ The stochastic part is called white noise. See the Appendix for a

³ Cov(ε_t, ε_{t-1}) = 0 is commonly assumed. I will examine this case later.
Fig. 3.2 This is what the time series data might look like for a constant mean model. The horizontal line is at the mean, indicated in the descriptive statistics table, and the vertical line to the far right is at time t = T
review of the expected value and variance. Since the mean is constant, the actual future value in period t + k, Y_{t+k}, for any t (including t = T) is just the same model but for period t + k:

Y_{t+k} = μ + ε_{t+k}.    (3.14)

From this, E(Y_{t+k}) = μ and V(Y_{t+k}) = σ_ε^2, ∀k, where V(·) is the variance.
3.6.1 Properties of a Variance

There are several general properties about the variance of the random term, ε_t, that are helpful for you to know. For any random variable, X, these are

V(cX) = c^2 V(X)    (3.15)
V(X_1 + X_2) = V(X_1) + V(X_2) + 2Cov(X_1, X_2)    (3.16)
V(X_1 + X_2) = V(X_1) + V(X_2) if Cov(X_1, X_2) = 0.    (3.17)
See the Appendix for proofs of these properties.
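You can also sanity-check these properties numerically. This sketch (my own, using simulated draws) verifies (3.15) exactly and (3.17) up to sampling error for two independently generated series:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 100_000)
x2 = rng.normal(0.0, 2.0, 100_000)  # drawn independently of x1

# V(cX) = c^2 V(X): Eq. (3.15), an algebraic identity for sample variances
v_scaled = np.var(3 * x1)
v_rule = 9 * np.var(x1)

# V(X1 + X2) = V(X1) + V(X2) when Cov(X1, X2) = 0: Eq. (3.17); in a sample
# this holds only approximately because the sample covariance is near zero,
# not exactly zero
v_sum = np.var(x1 + x2)
v_parts = np.var(x1) + np.var(x2)
```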
3.6.2 h-Step Ahead Forecasts

First, I will discuss a one-step ahead forecast, Y_T(1). For this, you must assign values to both μ and ε_{T+1}. The random variable ε_{T+1} is, by definition, independent of the available information in the time series, and, hence, you cannot forecast its future value. You can only state its expected value: 0. Your best forecast of the noise is ε_{T+1} = 0. The μ is constant over all time and so an estimate of its value at the present time will also be an estimate of its value at a future time. The most natural estimate of μ is the sample mean. A one-step ahead forecast at t = T is then

Y_T(1) = Ȳ    (3.18)
       = (Σ_{i=1}^T Y_i)/T.    (3.19)
So, the entire time series is used, unlike for NF1 and NF2. This forecast, however, is still Poor Information for decisions. What is Y_T(2)? Since the mean is constant, Y_T(2) requires ε_{T+2}. The best forecast of this is zero, so Y_T(2) = μ, which is estimated by Ȳ. But this is the same estimate as for Y_T(1). This would be the case for every forecast. Therefore, Y_T(h) = Ȳ, ∀h ≥ 1. Several useful statistical properties of the constant mean model are instructive to know. You can re-express the one-step ahead forecast as

Y_T(1) = (Σ_{i=1}^T Y_i)/T    (3.20)
       = (Σ_{i=1}^T (μ + ε_i))/T    (3.21)
       = μ + (Σ_{i=1}^T ε_i)/T.    (3.22)
58
3 Information Extraction: Basic Time Series Methods
The random noise is included as an average, unlike for NF1 and NF2. Taking expectations, you get

E[Y_T(1)] = μ.    (3.23)

Thus, the one-step ahead forecast is unbiased. Since you are forecasting Y_{T+1}, you can expect that your forecast will differ from the actual for Y_{T+1} when it is ultimately available. Your forecasts will always be in error. The question is the size of the forecast error. The forecast error is the difference between the actual observation in period T + 1 and the forecast for this period. The error is

e_{T+1} = Y_{T+1} - Y_T(1).    (3.24)
You can substitute the forecast and collect terms to get

e_{T+1} = Y_{T+1} - Y_T(1)    (3.25)
        = [μ + ε_{T+1}] - [μ + (Σ_{i=1}^T ε_i)/T]    (3.26)
        = ε_{T+1} - (Σ_{i=1}^T ε_i)/T.    (3.27)
Since E(ε_{T+k}) = 0, ∀k, you can see again that the forecast is unbiased for all future values. Using (3.22) and the trivial fact that V(μ) = 0 since μ is a constant, you have

V[Y_T(1)] = V[(Σ_{i=1}^T ε_i)/T]    (3.28)
          = (1/T^2) V(Σ_{i=1}^T ε_i)    (3.29)
          = (1/T^2) Σ_{i=1}^T σ_ε^2  by independence    (3.30)
          = Tσ_ε^2 / T^2    (3.31)
          = σ_ε^2 / T.    (3.32)
You should recognize this as the variance of the sample average, V(X̄), from basic statistics. Since the mean error is zero because the forecast is unbiased, the mean square error (MSE) equals the variance of the error. The MSE for an estimator, say θ̂, of a population parameter θ is defined as

MSE = E[(θ̂ - θ)^2].    (3.33)
I show in the Appendix that MSE = V(θ̂) + Bias^2, where the bias is the difference between the expected value and the parameter: Bias = E(θ̂) - θ. For this model, the bias is zero. So,

MSE = V(e_{T+1})    (3.34)

where e_{T+1} is the forecast error. Since Y_{T+1} and Y_T(1) are independent, you have

MSE = V(e_{T+1})    (3.35)
    = V[Y_{T+1} - Y_T(1)]    (3.36)
    = V(Y_{T+1}) + V[Y_T(1)]    (3.37)
    = σ_ε^2 + σ_ε^2/T    (3.38)
    = (1 + 1/T) σ_ε^2    (3.39)
    = ((T + 1)/T) σ_ε^2.    (3.40)
As T gets larger (i.e., you have more data), the method estimates μ with higher precision and the MSE asymptotically approaches σ_ε^2. You can now develop 95% confidence intervals for a forecast. Assume that the forecast errors are normally distributed with zero mean and constant variance σ_e^2. Then,

Pr(-1.96σ_e ≤ e_{T+1} ≤ 1.96σ_e) = 0.95.    (3.41)

Since e_{T+1} = Y_{T+1} - Y_T(1) and σ_e^2 = ((T + 1)/T) σ_ε^2, you get

Pr[Y_T(1) - 1.96 √((T + 1)/T) σ_ε ≤ Y_{T+1} ≤ Y_T(1) + 1.96 √((T + 1)/T) σ_ε]    (3.42)

for the 95% CI or prediction interval. You would use the sample standard deviation of the time series to implement this prediction interval. See Gilchrist (1976, p. 46). The h-step ahead prediction interval is

Pr[Y_T(h) - 1.96 √((T + h)/T) σ_ε ≤ Y_{T+h} ≤ Y_T(h) + 1.96 √((T + h)/T) σ_ε].    (3.43)
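These formulas are straightforward to implement. The following sketch (my own illustration with made-up data) computes the constant mean forecast and the Eq. (3.43) interval, estimating σ_ε with the sample standard deviation as suggested above:

```python
import numpy as np

def constant_mean_forecast(y, h=1, z=1.96):
    """h-step ahead forecast under the constant mean model.

    Point forecast: the sample mean (Eqs. 3.18-3.19). 95% prediction
    interval: +/- z * sqrt((T + h)/T) * sigma (Eq. 3.43), with sigma
    estimated by the sample standard deviation.
    """
    y = np.asarray(y, dtype=float)
    T = len(y)
    point = y.mean()          # Y_T(h) = Ybar for all h >= 1
    sigma = y.std(ddof=1)     # sample estimate of sigma_epsilon
    half = z * np.sqrt((T + h) / T) * sigma
    return point, point - half, point + half

monthly_sales = [102, 98, 101, 99, 103, 97, 100, 100]  # hypothetical
point, lo, hi = constant_mean_forecast(monthly_sales, h=1)
```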
3.7 Random Walk Model

A model that is a logical extension of the constant mean model is the random walk model. This is not a forecasting model per se, but it is very important in the context of advanced forecasting methods such as the autoregressive model and the associated ARIMA model that I will discuss in Chap. 4. I will introduce this model in this section and then refer to it later in this book.
3.7.1 Basic Random Walk Model

A random walk model is the naive model (NF1) with a stochastic element:

Y_t = Y_{t-1} + ε_t    (3.44)
where the stochastic term is white noise, so Cov(ε_t, ε_{t-1}) = 0. Starting from Y_0, the epoch, the series evolves as

Y_0 = Y_0    (3.45)
Y_1 = Y_0 + ε_1    (3.46)
Y_2 = Y_1 + ε_2 = Y_0 + ε_1 + ε_2    (3.47)
⋮    (3.48)
Y_t = Y_0 + ε_1 + ε_2 + ... + ε_t    (3.49)
    = Y_0 + Σ_{i=0}^{t-1} ε_{t-i}.    (3.50)
Based on (3.50), "today" is just the evolution of all the noise terms. This is the basis of a stock market. See Malkiel (1999). If ε_t is white noise following a normal distribution, ε_t ~ N(0, σ_ε^2), then Y_t is also normally distributed by the Reproductive Property of Normals. See Paczkowski (2022b). Using the backshift operator, B, and assuming Y_0 is in the infinite past, you get

Y_t = Y_{t-1} + ε_t    (3.51)
    = BY_t + ε_t    (3.52)
    = (1 - B)^{-1} ε_t.    (3.53)

Recognizing that (1 - B)^{-1} = 1 + B + B^2 + ..., this becomes

Y_t = ε_t + ε_{t-1} + ...    (3.54)
    = ε_t Σ_{i=0}^∞ B^i    (3.55)
where the index, i = 0, is today or the current period. The infinite sum could either converge or diverge. You certainly want it to converge. You can truncate this infinite sum by just going back to a finite past, Y_0, for practical purposes. Truncation ensures that the sum converges. Note that you can make Y_0 so far back in time (i.e., the infinite past) that it can be safely ignored. Assume that the infinite sum in (3.55) is truncated at time t so that the sum converges and the series started at Y_0. Then you only go back to the epoch and have

Y_t = Y_0 + ε_t Σ_{i=0}^{t-1} B^i.    (3.56)
You can now easily find the mean and variance. For the mean, notice that

E(Y_t) = E(Y_0) + E(ε_t) Σ_{i=0}^{t-1} B^i    (3.57)
       = Y_0    (3.58)

since Σ_{i=0}^{t-1} B^i converges because it is a finite sum. For the variance, you have

V(Y_t) = V(Y_0) + V(ε_t) Σ_{i=0}^{t-1} B^{2i}    (3.59)
       = σ_ε^2 Σ_{i=0}^{t-1} B^{2i}    (3.60)
       = tσ_ε^2    (3.61)
since σ_ε^2 is a constant and Y_0 is non-stochastic so V(Y_0) = 0.⁴ Notice that the variance gets bigger as time passes because of the t multiplier. The above is just a description of the random walk model. How is it used to forecast? What form will those forecasts take and how are they derived? Let Y = ..., Y_{T-1}, Y_T be the given history which you know. For ease of notation, I will omit this history later, but for now, I will note it for clarity. You should recognize the following:

• Y_T(1) is composed of Y_T plus ε_{T+1};
• Y_T is known so it is not a random variable;
• the actual Y_{T+1} is a random variable because it has not occurred yet at time T;
• ε_{T+1} is a random variable;
• and therefore the one-step ahead forecast, Y_T(1), is a random variable.

⁴ The backshift operator is squared because of the property of variances. In addition, each squared backshift is applied to a constant (i.e., σ_ε^2) so it is effectively equal to 1. Finally, the summation involves t terms.
What is the one-step ahead forecast? It is Y_T(1) = Y_T + ε_{T+1}. But you do not know ε_{T+1}, just its (assumed) expected value, E(ε_{T+1}), which is zero. So, the expected value of the one-step ahead forecast, Y_T(1), conditioned on its history, Y, is

E[Y_T(1) | Y] = E(Y_T + ε_{T+1} | Y)    (3.62)
              = E(Y_T | Y) + E(ε_{T+1} | Y)    (3.63)
              = Y_T + E(ε_{T+1} | Y)    (3.64)
              = Y_T.    (3.65)
See Gilchrist (1976). The expected number in the next period is just the current actual. This says that the best guess for tomorrow is the value today. This is like the NF1, but history is now considered through the conditional. I will now drop the Y notation since its function is clear, but remember that the forecasts are conditioned on this history. The variance of Y_T(1) is

V[Y_T(1)] = V(Y_T + ε_{T+1})    (3.66)
          = V(Y_T) + V(ε_{T+1})    (3.67)
          = 0 + V(ε_{T+1})    (3.68)
          = σ_ε^2.    (3.69)
If ε is normally distributed white noise, then you have

Y_T(1) ~ N(Y_T, σ_ε^2)    (3.70)

by the Reproductive Property of Normals. A 95% one-step ahead prediction interval is

Y_T ± 1.96σ_ε.    (3.71)
Now extend the forecast to two steps ahead. Remember that the forecasts are still made in period T, just for two periods ahead. Then,

E[Y_T(2)] = E[Y_T(1) + ε_{T+2}]    (3.72)
          = E(Y_T + ε_{T+1} + ε_{T+2})    (3.73)
          = E(Y_T) + E(ε_{T+1}) + E(ε_{T+2})    (3.74)
          = Y_T + E(ε_{T+1}) + E(ε_{T+2})    (3.75)
          = Y_T.    (3.76)
The mean is still Y_T, an NF1-like result, so the forecast is just the current value, Y_T. The variance is

V[Y_T(2)] = V(Y_T) + V(ε_{T+1}) + V(ε_{T+2})    (3.77)
          = 2σ_ε^2.    (3.78)

In general, the variance of successively longer forecast horizons is hσ_ε^2. See the Appendix for a proof. So, E[Y_T(h)] = Y_T, ∀h ≥ 1 and V[Y_T(h)] = hσ_ε^2, ∀h ≥ 1. The 95% h-step ahead prediction interval is

Y_T ± 1.96 √h σ_ε.    (3.79)
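Under the random walk model, then, the forecast machinery reduces to the last actual plus a widening interval. A sketch (my own; I estimate σ_ε from the first differences, since ε_t = Y_t - Y_{t-1}):

```python
import numpy as np

def random_walk_forecast(y, h, z=1.96):
    """h-step ahead random walk forecast: the last actual (Eq. 3.65),
    with 95% prediction interval Y_T +/- z * sqrt(h) * sigma (Eq. 3.79)."""
    y = np.asarray(y, dtype=float)
    sigma = np.diff(y).std(ddof=1)  # epsilon_t = Y_t - Y_{t-1}
    half = z * np.sqrt(h) * sigma
    return y[-1], y[-1] - half, y[-1] + half

history = [0.0, 1.0, 3.0, 2.0, 4.0]  # hypothetical actuals
point, lo, hi = random_walk_forecast(history, h=4)
```

Note that the interval grows with √h while the point forecast stays flat at Y_T.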
3.7.2 Random Walk with Drift

Now consider the following random walk model:

Y_t = Y_{t-1} + ε_t + δ    (3.80)

where δ is a constant. This is a random walk with drift model. The δ is a drift parameter that pulls the series upward through time if δ > 0, and downward otherwise. I will provide an example of a random walk with drift in Chap. 10 when I discuss simulations. Starting at the epoch, Y_0, as before, you get via successive substitutions

Y_0 = Y_0    (3.81)
Y_1 = Y_0 + ε_1 + δ    (3.82)
Y_2 = Y_0 + ε_1 + ε_2 + 2δ    (3.83)
⋮    (3.84)
Y_t = Y_0 + ε_1 + ε_2 + ... + ε_t + tδ.    (3.85)
The series keeps drifting by δ. You can also see this using the backshift operator with Y_0 in the infinite past:

Y_t = BY_t + ε_t + δ    (3.86)
    = (1 - B)^{-1} ε_t + (1 - B)^{-1} δ    (3.87)
    = ε_t Σ_{i=0}^∞ B^i + δ Σ_{i=0}^∞ B^i.    (3.88)

Truncating the summations to period t and letting the epoch not be in the infinite past, you have

Y_t = Y_0 + ε_t Σ_{i=0}^{t-1} B^i + δ Σ_{i=0}^{t-1} B^i    (3.89)
    = Y_0 + ε_t Σ_{i=0}^{t-1} B^i + tδ.    (3.90)
The mean and variance are then

E(Y_t) = Y_0 + tδ    (3.91)
V(Y_t) = tσ_ε^2.    (3.92)

The expected value keeps drifting, whereas before it was constant at Y_0. The variance increases over time. Because both the expected value and variance constantly increase by the drift factor, this series is said to be nonstationary. I will discuss non-stationarity in Chap. 4. The one-step ahead forecast, Y_T(1), is the expected value of Y_T(1):

E[Y_T(1)] = E[Y_T + ε_{T+1} + (T + 1)δ]    (3.93)
          = E(Y_T) + E(ε_{T+1}) + E[(T + 1)δ]    (3.94)
          = Y_T + E(ε_{T+1}) + (T + 1)δ    (3.95)
          = Y_T + (T + 1)δ.    (3.96)
So, the expected number in the next period is just the current actual, Y_T, plus the drift for one more period. The variance of Y_T(1) is

V[Y_T(1)] = V[Y_T + ε_{T+1} + (T + 1)δ]    (3.97)
          = V(Y_T) + V(ε_{T+1}) + V[(T + 1)δ]    (3.98)
          = Tσ_ε^2 + σ_ε^2 + 0    (3.99)
          = (T + 1)σ_ε^2.    (3.100)

Remember that Y_T is an actual, so it is not stochastic. If ε is normally distributed white noise, then,

Y_T(1) ~ N(Y_T + (T + 1)δ, (T + 1)σ_ε^2)    (3.101)

by the Reproductive Property of Normals. A 95% prediction interval is simply

Y_T + (T + 1)δ ± 1.96 √(T + 1) σ_ε.    (3.102)
Now extend the forecast to two steps ahead. Then,

E[Y_T(2)] = E[Y_T(1) + ε_{T+2} + (T + 2)δ]    (3.103)
          = E[Y_T + ε_{T+1} + ε_{T+2} + (T + 2)δ]    (3.104)
          = E(Y_T) + E(ε_{T+1}) + E(ε_{T+2}) + E[(T + 2)δ]    (3.105)
          = Y_T + E(ε_{T+1}) + E(ε_{T+2}) + (T + 2)δ    (3.106)
          = Y_T + (T + 2)δ.    (3.107)

So, the forecast is still the last actual value plus drift. The variance of the two-step ahead forecast is

V[Y_T(2)] = V[Y_T(1) + ε_{T+2} + (T + 2)δ]    (3.108)
          = V[Y_T(1)] + σ_ε^2    (3.109)
          = (T + 1)σ_ε^2 + σ_ε^2    (3.110)
          = (T + 2)σ_ε^2.    (3.111)

Note that the forecast is flat with spreading variance. The forecast of successively longer forecast horizons, h ≥ 1, is just the current value plus drift, Y_T + (T + h)δ, and the variance is (T + h)σ_ε^2. I provide a proof of this last variance in the Appendix.
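A quick simulation makes the drift visible. This sketch (my own; the values of δ and σ_ε are arbitrary) generates a drifting path by accumulating Eq. (3.80):

```python
import numpy as np

def simulate_rw_drift(y0, delta, sigma, T, seed=42):
    """Simulate Y_t = Y_{t-1} + epsilon_t + delta (Eq. 3.80) for T periods."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, T)
    return y0 + np.cumsum(eps + delta)  # each step adds epsilon_t + delta

path = simulate_rw_drift(y0=100.0, delta=0.5, sigma=2.0, T=250)
```

With δ > 0 the path is pulled upward on average by δ per period, while its variance around that trend grows linearly with t, matching (3.91)-(3.92).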
3.8 Simple Moving Averages Model

The constant mean model is only good for short periods. It is unlikely that the mean will be constant from one "locality in time" to another. A locality is one or more periods as a constant chunk of time, say the first five months, the second five months, etc., of a time series. A locality is a window of size m observations. Five-month chunks are m = 5.
The averaging of all the data in the constant mean model had the effect of reducing random variation, leaving an estimate of μ. If this estimate is required for just one window, then you simply average the data in that window and ignore the rest. But a focus on only one window is highly unlikely. You will typically be interested in many windows. The mean, μ, is often viewed as a slowly varying quantity that can be seen through the window if you move the window through the series by single incremental periods. Finding an estimate of μ in multiple windows is the moving averages method. The moving averages method is the most popular one for smoothing time series because of its simplicity. It is based on averaging the values inside a window as you slide it over the series through time. You place a window of size m over a set of values and calculate the average of those m values. The m-term moving average is

Ȳ_{t,m} = (Y_t + Y_{t-1} + ... + Y_{t-(m-1)})/m    (3.112)
        = (1/m) Σ_{i=0}^{m-1} Y_{t-i}.    (3.113)
You slide the window over one period to cover a new set of m values and calculate a new average. You continue this sliding-window process until all the values have been covered. This smooths historical data in which the effect of seasonality, if this is a factor, and randomness, which is always a factor, are eliminated and reduced, respectively. See Levenbach and Cleary (2006, p. 42). You could set m to any size for smoothing. The larger you set it, the greater the smoothing. If m = n, where n is the sample size, then there is only one window and one estimate of μ which is, of course, Ȳ. This is back to the constant mean model. This method is good for forecasts of one or two periods ahead in which trend or seasonal patterns can be ignored. This is then an effective, readily understood, and practical method. I illustrate the process in Fig. 3.3. I use the Pandas rolling method for the calculations. The argument is the number of periods to include in a calculation. The method output is passed to the mean function to complete the calculation of the moving average. Using the (fictitious) sales data in Fig. 3.3, the estimate of μ for the most recent m = 3 months, periods 10-12, is the sample average of 7.0 (= (8 + 7 + 6)/3). I randomly generated the "Sales" data using a random number generator. I will discuss this generator and others in Chap. 9. The window average has to be placed somewhere relative to the data used for the window. Two options are the center of the window or at the end next to the window's last observation. In Fig. 3.3, the last window covers values (8, 7, 6). The average of 7.0 is placed next to the last window value of 6. Centering would have placed it next to the 7 in the window. Most software packages default to non-centering. Pandas defaults to non-centering. A one-step ahead forecast made at t = T based on a simple moving average is found by setting the forecast equal to the value of the moving average at time
3.8 Simple Moving Averages Model
Fig. 3.3 This is an example of a simple moving average of fictitious monthly sales data. The window size is \(m = 3\). The Pandas method rolling, which I used here, has a required argument of the window size. A default argument, center, which is not shown, indicates whether or not to center the calculated window mean; the default is not to center, or center = False. The sales numbers were randomly generated
3 Information Extraction: Basic Time Series Methods
\(t = T\): \(\hat{Y}_T(1) = \bar{Y}_{T,m}\). It is the last smoothed value based on the last m actuals—you cannot do any more calculations. So, the one-step ahead forecast, \(\hat{Y}_T(1)\), is a moving average based on a simple average of the current period's value, \(Y_T\), and the previous \(m - 1\) values. The h-step ahead forecast, \(\hat{Y}_T(h)\), is the repetition of the one-step ahead forecast: \(\hat{Y}_T(h) = \hat{Y}_T(1) = \bar{Y}_{T,m}\ \forall h \ge 1\). You can now forecast period 13: \(\hat{Y}_{12}(1) = 7\). You cannot calculate any more averages to forecast beyond period 13 since there are no more actuals past period 12. The best you could do is repeat the last average, but this is now equivalent to a constant mean, naive model forecast. If \(\mu\) wanders, then this forecast is biased because the forecast is always constant.
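The rolling-window calculation and the flat one-step ahead forecast can be sketched with pandas. The last three sales values (8, 7, 6) match the example in the text; the earlier values are hypothetical stand-ins since the book's series was randomly generated.

```python
import pandas as pd

# Hypothetical monthly sales; the last three values (8, 7, 6) match the
# text's example, so the final window average is 7.0
sales = pd.Series([5, 6, 4, 7, 5, 6, 8, 7, 9, 8, 7, 6], name="sales")

# The rolling method's required argument is the window size; by default
# the result is not centered (center=False), matching the text
m = 3
sma = sales.rolling(m).mean()

# One-step ahead forecast: the last smoothed value, i.e., the average of
# the final m actuals. Every further step ahead repeats this value.
one_step = sma.iloc[-1]
print(one_step)  # (8 + 7 + 6) / 3 = 7.0
```

The first \(m - 1\) entries of the smoothed series are missing (NaN) because a full window is not yet available.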
3.8.1 Weighted Moving Average Model

The simple moving average is often said to be unweighted. It is actually weighted, but the weights are constant so that each value of the historical data series has the same weight. For a simple average of n observations, the weights are \(1/n\). Then,

\[ \bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n} \tag{3.114} \]
\[ = \frac{1}{n}Y_1 + \frac{1}{n}Y_2 + \cdots + \frac{1}{n}Y_n \tag{3.115} \]
\[ = w_1 Y_1 + w_2 Y_2 + \cdots + w_n Y_n \tag{3.116} \]
\[ = \sum_{i=1}^{n} w_i Y_i \tag{3.117} \]
where \(w_i = 1/n\), \(\sum_{i=1}^{n} w_i = 1\), are the weights. The weights are said to be normalized. A weight profile is the vector of n terms \(W_n = (1/n, 1/n, \ldots, 1/n) \in \mathbb{R}^n\). For a moving average with window size m, the weights are \(\omega_i = 1/m\) and \(\sum_{i=1}^{m} \omega_i = 1\).

For short-term forecasting using daily or weekly data, the most recent historical period is usually the most informative, so you typically want to weight the most recent values more heavily; the most recent are more important. With conventional weights, the weights are all the same; with importance weights, they differ. The weighted m-term moving average is

\[ \bar{Y}_{t,m} = \omega_1 Y_t + \omega_2 Y_{t-1} + \cdots + \omega_m Y_{t-(m-1)} \tag{3.118} \]
\[ = \sum_{i=0}^{m-1} \omega_{i+1} Y_{t-i} \tag{3.119} \]

with \(\omega_1 > \omega_2 > \cdots > \omega_m\) and \(\sum_{i=1}^{m} \omega_i = 1\).
Sometimes, the weights do not sum to 1.0. In this case, merely divide each weight \(\omega_i\) by \(\sum \omega_i\) to get the normalized weight. The h-step ahead forecast, \(\hat{Y}_T(h)\), based on a weighted moving average is given by a repetition of the one-step ahead forecast: \(\hat{Y}_T(h) = \hat{Y}_T(1)\) for \(h \ge 1\).

If all the weight is placed on the last or most recent observation, then the average is simply that observation. The weight profile for \(m = 3\) is \(W_3 = (0, 0, 1)\). The one-step ahead forecast is just the last observation, which is the NF1 model. The NF1 is a special case of the weighted moving average forecast model.
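A minimal sketch of a weighted moving average, using importance weights that I made up for illustration (they are not from the text):

```python
import numpy as np

# Hypothetical importance weights for an m = 3 weighted moving average,
# most recent value weighted most heavily; already normalized to sum to 1
weights = np.array([0.5, 0.3, 0.2])   # omega_1 (current), omega_2, omega_3

# Last three actuals, most recent first: Y_t, Y_{t-1}, Y_{t-2}
recent = np.array([6.0, 7.0, 8.0])

# Weighted m-term moving average, Eq. (3.118):
# omega_1*Y_t + omega_2*Y_{t-1} + omega_3*Y_{t-2}
wma = float(weights @ recent)

# Un-normalized weights are normalized by dividing by their sum
raw = np.array([3.0, 2.0, 1.0])
normalized = raw / raw.sum()
```

With these weights the average is 0.5(6) + 0.3(7) + 0.2(8) = 6.7, pulled toward the most recent value compared with the simple average of 7.0.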
3.8.2 Exponential Averaging

The one-step ahead forecast based on the simple moving average can be written as

\[ \hat{Y}_T(1) = \frac{Y_T + Y_{T-1} + \cdots + Y_{T-(m-1)}}{m} \tag{3.120} \]
\[ = \frac{Y_T}{m} + \frac{Y_{T-1} + \cdots + Y_{T-(m-1)} + Y_{T-m}}{m} - \frac{Y_{T-m}}{m} \tag{3.121} \]
\[ = \frac{Y_T}{m} + \hat{Y}_{T-1}(1) - \frac{Y_{T-m}}{m}. \tag{3.122} \]
Suppose you have only two pieces of information: the most recently observed value, \(Y_T\), and the one-step ahead forecast for that same period, \(\hat{Y}_{T-1}(1)\). In lieu of the observed value for period \(T - m\), \(Y_{T-m}\), which I assume you do not have, you could use an approximation, which is a forecast of \(Y_{T-m}\): \(\hat{Y}_{T-m-1}(1)\). But the only forecast you have is \(\hat{Y}_{T-1}(1)\), so you can use this. See Wheelwright and Makridakis (1980, p. 62). The forecast \(\hat{Y}_T(1)\) is now
\[ \hat{Y}_T(1) = \hat{Y}_{T-1}(1) - \frac{Y_{T-m}}{m} + \frac{Y_T}{m} \tag{3.123} \]
\[ = \frac{Y_T}{m} + \hat{Y}_{T-1}(1) - \frac{\hat{Y}_{T-1}(1)}{m} \tag{3.124} \]
\[ = \frac{1}{m}\,Y_T + \left(1 - \frac{1}{m}\right)\hat{Y}_{T-1}(1). \tag{3.125} \]
Let \(\alpha = \frac{1}{m}\), so \(0 < \alpha < 1\). Then you have the general form of the equation for forecasting by the method of exponential smoothing or exponential averaging:

\[ \hat{Y}_T(1) = \alpha Y_T + (1-\alpha)\,\hat{Y}_{T-1}(1) \tag{3.126} \]
where \(Y_T\) is today's actual and \(\hat{Y}_{T-1}(1)\) is the forecast of today's actual based on yesterday's actual. The one-step ahead forecast is the weighted average of the last
actual and the forecast of that actual. These weights sum to 1.0. You only need the most recent observation, the most recent one-step ahead forecast of "today," \(\hat{Y}_{T-1}(1)\), and a value for \(\alpha\), the weight placed on "today." You need \(0 < \alpha < 1\), which you can specify or estimate. Experience has shown that good values for \(\alpha\) are between 0.10 and 0.30. See NIST (2012).5 As a general rule, smaller weights are appropriate for series with a slowly changing trend, while larger weights are appropriate for volatile series with a rapidly changing trend.

You can "estimate" \(\alpha\) if you wish by repeatedly trying different values (typically \(0.1, 0.2, \ldots, 0.9\)), checking some error statistic such as the MSE, and then choosing the value of \(\alpha\) that gives the best value for that statistic (e.g., minimum MSE). This is a grid search. See Wheelwright and Makridakis (1980). Also, see Hyndman and Athanasopoulos (2021) for a good online discussion of exponential smoothing. Especially see Hyndman et al. (2008) for a thorough development of the exponential smoothing method for forecasting.

Regardless of how you determine \(\alpha\), it is a parameter that must be specified; it is not estimated from data. Consequently, it is a hyperparameter: a parameter set by the analyst rather than estimated from the data. It could, however, be the result of simulations, as I note in Chap. 8.

Since (3.126) has a lagged value of the form \(\hat{Y}_{T-1}(1)\), you can expand it by backward substitution to get

\[ \hat{Y}_T(1) = \alpha Y_T + (1-\alpha)\,\hat{Y}_{T-1}(1) \tag{3.127} \]
\[ = \alpha Y_T + \alpha(1-\alpha)Y_{T-1} + (1-\alpha)^2\,\hat{Y}_{T-2}(1). \tag{3.128} \]
Continuing backward, you get

\[ \hat{Y}_T(1) = \alpha Y_T + \alpha(1-\alpha)Y_{T-1} + \alpha(1-\alpha)^2 Y_{T-2} \tag{3.129} \]
\[ \qquad + \alpha(1-\alpha)^3 Y_{T-3} + \cdots \tag{3.130} \]
so the current one-step ahead forecast is a weighted average of past actuals. Since \(0 < \alpha < 1\), then also \(0 < 1-\alpha < 1\). Therefore, the weights \(\alpha\), \(\alpha(1-\alpha)\), \(\alpha(1-\alpha)^2\), etc., have decreasing magnitude: the further back you go, the less the weight; the past has less relevance for the future. A weight is a geometric (exponential) function of the number of periods the observation is in the past, hence the name "exponential averaging." I show three possible patterns for weights in Fig. 3.4. You can use the backshift operator with slightly modified notation for a more compact expression:

\[ \hat{Y}_T(1) = \alpha(1-\alpha)^0 B^0 Y_T + \alpha(1-\alpha)^1 B^1 Y_T \tag{3.131} \]
\[ \qquad + \alpha(1-\alpha)^2 B^2 Y_T + \alpha(1-\alpha)^3 B^3 Y_T + \cdots \tag{3.132} \]
\[ = \alpha Y_T \sum_{i=0}^{\infty} (1-\alpha)^i B^i. \tag{3.133} \]

Clearly, you need \(\sum_{i=0}^{\infty} (1-\alpha)^i B^i = K < \infty\) as a condition for convergence.

Fig. 3.4 This is a graph of three values for the exponential smoothing weights \(\alpha(1-\alpha)^x\) for \(x = 0, 1, 2, 3, 4\)

Now consider a two-step ahead forecast. This is given by

\[ \hat{Y}_T(2) = \alpha \hat{Y}_T(1) + \alpha(1-\alpha)Y_T + \alpha(1-\alpha)^2 Y_{T-1} + \cdots \tag{3.134} \]

5 Especially see the section on DataPlot at https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/exposmoo.htm. Last accessed February 11, 2023.
You can show that \(\hat{Y}_T(2) = \hat{Y}_T(1)\), so the forecast for any number of steps ahead is "flat." See the Appendix for this result. Also, see Hyndman and Athanasopoulos (2023). Therefore, you have for any h-step ahead forecast, \(h \ge 1\), \(\hat{Y}_T(1) = \hat{Y}_T(2) = \cdots = \hat{Y}_T(h)\).
The general weight term is

\[ w_\tau = \alpha(1-\alpha)^\tau \tag{3.135} \]

where \(\tau = 0, 1, 2, 3, \ldots\). The weights \(w_\tau\) sum to 1.0. A simple proof is in the Appendix.

You can also write a recurrence equation in terms of forecast errors: \(e_T = Y_T - \hat{Y}_{T-1}(1)\). This recurrence form allows you to update a forecast with new data with minimal effort. This is not necessary with modern software and technology, but the steps are instructive. Therefore,

\[ \hat{Y}_T(1) = \alpha Y_T + (1-\alpha)\hat{Y}_{T-1}(1) \tag{3.136} \]
\[ = \alpha[\hat{Y}_{T-1}(1) + e_T] + (1-\alpha)\hat{Y}_{T-1}(1) \tag{3.137} \]
\[ = \hat{Y}_{T-1}(1) + \alpha e_T. \tag{3.138} \]
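The error-correction update can be sketched in a few lines; the data and the starting forecast are hypothetical:

```python
# Simple exponential smoothing via the error-correction form,
# Eq. (3.138): new forecast = old forecast + alpha * last error.
# The data values and seed forecast here are illustrative.
alpha = 0.2
actuals = [5.0, 6.0, 4.0, 7.0, 5.0]

forecast = actuals[0]  # one common seed: start at the first actual
for y in actuals:
    error = y - forecast              # e_T = Y_T - prior one-step forecast
    forecast = forecast + alpha * error

# Algebraically the same as alpha*Y_T + (1 - alpha)*prior forecast,
# Eq. (3.126); each new error "alerts" the forecast
print(forecast)
```

The loop makes the alertness point concrete: a large error moves the forecast a lot, a small error barely at all, with \(\alpha\) setting the fraction.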
The new forecast is the prior one-step ahead forecast plus a fraction of the last error made in your forecast. Gilchrist (1976, p. 55) notes that this shows that this forecast method is always "alert" for changes in the data, changes revealed by the errors.

Finally, using (3.133),

\[ E[\hat{Y}_T(1)] = \alpha\,E(Y_T) \sum_{i=0}^{\infty} (1-\alpha)^i B^i \tag{3.139} \]
\[ = \alpha\mu \sum_{i=0}^{\infty} (1-\alpha)^i \tag{3.140} \]
\[ = \mu \tag{3.141} \]

since \(\alpha\sum_{i=0}^{\infty} (1-\alpha)^i = 1\). See the Appendix for this summation result. So, exponential smoothing leads to an unbiased forecast if the mean is constant at \(\mu\). The mean \(\mu\) is a global mean. Further, you can show that

\[ V[\hat{Y}_T(1)] = \frac{(1-\alpha)(1+\alpha^T)}{(1+\alpha)(1-\alpha^T)}\,\sigma^2. \tag{3.142} \]
For large T and \(0 < \alpha < 1\), this becomes \(V[\hat{Y}_T(1)] = \frac{1-\alpha}{1+\alpha}\,\sigma^2\). This variance is different from the one for the constant mean model, which was shown to be \(\sigma^2/T\). The variance for the constant mean model goes to zero for long time series. For exponential smoothing, the variance approaches a minimum because as new data are added, the old data have less impact, so the "effective" amount of data remains constant. See Gilchrist (1976, p. 55). Also, if \(\alpha \to 1\), then \(V[\hat{Y}_T(1)] \to \sigma^2/T\). A plot of \(V[\hat{Y}_T(1)]\) against \(\alpha\) has a maximum of \(\sigma^2\) when \(\alpha = 0\) and bottoms out at \(\sigma^2/T\) when \(\alpha = 1\). This can be seen in Fig. 3.5.

As an example of an exponential smoothing forecast, consider the monthly sales data for the simple moving average in Fig. 3.3. I show the exponential smoothing
Fig. 3.5 This is the variance of the one-step ahead forecast as a function of \(\alpha\), for \(T = 100\)
fit for these data in Fig. 3.6 using \(\alpha = 0.20\) and the prediction for five steps ahead in Fig. 3.7. Rather than specifying the smoothing parameter as I did in Fig. 3.6, I could have estimated it using

fit01 = mod.fit(optimized=True)

in Step 2 in Fig. 3.6. This is my preferred or recommended approach unless you have prior knowledge or experience with the smoothing constant.

How is the forecast calculated? The process starts with an initial value. Since a lagged value is part of the model, an initial value at the epoch is needed to start the process. This is estimated by the statsmodels function and reported in the summary
Fig. 3.6 This is an example of a simple exponential smoothing fit of the fictitious monthly sales data in Fig. 3.3. I used \(\alpha = 0.20\) as the smoothing constant. This is just the smoothing. See Fig. 3.7 for the predictions
results. This is the value in period 0 for the actual as well as the prediction. Once this is known, the process is mechanical: the next predicted value, the one-step ahead value, is the prior actual times \(\alpha\) plus the prior predicted value times \(1-\alpha\). I illustrate the process in Fig. 3.8.
Fig. 3.7 This is an example of a simple exponential smoothing forecast of the fit in Fig. 3.6. Notice that the prediction is flat for each step ahead as shown in the text
3.9 Linear Trend Models

The ordinary least squares (OLS) regression model is typically introduced in a basic statistics course. This method fits a straight line through two data series, simply named X and Y. The Y is a dependent variable. In machine learning, it is also
Fig. 3.8 This illustrates how a prediction is calculated from one period to the next
referred to as the target variable. The X is an independent variable that accounts for the variation in Y, explains Y, or drives Y, all interchangeable expressions. In machine learning, this is referred to as a feature variable. The model relating Y to X is written as a linear-in-the-parameters equation:

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i. \tag{3.143} \]
The terms \(\beta_0\) and \(\beta_1\) are parameters of the model, a parameter being an unknown, constant, numeric characteristic of the population. The first parameter is the intercept of the straight line and the second is its slope. These two parameters are estimated from data, so they are not hyperparameters. The exponential smoothing has a hyperparameter, \(\alpha\), but OLS has none.

The last term in (3.143) is called a disturbance term, white noise, or simply noise, all terms I use interchangeably. Some refer to it as an error term, but I prefer "disturbance term" because it reflects the fact that the points are disturbed or pushed off the straight line; an error per se is not involved. Why are they pushed off the line? You may be able to identify a few causes, but ultimately you do not know. This is aleatoric uncertainty. The best you can say is that unknown random forces are responsible and are reflected in this term. Since these random factors are unknown, the best you can do is state several assumptions about them. These are the Classical Assumptions:

Normally Distributed: \(\epsilon_i \sim N, \forall i\)
Mean Zero: \(E(\epsilon_i) = 0, \forall i\)
Homoskedasticity: \(V(\epsilon_i) = \sigma^2, \forall i\)
Independence I: \(Cov(\epsilon_i, \epsilon_j) = 0, \forall i \ne j\)
Independence II: \(Cov(\epsilon_i, X_i) = 0, \forall i\)
Non-Stochasticity: \(X_i\) is non-stochastic and fixed in repeated samples.

The first three are compactly stated as \(\epsilon_i \sim N(0, \sigma^2), \forall i\). The Independence I assumption says that there is no relationship between any two disturbances. This is often stated as no autocorrelation. The Independence II assumption says there is no relationship between the disturbance term and the X. So, what determines \(\epsilon_i\)? Not the X. The non-stochasticity of \(X_i\) means that it is not a random variable. Using these assumptions, it is easy to see that

\[ E(Y_i) = \beta_0 + \beta_1 X_i + E(\epsilon_i) \tag{3.144} \]
\[ = \beta_0 + \beta_1 X_i. \tag{3.145} \]
The linear part of (3.143) is the mean of \(Y_i\); that is, \(\mu = \beta_0 + \beta_1 X_i\). This is the information component of data. You then have

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \tag{3.146} \]
\[ = E(Y_i) + \epsilon_i \tag{3.147} \]
\[ = Information + Noise \tag{3.148} \]

so the linear model is consistent with the data structure I introduced above. I will expand on this model in Chap. 4 where I discuss advanced time series predictive methods. For now, this section is just an introduction and a specialization to a particular linear model: one involving a time trend as the independent variable. That is, the X is time, represented as t. The model with a time trend is

\[ Y_t = \beta_0 + \beta_1 t + \epsilon_t. \tag{3.149} \]
Without the \(\beta_1 t\) term, you would have the constant mean model: \(Y_t = \beta_0 + \epsilon_t\). The time trend, t, can be either zero-based integers such as 0, 1, 2, etc. (or one-based: 1, 2, 3, etc.) or specific dates such as 2020, 2021, and 2022 for annual data. The simplest representation is the zero-based integers. The variable is interval-scaled, meaning you can change the variable's definition by adding a constant; the results are invariant to additive change. As an example, you could add 2020 to a zero-based trend variable for years and get 2020, 2021, and 2022, and thus the same estimation results.6
6 Technically, only the slope remains unchanged. The intercept changes to reflect the index change. The new intercept is \(\hat{\beta}_0^{New} = \bar{Y} - \hat{\beta}_1^{Old}\bar{X} - \hat{\beta}_1^{Old}\delta\), where \(\delta\) is the index change factor, which is 2020 in my example. See Kmenta (1971).
Fig. 3.9 This is an example of nonlinear data that can be linearized using the natural log transformation
Sometimes, more often than not, the data are not linear, so a linear trend model is inappropriate. The data could be exponential, such as what I show in Fig. 3.9, which shows the Standard & Poor's 500 Stock Index.7 The left panel shows an exponential growth curve, so clearly modeling these data with a linear trend is inappropriate. You could, however, still use the linear trend model if you use the natural log to transform the scale of the stock data (and just the stock data, not the dates themselves, since this would not make sense). The natural log transformation straightens, or linearizes, an exponential curve, as you can see in the right panel.

There are many ways to estimate the two parameters, but the single most important and most used is OLS. As noted by Paczkowski (2022b), this is a member of a family of methods called the Generalized Linear Model (GLM) family, which is quite large and flexible. The family members are connected by a function that is a linear function of the independent variables. This function is a link function.
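A quick numeric illustration of this linearization, with a made-up growth rate and starting level rather than the stock index data:

```python
import numpy as np

# An exponential series with a 5% instantaneous growth rate (illustrative
# parameters); in logs it becomes an exact straight line in t
t = np.arange(20)
y = 100.0 * np.exp(0.05 * t)

log_y = np.log(y)        # ln(y) = ln(100) + 0.05*t
diffs = np.diff(log_y)   # constant first differences => linear in t
print(diffs[:3])
```

The constant first differences of the logged series are the signature of linearity: a linear trend model fit to log_y would be appropriate even though one fit to y would not.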
7 The data were downloaded using the Python package yfinance. This package is available through pip using pip install yfinance.
The link function for each member of the family depends on the distribution of the target variable. For the OLS model in (3.143), \(E(Y_i)\) is the mean, so it is a linear function of the feature variable. It can also be shown that the variance of \(Y_i\) is \(\sigma^2\). Since \(\epsilon_i \sim N(0, \sigma^2)\) by the Classical Assumptions, then \(Y_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)\) by the Reproductive Property of Normals. In this case, the function linking the mean to the linear combination is called the Identity Link: the mean is identically equal to the linear combination. Two other link functions are the Logit Link for binary data and the Log Link for count data. See McCullagh and Nelder (1989), Dobson (2002), and Paczkowski (2022b, Chapter 10) for discussions.8

The linear trend model is applicable when the data follow a linear pattern over time. I illustrate one possibility in Fig. 3.10. I will now review how the parameters are estimated for an Identity Link.
3.9.1 Linear Trend Model Estimation

The \(\beta_0\) and \(\beta_1\) are estimated using sample data and a set of estimation formulas. The formulas are derived by minimizing the error sum of squares (SSE), where the errors, also called residuals, are the actual values of \(Y_i\) less their estimated values, \(\hat{Y}_i\):

\[ e_i = Y_i - \hat{Y}_i \tag{3.150} \]

where \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\). The \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the estimates for \(\beta_0\) and \(\beta_1\), respectively. The error sum of squares is \(SSE = \sum_{i=1}^{n} e_i^2\). The minimization process involves finding the first derivatives of SSE with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\), which results in two simultaneous equations called the normal equations.9 These are solved simultaneously to yield the following estimators:

\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X} \tag{3.151} \]
\[ \hat{\beta}_1 = \frac{\sum(Y_i - \bar{Y})(X_i - \bar{X})}{\sum(X_i - \bar{X})^2}. \tag{3.152} \]
The estimated linear trend model, with \(X_i = t\), is then

\[ \hat{Y}_t = \hat{\beta}_0 + \hat{\beta}_1 t. \tag{3.153} \]

8 Also see the Wikipedia article "Generalized Linear Model" at https://en.wikipedia.org/wiki/Generalized_linear_model. Last accessed November 5, 2021. This article lists several more link functions and the cases when they are used.
9 These equations have nothing to do with the normal distribution.
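The estimators (3.151) and (3.152) can be applied directly; the trend data below are constructed with a known intercept and slope so the estimates can be checked:

```python
import numpy as np

# Hypothetical trend data built from a known line, so the estimators
# (3.151)-(3.152) should recover the intercept (2.0) and slope (0.5)
t = np.arange(25, dtype=float)   # zero-based time trend
y = 2.0 + 0.5 * t                # no noise: estimates are exact

# Slope: sum of cross-deviations over sum of squared deviations of t
b1 = np.sum((y - y.mean()) * (t - t.mean())) / np.sum((t - t.mean()) ** 2)
# Intercept: mean of Y minus slope times mean of t
b0 = y.mean() - b1 * t.mean()
print(b0, b1)
```

With noisy data the same two lines give the least squares estimates rather than the exact parameters.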
Fig. 3.10 This is an example of some linear trend data
The procedure is the same when more than one feature is involved, but the setup is more extensive and involves matrix algebra. In that case, (3.151) and (3.152) emerge as special cases.

Several statistical measures can be derived to assess the quality of the fitted line. One is the \(R^2\), which shows the proportion of the variation in Y explained by X, or how much is explained by time in our current framework. As a proportion, you must have \(0 \le R^2 \le 1\). Values close to 1.0 are preferred: a large proportion is explained (Fig. 3.11).
Fig. 3.11 This is an example of estimating the parameters of the linear trend data in Fig. 3.10
Fig. 3.12 This is the ANOVA table for the regression in Fig. 3.11. I used the residual degrees of freedom in the prediction calculations in Fig. 3.15. This table is based on the decomposition of the target's variance into a part for trend and a part for residual (i.e., error). The mean sq. is the sum of squares divided by the df. The \(R^2\) is \(SS_{Trend}/(SS_{Trend} + SS_{Residual})\) and the F is \(MS_{Trend}/MS_{Residual}\)
A useful summary table is the Analysis of Variance (ANOVA) table, which I show in Fig. 3.12. This summarizes the quality of the fit.

There are four steps for estimating an OLS model in Python:

1. Define a formula—this is the specific model to estimate.
2. Instantiate the model—this is the basic setup for estimation.
3. Fit the instantiated model—this is the estimation stage.
4. Display a summary of the fitted model.
I illustrate these four steps in Fig. 3.11. The summary output shows that the slope for the trend variable is 0.1366 and that the fit of the line to the data is very good based on the \(R^2\) of 0.955.
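The four steps might look like the following, with hypothetical data and my own variable names (the figures use different data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical linear-trend data; df and reg01 are my names, not the book's
rng = np.random.default_rng(42)
df = pd.DataFrame({"trend": np.arange(30, dtype=float)})
df["sales"] = 2.0 + 0.5 * df["trend"] + rng.normal(0, 0.5, len(df))

formula = "sales ~ trend"          # Step 1: define a formula
mod = smf.ols(formula, data=df)    # Step 2: instantiate the model
reg01 = mod.fit()                  # Step 3: fit the instantiated model
print(reg01.summary())             # Step 4: display a summary
```

Keeping the fitted results in a named variable such as reg01 is what makes the later prediction step possible: the predict method is chained to that variable.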
3.9.2 Linear Trend Extension

You can extend the linear model to estimate growth rates. In this application, you first have to write the time trend model in an exponential, not a linear, form. But you can linearize it by using a natural log transformation so that all that I described above is applicable. A growth rate model is very important in many business applications because business quantities (e.g., sales, revenue, profits) grow over time. Knowing those growth rates is important for measuring the overall health and future direction of the business (i.e., increasing, stagnant, or decreasing).10
10 Applications are also in economics, demography, and biology, to name just a few areas where growth rates are important.
There are two ways to estimate a growth rate:

• A discrete calculation using a time series; and
• A regression estimation using a time series.

The first is a basic calculation that may be taught in a basic statistics course, especially one targeted to business majors, or in a basic financial analysis course. This relies on the geometric average of period-to-period growth rates. Suppose you have a time series of annual data, perhaps on sales: \(X_0, X_1, X_2, \ldots, X_T\). The growth from one year to the next is simply \((X_t/X_{t-1}) - 1\). The total growth from \(X_0\) to \(X_T\) can be written as

\[ \frac{X_T}{X_0} - 1 \tag{3.154} \]
\[ = \frac{X_1}{X_0}\,\frac{X_2}{X_1}\cdots\frac{X_T}{X_{T-1}} - 1 \tag{3.155} \]

and the geometric average of the period-to-period growth rates, g, is then

\[ g = \left[\prod_{t=0}^{T-1}\frac{X_{t+1}}{X_t}\right]^{\frac{1}{T-1}} - 1 \tag{3.156} \]
\[ = \left(\frac{X_T}{X_0}\right)^{\frac{1}{T-1}} - 1 \tag{3.157} \]
where the exponent, \(1/(T-1)\), reflects the fact that there are \(T-1\) ratios (sometimes called compounding ratios) in the product. The exponent is one less than the number of observations in the time series data set. Subtracting 1 reflects the fact that we want the growth rate, which is a net number.11 The calculation is simple: divide the last value by the first, raise the result to the \(1/(T-1)\) power, and subtract 1.

The second approach uses a regression model to estimate the growth rate. The growth rate is called an instantaneous growth rate in this case. The regression model is

\[ Y_t = \beta_0 e^{\beta_1 t}\epsilon_t. \tag{3.158} \]
I describe the basis for this model in this chapter's Appendix. I assume that the disturbance term is log-normally distributed. A random variable is log-normal if the natural log of that random variable is normally distributed. See Hill et al. (2008) for background on the log-normal distribution. This model is nonlinear. You can, however, linearize it using the natural log transformation. Then (3.158) becomes

\[ \ln(Y_t) = \ln(\beta_0) + \beta_1 t + \ln\epsilon_t \tag{3.159} \]

11 \(1 + g\) is the gross growth rate and g is the net growth rate.
where the usual OLS assumptions hold. In this linearized version, \(\beta_1 = g\) is the gross (instantaneous) growth rate.
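Both approaches can be sketched on a series constructed to grow by exactly 10% a year (all settings below are mine, for illustration):

```python
import numpy as np

# Annual series growing by exactly 10% a year (illustrative), so both
# methods should recover g = 0.10
X = 100.0 * 1.10 ** np.arange(6)   # X_0, ..., X_5

# Discrete calculation: last over first, raised to one over the number
# of period-to-period ratios, minus 1
n_ratios = len(X) - 1
g_discrete = (X[-1] / X[0]) ** (1.0 / n_ratios) - 1

# Regression estimate: the slope of ln(X) on a time trend is the
# instantaneous rate; exp(slope) - 1 converts it to a discrete rate
t = np.arange(len(X), dtype=float)
ln_x = np.log(X)
slope = np.sum((ln_x - ln_x.mean()) * (t - t.mean())) / np.sum((t - t.mean()) ** 2)
g_regression = float(np.exp(slope) - 1)
print(g_discrete, g_regression)
```

On real, noisy data the two estimates will differ: the discrete calculation only uses the endpoints, while the regression uses every observation.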
3.9.3 Linear Trend Prediction

Return to the linear trend model in (3.149). How do you predict the h-step ahead forecasts? That is, how do you find \(\hat{Y}_T(h)\)? This is easy. If T is the last time index, then simply insert the values \(T+1, T+2, \ldots, T+h\) into the estimated model, (3.153). This is a simple mechanical way to create the predictions.

A better way, and one that will be useful for all that follows, is to use the predict method associated with the fitted model. I show how you can do this in Fig. 3.13. The way I set up the prediction is to create a scenario. A scenario is what I expect to happen to the independent variable in the future. The expectations could be based on another forecast you develop, an outside agency forecast (e.g., of real GDP growth produced by a macroeconomic forecasting agency), or what management believes are the most likely future values for the independent variables. I will discuss scenarios in Chap. 8. For a linear trend model, the scenario is very simple: it is the next sequence of the time trend. The length of the sequence is merely the number of steps ahead, h, for the prediction. For this example, \(h = 5\).

The approach requires the scenario for the independent variable used in the regression model as well as that model itself. If you look at the regression output in Fig. 3.11, you will notice that I saved the regression results in a variable named reg01. I then used this variable in the code in Fig. 3.13 and chained to it the predict method. This method's argument is the scenario as a DataFrame. The predictions are returned and stored in another DataFrame, df_fct.

You can get a summary of the h-step ahead forecasts and prediction intervals, as I show in Fig. 3.14 for the predictions in Fig. 3.13. This summary is useful to check the prediction intervals. These intervals have two components, just as do confidence intervals:

• The projected value; and
• The standard error adjusted by a distribution quantile.

In Fig.
3.14, the projections are in the column labeled "mean." This is the value you get by inserting the independent variable (i.e., trend in this example) into the estimated model: \(mean_i = \hat{\beta}_0 + \hat{\beta}_1 Trend_i\), for \(i = 1, \ldots, h\). The column labeled "mean_se" is the standard error of the projection based on the formula

\[ mean\_se = \sqrt{\sigma^2\left[\frac{1}{n} + \frac{(X - \bar{X})^2}{SS_X}\right]} \tag{3.160} \]

where \(\sigma^2\) is the variance of the disturbance term (estimated by the mean square error of the residuals from the regression ANOVA table); X is the scenario value for
Fig. 3.13 This is an example of predicting the linear trend data in Fig. 3.10. The predictions are five steps ahead from the last actual
the independent variable; \(\bar{X}\) is the mean of the independent variable in the actuals DataFrame; and \(SS_X\) is the sum of squares of the independent variable (i.e., \(SS_X = \sum_{i=1}^{n}(X_i - \bar{X})^2\)). \(SS_X\) can be calculated from the variable's variance multiplied by its degrees of freedom: \(SS_X = (n-1)V(X)\). I show these calculations for \(h = 1\) in Fig. 3.15.

There are six columns in the summary table of Fig. 3.14. The distribution quantile is from a t-distribution with the residual degrees of freedom from the regression ANOVA table (23 in this case).
Fig. 3.14 These are the prediction intervals for the linear trend prediction I show in Fig. 3.13. The first column is labeled “mean,” but this is really the predicted value of the dependent variable. These are out-of-sample predictions
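A sketch of producing such a summary with statsmodels, on hypothetical data; the book's figures use different data, and summary_frame from get_prediction is one way to obtain the same columns:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trend data and scenario; names are mine
df = pd.DataFrame({"trend": np.arange(25, dtype=float)})
df["sales"] = 2.0 + 0.5 * df["trend"] + np.random.default_rng(1).normal(0, 0.4, 25)
reg01 = smf.ols("sales ~ trend", data=df).fit()

# Scenario: the next h = 5 values of the time trend
scenario = pd.DataFrame({"trend": np.arange(25.0, 30.0)})

# summary_frame reports mean, mean_se, and the obs_ci_lower/obs_ci_upper
# prediction limits discussed in the text
summary = reg01.get_prediction(scenario).summary_frame(alpha=0.05)
print(summary[["mean", "mean_se", "obs_ci_lower", "obs_ci_upper"]])
```

Notice in the output that mean_se grows as the scenario value moves farther from the mean of the actuals, exactly as Eq. (3.160) implies.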
Hill et al. (2008, Chapter 4) show that the variance of the forecast error, the difference between the actual (when it occurs) and its one-step ahead forecast at the scenario value, is

\[ V(f \mid Scenario) = \sigma^2 + \frac{\sigma^2}{n} + (X_{Scenario} - \bar{X})^2\,\frac{\sigma^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \tag{3.161} \]

where f is the forecast. This can be rewritten as

\[ V(f \mid Scenario) = \sigma^2 + \sigma^2\left[\frac{1}{n} + \frac{(X_{Scenario} - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]. \tag{3.162} \]
Fig. 3.15 These are the calculations for the prediction intervals in Fig. 3.14. I only show them for the one-step ahead forecast and for the lower prediction limits; the upper limits are obvious. The standard error for the prediction observation (ob_se) is not shown in Fig. 3.14 but is needed for the interval calculation
Also, see Kmenta (1971, Chapter 7) for a detailed derivation of this result. The square root of (3.162) is the standard error used in the calculation of the prediction interval. I show the calculation of (3.162) along with the prediction intervals in Fig. 3.15 for the one-step ahead prediction of our example. I show the standard error of the forecast in Fig. 3.15, but it is not in the Statsmodels output in Fig. 3.14. The prediction interval's lower and upper limits are labeled obs_ci_lower and obs_ci_upper, respectively. The prefix "obs" represents the "observed" scenario value. You can compare the results in Figs. 3.15 and 3.14.

The variance formula, (3.162), indicates several features of regression-based predictions. According to Kmenta (1971, Chapter 7), the variance will be smaller:
1. the larger is the sample size, n;
2. the smaller is the squared deviation of the scenario value from the mean of the independent variable; and
3. the larger is the variation in the independent variable (i.e., the larger is \(\sum_{i=1}^{n}(X_i - \bar{X})^2\)).

This regression model can be applied to the constant mean model of Sect. 3.6 by recognizing that the regression model is just a constant without a slope factor: \(Y = \beta_0 + \epsilon\). The sample mean is just the OLS solution to this constant mean model. You can see this by recognizing that the only normal equation is \(\hat{\beta}_0 = \bar{Y}\). With this in mind, you can define the regression estimation with just the constant term. In the Statsmodels formula statement, the constant is represented as 1: \(Y \sim 1\). I show the regression setup and estimation results in Fig. 3.16. Notice that the estimated constant term is the sample mean as reported in Fig. 3.2.

I show the predictions in Fig. 3.17. The scenario is very simple: there is only the constant and nothing else. The scenario is specified by creating a DataFrame with a single variable that has a None value for each of the h steps ahead you want for the forecast. In Python, a None value is just that: it is nothing; it is not a null value per se as null is defined in other languages. In those languages, null is defined as zero (0); this is not the case in Python. None has its own data class, NoneType, and it is the only member of this class.12
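A sketch of the constant mean model as an intercept-only regression, with hypothetical data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data; the intercept-only formula Y ~ 1 fits the constant
# mean model, so the estimated constant is the sample mean (6.0 here)
df = pd.DataFrame({"Y": [5.0, 6.0, 4.0, 7.0, 5.0, 6.0, 8.0, 7.0]})
reg = smf.ols("Y ~ 1", data=df).fit()
print(reg.params["Intercept"])   # equals df["Y"].mean() = 6.0

# Scenario: a DataFrame with a None placeholder for each of h = 3 steps;
# each predicted step repeats the sample mean
scenario = pd.DataFrame({"placeholder": [None] * 3})
fcst = reg.predict(scenario)
print(fcst)
```

Because the design matrix contains only the intercept, the scenario's column values are never used; only its row count, which sets the number of steps ahead, matters.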
3.10 Appendix

This Appendix provides some formal derivations and mathematical/statistical results that are useful background for the material I discussed in this chapter. I felt that it is better to present these results here rather than needlessly take up space and be a distraction in the chapter.
3.10.1 Reproductive Property of Normals

Assume that \(X_i \sim N(\mu_i, \sigma_i^2)\), \(i = 1, 2, \ldots, n\), and the \(X_i\) are all independent. Let \(Z = \sum_{i=1}^{n} X_i\). The Reproductive Property of Normals says that \(Z \sim N(\sum_{i=1}^{n}\mu_i, \sum_{i=1}^{n}\sigma_i^2)\). This holds whether the random variables are summed or differenced. For instance, \(Z = X_1 - X_2\) is normally distributed. So is the average of the random variables. See, for example, Yost (1984, p. 37).
12 For more on None, see the helpful discussion at https://stackoverflow.com/questions/3289601/referring-to-the-null-object-in-python. Last accessed July 14, 2022.
Fig. 3.16 This is the setup for the constant mean model of Sect. 3.6. Notice how the constant is specified in the formula statement
Fig. 3.17 This is the setup to predict with the constant mean model of Sect. 3.6. Notice how the scenario is specified using the None
3.10.2 Proof of MSE = \(V(\hat{\theta})\) + Bias\(^2\)

Define the mean square error for an estimator, \(\hat{\theta}\), of a parameter, \(\theta\), as

\[ MSE = E[(\hat{\theta} - \theta)^2]. \tag{3.163} \]

Rewrite the term in the square brackets by adding and subtracting \(E(\hat{\theta})\) and rearranging to get

\[ \hat{\theta} - \theta = [\hat{\theta} - E(\hat{\theta})] + [E(\hat{\theta}) - \theta]. \tag{3.164} \]

Now square both sides to get

\[ (\hat{\theta} - \theta)^2 = [\hat{\theta} - E(\hat{\theta})]^2 + [E(\hat{\theta}) - \theta]^2 + 2[\hat{\theta} - E(\hat{\theta})][E(\hat{\theta}) - \theta] \tag{3.165} \]

and then take the expectation of both sides. It is easy to show that the last term has an expected value of zero (all terms cancel), which leaves

\[ E[(\hat{\theta} - \theta)^2] = \underbrace{E[\hat{\theta} - E(\hat{\theta})]^2}_{Variance} + \underbrace{[E(\hat{\theta}) - \theta]^2}_{Squared\ Bias}. \tag{3.166} \]

The first term is the variance and the second is the bias squared.
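The decomposition can be checked numerically; the simulation settings below are mine:

```python
import numpy as np

# Simulation check of MSE = Variance + Bias^2 for a deliberately biased
# estimator of a mean; all settings here are illustrative
rng = np.random.default_rng(7)
theta, n, reps = 10.0, 25, 100_000

samples = rng.normal(theta, 2.0, size=(reps, n))
est = samples.mean(axis=1) + 0.5   # sample mean plus a known bias of 0.5

mse = np.mean((est - theta) ** 2)
var = est.var()                     # population-style variance (ddof=0)
bias_sq = (est.mean() - theta) ** 2
print(mse, var + bias_sq)           # the two quantities agree
```

With the sample moments used this way the identity holds exactly, up to floating-point rounding, not just approximately.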
3.10.3 Backshift Operator Result

I mentioned a useful property of the backshift operator in the text of this chapter. If B is the backshift operator defined as $BY_t = Y_{t-1}$ and $\phi$ is a constant, then $(1 - \phi B)^{-1} = 1 + \phi B + \phi^2 B^2 + \phi^3 B^3 + \ldots$. To see this, just do polynomial long division: dividing 1 by $1 - \phi B$ gives a first quotient term of 1 with remainder $\phi B$; dividing again gives the term $\phi B$ with remainder $\phi^2 B^2$; and so on, so that

$\frac{1}{1 - \phi B} = 1 + \phi B + \phi^2 B^2 + \ldots$  (3.167)

Suppose you have $\beta_0 (1 - \phi B)^{-1}$. Using the same polynomial long division method, and noting that the backshift of a constant is the constant itself, you can show that

$\beta_0 (1 - \phi B)^{-1} = \beta_0 + \phi \beta_0 + \phi^2 \beta_0 + \phi^3 \beta_0 + \ldots$  (3.168)
$= (1 + \phi + \phi^2 + \phi^3 + \ldots) \beta_0.$  (3.169)

One more generalization shows that

$(\beta_0 + \beta_1)(1 - \phi B)^{-1} = \beta_0 + \beta_1 + \phi \beta_0 + \phi \beta_1 + \phi^2 \beta_0 + \phi^2 \beta_1 + \phi^3 \beta_0 + \phi^3 \beta_1 + \ldots$  (3.170)
$= (1 + \phi + \phi^2 + \phi^3 + \ldots)(\beta_0 + \beta_1).$

I will use these results in Chap. 4.
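For a numeric sanity check of the expansion (with the operator applied to a constant, so that B drops out), assuming $|\phi| < 1$ and truncating the infinite series at a point where the tail is negligible:

```python
import numpy as np

phi, beta0 = 0.6, 3.0
powers = phi ** np.arange(200)   # 1, phi, phi^2, ... (truncated series)

# (1 - phi)^{-1} = 1 + phi + phi^2 + ... for |phi| < 1
print(round(powers.sum(), 6))            # 1 / (1 - 0.6) = 2.5
print(round((powers * beta0).sum(), 6))  # beta0 / (1 - 0.6) = 7.5
```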
3.10.4 Variance of h-Step Ahead Random Walk Forecast

I stated in the text that the variance of the h-step ahead forecast for a random walk is $h\sigma^2$. This can be proven by induction. Assume that $V[Y_T(1)] = \sigma^2$, which is also the variance of any one-step ahead forecast. Let $V[Y_T(h-1)] = (h-1)\sigma^2$, $h \geq 2$. Then,

$V[Y_T(h)] = V[Y_T(h-1)] + V[Y_{T+h-1}(1)]$  (3.171)
$= (h-1)\sigma^2 + \sigma^2$  (3.172)
$= h\sigma^2.$  (3.173)
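A short simulation illustrates the result: the error of the flat random-walk forecast over h steps is the sum of h future shocks, so its variance grows linearly in h (the σ and h below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)

sigma, h, reps = 1.5, 5, 200_000

# The h-step forecast error for a random walk is the sum of h future shocks
shocks = rng.normal(0, sigma, size=(reps, h))
errors = shocks.sum(axis=1)   # Y_{T+h} - Y_T

print(round(errors.var(), 2), h * sigma**2)  # both about 11.25
```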
3.10.5 Exponential Moving Average Weights

I will show two useful properties of the weights for the exponential moving average.

First Property

The first property is that the sum of the weights is 1.0: $\sum_{i=0}^{\infty} \alpha(1-\alpha)^i = 1$. To see this, write the sum as

$S = \alpha + \alpha(1-\alpha) + \alpha(1-\alpha)^2 + \ldots$  (3.174)
$= \alpha[1 + \beta + \beta^2 + \ldots]$ where $\beta = (1-\alpha)$  (3.175)
$= \alpha \frac{1}{1-\beta}$ based on polynomial long division  (3.176)
$= \frac{\alpha}{1 - (1-\alpha)}$ by substituting the definition of $\beta$  (3.177)
$= 1.$  (3.178)
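A quick numerical check of the First Property, truncating the infinite sum at a point where the remaining terms are negligible (α is an arbitrary choice):

```python
import numpy as np

alpha = 0.3
i = np.arange(500)
weights = alpha * (1 - alpha) ** i   # alpha(1-alpha)^i, i = 0, 1, 2, ...

print(round(weights.sum(), 10))  # 1.0
```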
Second Property

The second property is that the sum of the weights times the index of the term is $(1-\alpha)/\alpha$: $\sum_{i=0}^{\infty} \alpha(1-\alpha)^i i = (1-\alpha)/\alpha$. To see this, note that

$S = \sum_{i=0}^{\infty} \alpha(1-\alpha)^i i$  (3.179)
$= 0 + \alpha(1-\alpha) + 2\alpha(1-\alpha)^2 + 3\alpha(1-\alpha)^3 + \ldots$  (3.180)
$= \alpha[\beta + 2\beta^2 + \ldots]$ where $\beta = (1-\alpha)$.  (3.181)

Define the bracketed summation as $S' = \beta + 2\beta^2 + \ldots$. Multiply both sides of $S'$ by $\beta$ to get $\beta S' = \beta^2 + 2\beta^3 + \ldots$. Subtract $\beta S'$ from $S'$ and collect terms to get

$S'(1-\beta) = \beta + \beta^2 + \ldots$  (3.182)
$= \beta(1 + \beta + \ldots)$  (3.183)
$= \frac{\beta}{1-\beta}$  (3.184)

so that $S' = \beta/(1-\beta)^2$. You can verify this using polynomial long division. Substituting the definition of $\beta$, multiplying by $\alpha$ $(= 1-\beta)$, and collecting terms gives

$S = \frac{\alpha(1-\alpha)}{\alpha^2}$  (3.185)
$= \frac{1-\alpha}{\alpha}.$  (3.186)
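The Second Property can be checked the same way, truncating the sum where the tail is negligible (α is again an arbitrary choice):

```python
import numpy as np

alpha = 0.3
i = np.arange(1000)
weights = alpha * (1 - alpha) ** i

# Sum of i * w_i should equal (1 - alpha) / alpha
print(round((i * weights).sum(), 6), round((1 - alpha) / alpha, 6))
```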
The limit of the variance weights, $\frac{(1-\alpha)(1+\alpha^T)}{(1+\alpha)(1-\alpha^T)}$, as $\alpha$ approaches 1 is not simple to find. Taking the limit of this expression directly results in an indeterminate form: $0/0$. However, re-expressing this as

$\frac{1 - \alpha + \alpha^T - \alpha^{T+1}}{1 + \alpha - \alpha^T - \alpha^{T+1}}$

and then applying l'Hopital's Rule yields $1/T$ as the limit. This limit is then multiplied by $\sigma^2$ to give the variance of $Y_T(1)$.
3.10.6 Flat Exponential Averaging Forecast

To see that $Y_T(2) = Y_T(1)$, note that you can write

$Y_T(2) = \alpha Y_T(1) + \alpha(1-\alpha)Y_T + \alpha(1-\alpha)^2 Y_{T-1} + \ldots$  (3.187)
$= \alpha Y_T(1) + \alpha(1-\alpha)Y_T + \alpha(1-\alpha)^2 B Y_T + \ldots$  (3.188)
$= \alpha^2 Y_T \sum_{i=0}^{\infty} (1-\alpha)^i B^i + \alpha(1-\alpha) Y_T [(1-\alpha)^0 B^0 + (1-\alpha)^1 B^1 + \ldots]$  (3.189)
$= \alpha^2 Y_T \sum_{i=0}^{\infty} (1-\alpha)^i B^i + \alpha(1-\alpha) Y_T \sum_{i=0}^{\infty} (1-\alpha)^i B^i$  (3.190)
$= \alpha^2 Y_T \sum_{i=0}^{\infty} (1-\alpha)^i B^i + \alpha Y_T \sum_{i=0}^{\infty} (1-\alpha)^i B^i - \alpha^2 Y_T \sum_{i=0}^{\infty} (1-\alpha)^i B^i$  (3.191)
$= \alpha Y_T \sum_{i=0}^{\infty} (1-\alpha)^i B^i$  (3.192)
$= Y_T(1)$  (3.193)

where the last line is due to (3.133).
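The flat-forecast result can be demonstrated numerically by building $Y_T(1)$ and $Y_T(2)$ directly from the weighted sums in (3.187); the series and α below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.4
y = rng.normal(10, 2, 50)           # an observed series Y_1, ..., Y_T

# One-step ahead EWMA forecast: alpha * sum_i (1-alpha)^i * Y_{T-i}
rev = y[::-1]                        # Y_T, Y_{T-1}, ...
w = alpha * (1 - alpha) ** np.arange(len(y))
f1 = (w * rev).sum()                 # Y_T(1)

# Two-step ahead forecast per (3.187): alpha*Y_T(1) + alpha(1-alpha)Y_T + ...
f2 = alpha * f1 + ((1 - alpha) * w * rev).sum()

print(round(f1, 6) == round(f2, 6))  # True: the forecast profile is flat
```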
3.10.7 Variance of a Random Variable

The variance is defined as $V(X) = E[X - E(X)]^2$ for a random variable X. If $E(X) = \mu$, then $V(X) = E(X - \mu)^2$. Assume for simplicity that X is continuous. Then,

$V(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx.$  (3.194)

A property of the variance is $V(X) > 0$. If $V(X) = 0$, then the distribution is degenerate and the entire density is concentrated at one point. The density is then infinite (i.e., undefined). Now consider the special cases.

Case 1

$V(cX) = \int_{-\infty}^{\infty} (cx - c\mu)^2 f_X(x)\,dx$  (3.195)
$= \int_{-\infty}^{\infty} c^2 (x - \mu)^2 f_X(x)\,dx$  (3.196)
$= c^2 V(X)$  (3.197)

where c is a constant.
where c is a constant. Case 2 V (X + Y ) = V (X) + V (Y )
.
(3.198)
for two independent random variables, X and Y . Since they are independent, their covariance is zero, so .Cov(X, Y ) = 0. The covariance between two random variables, X and Y , is defined as Cov(X, Y ) = E{[X − E(X)][Y − E(Y )]} ∞ ∞ = [X − E(X)][Y − E(Y )]f(X,Y ) (x, y)dxdy
.
−∞ −∞
= σX,Y .
The joint pdf for two continuous random variables X and Y is represented as fX,Y (x, y). The marginal pdf for X is represented as .fX (x); similarly for Y . Note that the covariance is a general case and the variance is a special case of random variables so that
.
Cov(Y, Y ) = E[Y − E(Y )][Y − E(Y )].
.
(3.199)
3.10 Appendix
95
= E[Y − E(Y )]2.
(3.200)
= σY,Y .
(3.201)
=σ .
(3.202)
2
If the two random variables, X and Y, are independent, then from basic probability theory, you can write the joint pdf of X and Y as $f_{X,Y}(x, y) = f_X(x) f_Y(y)$. You cannot do this otherwise. Now it is easy to show that the covariance is zero:

$Cov(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [x - E(X)][y - E(Y)] f_{X,Y}(x, y)\,dx\,dy$  (3.203)
$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [x - E(X)][y - E(Y)] f_X(x) f_Y(y)\,dx\,dy$  (3.204)
$= \int_{-\infty}^{\infty} [x - E(X)] f_X(x)\,dx \int_{-\infty}^{\infty} [y - E(Y)] f_Y(y)\,dy$  (3.205)
$= E[X - E(X)] \, E[Y - E(Y)]$  (3.206)
$= 0 \times 0$  (3.207)
$= 0$  (3.208)

where $E[X - E(X)] = 0$. It is easy to show that $E(X + Y) = E(X) + E(Y)$ regardless of whether X and Y are independent or not. Use this in

$V(X + Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [(x + y) - (E(X) + E(Y))]^2 f_{X,Y}(x, y)\,dx\,dy$  (3.209)
$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} ([x - E(X)] + [y - E(Y)])^2 f_{X,Y}(x, y)\,dx\,dy$  (3.210)
$= \int_{-\infty}^{\infty} [x - E(X)]^2 f_X(x)\,dx + \int_{-\infty}^{\infty} [y - E(Y)]^2 f_Y(y)\,dy + 2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [x - E(X)][y - E(Y)] f_{X,Y}(x, y)\,dx\,dy$  (3.211)
$= V(X) + V(Y) + 2 Cov(X, Y)$  (3.212)
$= V(X) + V(Y)$  (3.213)

by the independence assumption. If X and Y are not independent, then $V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y)$. For two continuous random variables that are not necessarily independent, you can also see that
$V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2ab\,Cov(X, Y)$  (3.214)

where a and b are constants.
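The variance results in (3.197), (3.213), and (3.214) are easy to verify with simulated draws; the distributions and constants below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

x = rng.normal(0, 2, n)       # V(X) = 4
y = rng.normal(0, 3, n)       # V(Y) = 9, drawn independently of X
c = 5.0

# V(cX) = c^2 V(X)
print(round(np.var(c * x) / np.var(x), 2))   # 25.0

# V(X + Y) = V(X) + V(Y) for independent X and Y
print(round(np.var(x + y), 1))               # about 13.0

# General case: V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2ab Cov(X, Y)
a, b = 2.0, -1.0
lhs = np.var(a * x + b * y)
rhs = a**2 * np.var(x) + b**2 * np.var(y) + 2 * a * b * np.cov(x, y, ddof=0)[0, 1]
print(round(lhs - rhs, 6))                   # effectively zero
```

The last identity holds exactly in-sample (with `ddof=0` throughout); independence only matters for dropping the covariance term.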
3.10.8 Background on the Exponential Growth Model

Equation (3.158) is based on the exponential growth model:

$Y_t = A e^{gt}$  (3.215)

where A is a constant and g is the average growth rate. This model is widely used in economic analysis (e.g., economic growth), financial analysis (e.g., compound annual growth), and demographic studies (e.g., population growth), to mention just a few. The model can be derived by considering the financial problem of compound growth on an investment. Let r be the real interest rate or rate of return on an investment. This is the rate at which an investment will grow, so $r = g$. It is easy to see that the value of $1 invested for t periods (e.g., t years) grows to a future value of

$FV_r^t = (1 + r)^t.$  (3.216)

This is the compound growth formula. I assume that the interest rate, r, is an annualized number. So, $r = 0.05$ means the investment's value will grow by 5% each year on average. Sometimes, interest is paid for sub-annual periods, say monthly. In this case, an investor receives $r/m$ each sub-period, m. Clearly, $m = 4$ if the return is paid quarterly. The future value of the investment in this case is

$FV_r^t = \left(1 + \frac{r}{m}\right)^{mt}.$  (3.217)

The question is: "What is the future value if compounding is continuous?" That is, what is $\lim_{m \to \infty} FV_r^t = \lim_{m \to \infty} (1 + r/m)^{mt}$? It should be obvious that this limit, as it is written, is an indeterminate form.13 The limit, however, can be found by applying l'Hopital's rule. To do this, write the future value function as

$Z = \left(1 + \frac{r}{m}\right)^{mt}.$  (3.218)

13 The limit is $1^{\infty}$, which is indeterminate. See Granville et al. (1941).

Then,
$\ln(Z) = mt \ln\left(1 + \frac{r}{m}\right)$  (3.219)
$= \frac{\ln\left(1 + \frac{r}{m}\right)}{\frac{1}{mt}}$  (3.220)
$= \frac{h(m)}{g(m)}.$  (3.221)

Now apply l'Hopital's rule:

$\lim_{m \to \infty} \ln(Z) = \lim_{m \to \infty} \frac{h'(m)}{g'(m)}$  (3.222)
$= \lim_{m \to \infty} \frac{-\frac{r}{m^2} \frac{1}{1 + \frac{r}{m}}}{-\frac{1}{m^2 t}}$  (3.223)
$= \lim_{m \to \infty} \frac{rt}{1 + \frac{r}{m}}$  (3.224)
$= rt.$  (3.225)

Since $\ln$ (i.e., $\log_e$) is the inverse of e ($e = 2.7182818\ldots$), a constant, then

$\lim_{m \to \infty} \left(1 + \frac{r}{m}\right)^{mt} = e^{rt}.$  (3.226)

You can substitute g for r since an interest rate is a growth rate, but just for money. See Henderson and Quandt (1971) for this demonstration.14
14 Also see the responses at https://math.stackexchange.com/questions/539115/proof-of-continuous-compounding-formula for interesting variations for the demonstration.
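You can watch the limit in (3.226) emerge numerically as the compounding frequency m grows; the r and t below are arbitrary:

```python
import math

r, t = 0.05, 10

# Future value of $1 under increasingly frequent compounding
for m in (1, 12, 365, 1_000_000):
    fv = (1 + r / m) ** (m * t)
    print(m, round(fv, 6))

print(round(math.exp(r * t), 6))   # the continuous-compounding limit e^{rt}
```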
Chapter 4
Information Extraction: Advanced Time Series Methods
I developed a few commonly used prediction methods in Chap. 3, methods that rely on time series data. For many situations, primarily what I will later call operational scale-views (the day-to-day operations of a business that are narrowly focused), they are sufficient. There are, however, more complex situations for which these methods are insufficient. The data and problems are more intricate and often convoluted, and thus the prediction methods must be correspondingly more sophisticated. This will be the case for what I will later call tactical scale-views. The complexity of the data and problems dictates the methods. I will develop more intricate models in this chapter, models that still rely on time series data, but just more of it.
4.1 The Breadth of Time Series Data

There are two dimensions to any data set, whether for descriptive (i.e., Business Intelligence) or predictive (i.e., Business Analytics) purposes, and whether they are time series or not: their depth and breadth. The depth is the number of observations (i.e., the number of rows) of a data table or data set. In Pandas, the data table is called a DataFrame.1 The breadth is the number of columns, variables, or features, all interchangeable words. A one-dimensional data table has n observations but only one feature. This is a Series in Pandas terminology.2 A multidimensional DataFrame has n rows and p columns, so its shape is the two-tuple (n, p); a Series has the one-dimensional shape (n,). The methods of Chap. 3 apply to a one-dimensional Series. Even the linear trend model is in this class because the time variable is not necessarily a variable but an
1 Note the uppercase letters.
2 Note the uppercase letter.
index of a Series that distinguishes one row from another. The index gives meaning to each row. Just a Series is needed for simple problems. These problems involve predictions for, say, daily operations. For example, how many orders are expected tomorrow and the following day (a two-step ahead prediction) to plan capacity utilization and work assignments in, say, a job shop. Any of the methods of Chap. 3 would suffice. A single Series, however, is inappropriate for a more complex problem, usually one beyond operations. I will refer to these as tactical and strategic consistent with what I will later refer to as tactical and strategic scale-views. More “data” are needed to deal with these problems. For example, a tactical problem may be a price change that requires data on quantities ordered, price points, customer characteristics (e.g., income), competitive prices, product features, and time of year (e.g., Holiday Season). The depth is needed, but, more importantly, so is the breadth. A single Series prediction method is simply inadequate. A higher-level method is needed. It is these that I will discuss in this chapter. The time orientation will be maintained, but I will switch to cross-sectional data in Chap. 5 while maintaining a focus on the breadth of the data and more sophisticated prediction methods.
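A minimal sketch of the depth/breadth distinction in Pandas terms (the values below are arbitrary):

```python
import pandas as pd

# A one-dimensional Series: n observations, one feature
s = pd.Series([10.2, 11.5, 9.8, 12.1],
              index=pd.period_range("2023Q1", periods=4, freq="Q"))
print(s.shape)        # (4,)

# A DataFrame with n rows and p columns: depth n, breadth p
df = pd.DataFrame({"sales": [10.2, 11.5, 9.8, 12.1],
                   "price": [5.0, 4.8, 5.2, 4.9]})
print(df.shape)       # (4, 2)
```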
4.2 Introduction to Linear Predictive Models

I introduced the linear model in Chap. 3, but for a single independent variable which was the special case of time. The model could, of course, handle a more general case of any variable that determines the dependent variable, Y, not just time. You could expand the model to include several independent variables; you are not restricted to just one. In this more general form, the linear model with p independent variables is

$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip} + \epsilon_i$  (4.1)
$= \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} + \epsilon_i.$  (4.2)

If the data are time series, then the subscript i is replaced by t for time. The $p + 1$ terms, $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$, are parameters to estimate and the $\epsilon$ is the disturbance term. There are no hyperparameters to set. The parameters are estimated using the same procedure I outlined in Chap. 3: minimize the error sum of squares (SSE) which leads to normal equations that are solved simultaneously for the parameter estimators. Since there are now $p + 1 > 2$ parameters, there are $p + 1$ equations which are best solved using matrix algebra. I will not do this here. Suffice it to say that the solutions are generalizations of the two I showed in Chap. 3, as you should expect. In this case, the estimators in (3.151) result as special cases.
The $p + 1$ parameters are estimated with data in the $n \times (p + 1)$ DataFrame. The one added to p is for the target variable. The Classical Assumptions still apply. There is, however, a new one that states that there is no linear relationship between two or more independent variables. An exact or perfect linear relationship is called perfect multicollinearity. The linear trend model of Chap. 3 did not require this assumption because there was only one variable, but it is required now. The reason is purely mathematical: the normal equations cannot be solved if there is an exact linear combination of the independent variables. So, the assumption is that there is no multicollinearity at all: there is no relationship between or among the independent variables. I repeat the Classical Assumptions here for your convenience with the added multicollinearity assumption:

Normally Distributed: $\epsilon_i \sim N$, $\forall i$
Mean Zero: $E(\epsilon_i) = 0$, $\forall i$
Homoskedasticity: $V(\epsilon_i) = \sigma^2$, $\forall i$
Independence I: $Cov(\epsilon_i, \epsilon_j) = 0$, $\forall i \neq j$
Independence II: $Cov(\epsilon_i, X_{ij}) = 0$, $\forall i, j$; $j = 1, 2, \ldots, p$
Non-stochasticity: $X_j$, $j = 1, 2, \ldots, p$ are non-stochastic and fixed in repeated samples
No Multicollinearity: No relationship among the $X_j$, $j = 1, 2, \ldots, p$

Model (4.1) is a linear model, but a twist and complication is the disturbance term, $\epsilon_t$. We usually assume that $\epsilon_t \sim N(0, \sigma^2)$ and $Cov(\epsilon_t, \epsilon_{t-k}) = 0$ for $k = \pm 1, 2, \ldots$. That is, the covariance between the disturbance at time t and any other period $t \pm k$ is zero: there is no relationship among the disturbances. This covariance assumption, however, is usually violated with time series data, producing a problem known as autocorrelation. See Gujarati (2003), Goldberger (1964), Greene (2003), and Kmenta (1971) for discussions about autocorrelation. Kmenta (1971) is especially worth studying because he very meticulously derives the OLS estimators under autocorrelation. Under autocorrelation, the disturbance is assumed to be

$\epsilon_t = \rho \epsilon_{t-k} + u_t$  (4.3)
$t = 1, 2, \ldots, T$  (4.4)
$t > k$  (4.5)
$k = 1, 2, \ldots$  (4.6)
$-1 \leq \rho \leq +1$  (4.7)
$u_t \sim N(0, \sigma_u^2).$  (4.8)

For a one-period lag (i.e., $k = 1$), you have

$\epsilon_t = \rho \epsilon_{t-1} + u_t.$  (4.9)
This is a first-order autoregressive process succinctly written as AR(1). This model is also called a first-order stationary Markov Process. A Markov Process is one in which only the most recent past period (plus the contemporaneous noise) determines the current value of a series. Notice that if $\rho = 1$, this simplifies to a random walk model from Chap. 3. To implement this AR(1) model, you need an estimate of $\rho$. You can easily obtain one by recognizing that (4.9) is an OLS model sans the constant term. Consequently, you can simply write the estimator by emulating the OLS slope formula in (3.151), but without the mean terms, to get

$\hat{\rho} = \frac{\sum_{t=2}^{T} e_t e_{t-1}}{\sum_{t=2}^{T} e_{t-1}^2}.$  (4.10)

The $e_t$ is the regression residual which is an estimate of the disturbance term, $\epsilon_t$. To see this correspondence between $e_t$ and $\epsilon_t$, consider the definition of $e_t$: $e_t = Y_t - \hat{Y}_t$. Now substitute the equations for $Y_t$ and $\hat{Y}_t$ and rearrange terms:

$e_t = Y_t - \hat{Y}_t$  (4.11)
$= (\beta_0 + \beta_1 X_t + \epsilon_t) - (\hat{\beta}_0 + \hat{\beta}_1 X_t)$  (4.12)
$= (\beta_0 - \hat{\beta}_0) + (\beta_1 - \hat{\beta}_1) X_t + \epsilon_t$  (4.13)
$= \epsilon_t$  (4.14)

if $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased estimators of $\beta_0$ and $\beta_1$, respectively.
if .βˆ0 and .βˆ1 are unbiased estimators of .β0 and .β1 , respectively. Autocorrelation produces major estimation problems, so you must check for the violation of the Classical Assumption of no autocorrelation. There are two ways to do this: 1. Graphs of residuals; and/or 2. Formal tests. The simplest is just a plot of the residuals against time. Any assumption violation should be reflected in the residuals since they are estimates of the disturbances. Under the Classical Assumptions, a plot of the residuals against time should show a random pattern, so the signature for a violation is a nonrandom one. There are two possibilities:
Fig. 4.1 These are the two possibilities for an autocorrelation plot of the OLS residuals against time
Sine Wave: Indicates positive autocorrelation. Negative (positive) residuals are followed by negative (positive) residuals. Any jaggedness is due to random white noise, but the sine wave dominates.

Jagged, Sawtooth: Indicates negative autocorrelation. Negative residuals are followed immediately by positive residuals. There is no sine wave or any other discernible pattern.

I illustrate these possibilities in Fig. 4.1. A simple scatter plot of the residuals (on the Y-axis) vs. their one-period lag (on the X-axis) is a second useful plot. You can draw a vertical and horizontal line in the graph, each centered at zero since the mean of the residuals is zero.3 This divides the plot space into four quadrants:

• Upper left = I
• Upper right = II
• Lower left = III
• Lower right = IV

You then have the following possible patterns:

Positive Autocorrelation: Points cluster in Quadrants III and II.
Negative Autocorrelation: Points cluster in Quadrants I and IV.

I illustrate these patterns in Fig. 4.2.
3 The residual mean is zero because the sum of the OLS residuals is zero.
Fig. 4.2 This is an example of two autocorrelation plots of the OLS residuals and their one-period lagged values. The 45° and 135° lines are drawn for reference
Sometimes, residual plots are vague or ambiguous so a formal test is then required. The classic test is the Durbin-Watson Test, which is the oldest (from 1951) and the most used test. It relies on a test statistic, the Durbin-Watson d-statistic, calculated as

$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}$  (4.15)
$\approx 2(1 - \hat{\rho}).$  (4.16)

The last line is easy to show by merely expanding the numerator of (4.15) and collecting terms:

$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}$  (4.17)
$= \frac{\sum_{t=2}^{T} (e_t^2 - 2 e_t e_{t-1} + e_{t-1}^2)}{\sum_{t=1}^{T} e_t^2}$  (4.18)
$\approx 2 \frac{\sum_{t=2}^{T} e_t^2}{\sum_{t=1}^{T} e_t^2} - 2 \frac{\sum_{t=2}^{T} e_t e_{t-1}}{\sum_{t=1}^{T} e_t^2}$  (4.19)
$\approx 2(1 - \hat{\rho})$  (4.20)

where I treated $\sum_{t=2}^{T} e_t^2$ and $\sum_{t=2}^{T} e_{t-1}^2$ as approximately the same and used the definition of $\hat{\rho}$ in (4.10). The Null Hypothesis is $H_0$: $\rho = 0$. There are ranges of values for d which I summarize in Table 4.1.
Table 4.1 These are possible ranges for the Durbin-Watson Test Statistic in (4.15). The desirable value is d = 2 for no autocorrelation. I usually recommend values in the interval 2 ± 0.10 as acceptable.

If...            Then...         For...
ρ̂ = 0           d = 2           No autocorrelation
ρ̂ = -1          d = 4           Perfect negative autocorrelation
ρ̂ = +1          d = 0           Perfect positive autocorrelation
0 < ρ̂ < 1       0 < d < 2       Some autocorrelation
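The d-statistic in (4.15), and its approximate relationship to ρ̂ in (4.16), can be computed directly from residuals. The white-noise and AR(1) series below are simulated stand-ins, not case-study data:

```python
import numpy as np

rng = np.random.default_rng(12)

def durbin_watson(e):
    """Durbin-Watson d-statistic from (4.15)."""
    return (np.diff(e) ** 2).sum() / (e ** 2).sum()

# White-noise residuals: d should be near 2 (no autocorrelation)
e_white = rng.normal(size=5000)
print(round(durbin_watson(e_white), 1))   # about 2.0

# Positively autocorrelated residuals: d well below 2
u = rng.normal(size=5000)
e_pos = np.zeros(5000)
for t in range(1, 5000):
    e_pos[t] = 0.8 * e_pos[t - 1] + u[t]
print(round(durbin_watson(e_pos), 1))     # about 2*(1 - 0.8) = 0.4
```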
There are several assumptions for the Durbin-Watson statistic:

• The disturbances are generated by an AR(1) process.
• The model does not contain a lagged dependent variable. I discuss the effect of a lagged dependent variable below.
• The model has a constant term.

But there are also some problems:

• It is applicable for the AR(1) case only. Some disturbances are generated by AR(2) or other more complicated processes.
• There is an indeterminate region around 2.
• The model cannot contain a lagged dependent variable. Many models use one to capture dynamics. A variant of the Durbin-Watson (known as Durbin's h-statistic) can be used. This is defined as

  $h = \left(1 - \frac{d}{2}\right) \sqrt{\frac{T}{1 - T \hat{V}(\hat{\beta}_i)}}$  (4.21)

  where:
  – d is the Durbin-Watson statistic;
  – T is the number of observations used in the regression; and
  – $\hat{V}(\hat{\beta}_i)$ is the estimated variance of the lagged variable's estimated parameter.

  The Null Hypothesis is that there is no autocorrelation in the disturbance term.
• The statistic assumes that there are no missing observations.

There are other tests designed to overcome these shortcomings of the Durbin-Watson Test. An example is the Ljung-Box Test. See Gujarati (2003) and Stock and Watson (2011) for a discussion. Nonetheless, the Durbin-Watson d-statistic is still the major and most popular statistic. If any of the residual graphs and/or the Durbin-Watson Test suggest an autocorrelation problem, you can fix it with a transformation of the input variables based on an assumed AR(1) process. The fix is either the Cochrane-Orcutt Procedure or a Generalized Least Squares Procedure. See Gujarati (2003), Stock and Watson (2011), and Greene (2003) for discussions.
4.2.1 Feature Specification

A model like (4.1) could include:

• Contemporaneous X variables;
• Lagged X variables; and/or
• Lagged Y variables

on the right-hand side, making it a very general model. A contemporaneous feature variable has observations all in the same period and, more importantly, the periods are consistent with those of the target variable. In other words, the t subscripts indicating time all line up. The lagged X variables are meant to capture any delay in the effect of those feature variables on the target. For example, in a demand study that includes real disposable household income, there may be a delay in the effect of an increase in income on purchases because households have to revise their consumption habits, and habits require time to adjust. The number of lags is an empirical question that would be answered by trying different lag structures. The lagged Y variables have the same interpretation as the lagged X variables: they reflect a delay or, in this case, a carryover of Y from one period to the next. I will illustrate these generalizations of the features in Sect. 4.8.
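Lagged features are easy to construct with Pandas' shift method; the series and the one-period lag below are illustrative only:

```python
import pandas as pd

# A small stand-in data set with a target (sales) and a feature (income)
df = pd.DataFrame({"sales": [100, 104, 103, 108, 110, 115],
                   "income": [50, 51, 51, 52, 53, 53]})

df["income_lag1"] = df["income"].shift(1)   # lagged X variable
df["sales_lag1"] = df["sales"].shift(1)     # lagged Y variable

print(df.head(3))
```

Note that shifting introduces a missing value at the start of each lagged column, so the first lag-length rows are usually dropped before estimation.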
4.3 Data Preprocessing

I did not cover data preprocessing for the linear time trend model in Chap. 3 because that was not my purpose. Preprocessing involves graphing your data to identify trends, patterns, anomalies, and relationships as well as transforming your variables either through standardization or using the natural log function to linearize them and stabilize their variations. See Paczkowski (2022b, Chapter 5) for an in-depth discussion of data preprocessing. Once you have created the required time series data set, which I will refer to as the master data set, you must divide it into at least two parts:

Training data set: To estimate or train a model; and
Testing data set: To test the predictive power of the trained model.

As a rule of thumb, two-thirds or three-quarters of the master data set is randomly assigned to the training data set, and the rest to the testing data. See Paczkowski (2022b) for a detailed discussion about creating training and testing data sets. I will discuss this splitting in the following sections.
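A sketch of the random assignment described above, using Pandas' sample method on a stand-in master data set (the 75% fraction follows the rule of thumb; the data are fabricated for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A stand-in master data set of 120 monthly observations
master = pd.DataFrame({"y": rng.normal(size=120),
                       "x": rng.normal(size=120)})

# Randomly assign three-quarters to training, the rest to testing
train = master.sample(frac=0.75, random_state=42)
test = master.drop(train.index)

print(len(train), len(test))   # 90 30
```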
4.4 Model Fit vs. Predictability

A very common misconceived assumption is that the predictive usefulness of a model depends on how well it fits training data. This assumption is misconceived because it confuses two uses of a model. The first is to explain relationships and thus identify the key drivers for a target variable. Most academic studies are in this category since the academic charter is to expand our general knowledge base. One overarching theme of this research is the betterment or improvement of society in all its aspects: economic, political, safety, technological tools, and so forth. This expansion is reliant on relating factors encountered in the real world and determining causal relationships. There is a large philosophy of science literature that addresses the causality concept and whether or not we can identify causal relationships. See, for example, Pearl et al. (2016) and Pearl (2009). Also, see Pinker (2021, Chapter 9) for a readable discussion of correlation and causation. There is a large collection of articles at the Stanford Encyclopedia of Philosophy on causality ranging from Aristotle, to Hume, to Medieval Theories, to the modern day.4

The second use of models, and the one most commonly thought of, is to make predictions. This is certainly also an academic purpose as one of a two-step process to test a theory. First, a theory is developed to explain a natural or social phenomenon. See some interesting perspectives in Serlin (1987). A theoretical framework is very important because it provides a framework for all scientific (and nonscientific) activities. See Paczkowski (2022b, Chapter 1) for the role of theory in Business Analytics. There are "checks" to determine how well the theoretical framework performs such as its internal logic and consistency with other accepted theories. The second stage is to check the correspondence between the implications of the theoretical structure and experimental or observable data.
This is an empirical check. In this stage, the theory itself may not be directly amenable to empirical checking just because of the nature of the theoretical structure. Some theories are just too complex, and too mathematical to be used or checked as is. They are very abstract which enables them to get to the main factors and their relationships for whatever problem the theoretical framework is designed to handle. Abstraction is probably the hallmark feature of a theory. See Steingart (2023) for an interesting new perspective on abstraction in theoretical areas, primarily in mathematics. Abstractions, by their nature, exist “in thought or as an idea but not having a physical or concrete existence.”5 Since they are detached from something that exists, they are difficult to work with to empirically test the theory. Instead, implications of the theory, called testable hypotheses, are used to test the theory. If a testable hypothesis is logically derived from or implied by the theory, then the implication is that if the hypothesis agrees with the data, then so must the theory. The testable hypothesis is a prose statement that expresses the essence of the theory, is logically
4 There are 449 articles. Last checked on December 13, 2021.
5 Source: Oxford Languages. Last accessed December 15, 2021.
derived from the theory, and encapsulates relationships among key factors and, perhaps, a target variable. A classic example is Einstein’s famous testable hypothesis that light bends when passing our Sun. For a very readable account of this testable hypothesis, see Isaacson (2007). What makes a hypothesis testable? A Wikipedia article succinctly states that “a hypothesis is testable if there is a possibility of deciding whether it is true or false based on experimentation by anyone. This allows [you] to decide whether a theory can be supported or refuted by data. However, the interpretation of experimental [or observational] data may be also inconclusive or uncertain.”6 What is a model? There are many definitions and varieties of models ranging from mental models that we all use daily, to play or toy models used by children, to statistical models of how the world works. See Forrester (1968, 1971) on mental models and their use. A statistical model is the one that concerns me here. It is the statistical representation of the testable hypothesis and is, in fact, the same as the testable hypothesis except it is in statistical terms. It embodies the essential variables and their assumed relationship where this relationship is based on the testable hypothesis. As an example, the theory of consumer demand is quite abstract, relying on the concept of a utility function that has no real-world counterpart; this is why the theory is abstract. See Samuelson (1947) and Hicks (1946) for a highly advanced, but historically early, abstract treatment and development of consumer demand theory. The main implications drawn from this theoretical framework are that the quantity demanded decreases as the own-price of a good increases and demand increases as income increases. See Paczkowski (2018); Ferguson (1972) for some discussions. 
A regression model relating consumption by consumer i of good j, the price paid by consumer i for good j, and the income of consumer i might be

$Q_{ij} = \beta_0 + \beta_1 P_{ij} + \beta_2 I_i + \epsilon_i.$  (4.22)

The statistical hypotheses for price and income are

$H_{0,p}: \beta_1 = 0$  (4.23)
$H_{A,p}: \beta_1 < 0$  (4.24)

and

$H_{0,I}: \beta_2 = 0$  (4.25)
$H_{A,I}: \beta_2 > 0,$  (4.26)

respectively.
6 See https://en.wikipedia.org/wiki/Testability. Last accessed December 13, 2021. Clarifying comments added.
An estimated model might fit the data very well, and it might perform very well, as indicated by highly significant p-values for the estimated parameters and the F statistic and a high .R 2 . But this does not mean it will perform well when used for prediction. The fitting task and the predicting task are two different tasks. To check how well a fitted model predicts, you need to use it with a new data set, one that was not used in the fitting process. The one used in the model’s fitting stage is the training data set and the one used for testing the fitted model is the testing data set.
4.5 Case Study: Predicting Total Vehicle Sales

I will use a case study to illustrate a more general linear predictive modeling framework. This case study involves a (fictitious) manufacturer of motor vehicle after-market accessories for cars, pickup trucks, and noncommercial vans. These include but are not limited to floor mats (rubberized and other types), headrests, charging stations for smartphones and tablets, trash bins, and storage units such as cargo nets. Although an owner of a vehicle of any age could buy an after-market accessory, the most dominant buyer is a new car owner, especially for floor mats. This manufacturer developed excellent supply chain relationships, even during the COVID period when many manufacturers were having severe supply chain problems. These relationships enabled it to supply products to the market to satisfy consumer demand. Despite these relationships, it does not take them for granted so it maintains a reasonable inventory using a sophisticated inventory management system. A key input into this system is a forecast, produced by the data science team, of the demand for new vehicles for the next 12 months.
4.5.1 Modeling Data: Overview

The data science team assembled data on key measures of auto market activity and economic activity. These are summarized in a data dictionary which I show in Table 4.2. Some of these data were used to construct a series of new vehicle forecasting models that will be used in further predicting demand for their product lines.
4.5.2 Modeling Data: Some Analysis

A distinguishing feature of a motor vehicle is that it is a durable good. A durable good, or a hard good, or consumer durable (all interchangeable terms) is a product that does not quickly wear out or, more specifically, one that yields utility over time rather than being completely consumed in one use. Items like bricks could
Table 4.2 This is the data dictionary describing the data for the car sales Case Study. FRED: Federal Reserve Economic Database at https://fred.stlouisfed.org/

Variable                             Values                          Source        Mnemonic
Total new car sales                  Millions                        FRED          totCarSales
Domestic new car sales               Millions                        FRED          domCarSales
Foreign new car sales                Millions                        FRED          forCarSales
Auto inventory                       Millions                        FRED          autoInventory
Real disposable personal income      Million Dollars                 FRED          realDispIncome
Consumer sentiment index             Index                           FRED          consumerSentiment
Expected inflation rate, 1 year out  Percent                         FRED          expectedInflation1Yr
Consumer price index                 Index, 20XX = 100               FRED          CPI
Consumer price index, new cars       Index                           FRED          CPINewVehicles
Consumer price index, used cars      Index                           FRED          CPIUsedVehicles
Price of regular gasoline            Nominal dollars                 FRED          regGasPrice
Nominal auto loan rate               Percent, annual rate            FRED          autoLoan
Real price of new cars               Real dollars                    Constructed   realPriceNewCars
Real price of used cars              Real dollars                    Constructed   realPriceUsedCars
Real price of regular gasoline       Real dollars                    Calculated    realGasPrice
Real auto loan rate                  Percent, annual rate            Calculated    realInterestAnn
Real auto loan rate, smoothed        Percent, annual rate            Calculated    smoothRealInterestAnn
Recession dummy                      Recession = 1; 0 otherwise      Calculated    Recession
Structural break dummy               December 2009 = 1; 0 otherwise  Calculated    Dummy
be considered perfectly durable because they should (theoretically) never wear out. Highly durable goods such as refrigerators or cars are usually useful for several years, so durable goods are typically characterized by long periods between successive purchases.7 Other examples of consumer durable goods include bicycles, books, home appliances, consumer electronics, furniture, tools, sports and exercise equipment, jewelry, medical equipment, and toys. Durable goods are contrasted with nondurable goods which quickly wear out with use. Nondurable goods are constantly consumed and must, thus, be frequently replaced. Clothing is a prime example of a nondurable good.
7. See https://en.wikipedia.org/wiki/Durable_good for a brief discussion of durable goods. Last accessed July 31, 2022.
In addition to their longevity, durable goods also usually have a high price. Houses and cars are prime examples. As a result, consumers typically take out a loan to finance their purchase. For a new home, it is a mortgage loan; for a new car, it is an auto loan. All loans have an interest rate associated with them. Payments are typically made monthly for a fixed period of time, although terms can certainly vary. The interest rate and the time to pay off the loan impact the decision to buy the durable good. For a new home, 15- and 30-year mortgages are typical, while for cars, 48- and 60-month loans are typical. The periodic loan payment is an example of an annuity. An annuity is a fixed payment (or receipt of a payment) made regularly. The regularity is the period that the payment is made. For example, an annuity (i.e., loan payment) could be made each month. The amount paid is the annuity payment and the payment period is monthly. The payment could be made at the beginning of the period or the end. An ordinary annuity is a payment made at the end of the payment period (e.g., paid at the end of the month), while an annuity due is a payment made at the beginning of the payment period (e.g., paid at the beginning of the month). A monthly payment on a rental property is an example of an annuity due. A car payment is another example. I will return to annuities in Chap. 7. The interest rate for a car loan annuity is a nominal interest rate. It is the posted rate: the one posted in the bank lobby, in a newspaper, or online. All posted interest rates are quoted at an annual rate. This rate has an inflation component built into it, but that inflation rate hides the true or real interest rate you pay. The real rate is the nominal rate less the inflation rate. In this sense, the real rate is comparable in concept to other real economic concepts such as real GDP and real income.
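The end-of-period loan payment described here follows the standard ordinary-annuity formula. A minimal sketch, with made-up loan terms (not values from the case study):

```python
# Hedged sketch of an ordinary-annuity (end-of-period) loan payment:
# payment = P*r / (1 - (1 + r)**(-n)) for principal P, monthly rate r, and
# n monthly payments. The loan terms below are illustrative only.

def monthly_payment(principal, annual_rate, months):
    if annual_rate == 0:
        return principal / months     # no interest: just spread the principal
    r = annual_rate / 12.0            # posted annual rate to a monthly rate
    return principal * r / (1 - (1 + r) ** (-months))

# A $45,000 car financed at a 6% posted annual rate for 60 months:
print(round(monthly_payment(45_000, 0.06, 60), 2))
```

The total paid over the life of the loan exceeds the principal, which is why the interest rate enters the "full price" of the car discussed below.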
The relationship between nominal and real interest rates is given by the Fisher Equation, i ≈ r + ρ, where i is the nominal or posted interest rate, r is the real interest rate, and ρ is the inflation rate. The inflation rate should be the expected inflation rate because consumers and businesses plan for the next period. The current inflation rate is the rate of change of prices from the previous period to the current one, and so it is inappropriate for forward-looking decisions. I will, however, use it as a proxy for the expected inflation rate since obtaining estimates of the expected inflation rate is not trivial. See Haubrich et al. (2011) for background on expected inflation calculations. Interest rates, nominal or real, are highly volatile. You can see this in Fig. 4.3 for the real rate on a new car loan. This volatility hides or obscures the underlying pattern in interest rates. It is best from an empirical perspective to smooth the interest rate time series using one of the methods from Chap. 3. I did this using the exponential smoothing method and show the smoothed series in Fig. 4.4. I used α = 0.2 for the smoothing factor. Consumers can hold onto their cars for an extended period depending on their economic circumstances, and thus “run their car into the ground.” The website Autotrader.com notes that “In general, . . . people do not really keep their cars forever. Research by R.L. Polk says that the average age of a modern vehicle is 11.4 years, while the average length of time drivers keep a new vehicle is 71.4 months—
Fig. 4.3 This shows the real or inflation-adjusted interest rate on new car loans. Source: Federal Reserve Economic Database (FRED). Missing values linearly interpolated
Fig. 4.4 This shows the smoothed real interest rate on new car loans. Compare this series to the one in Fig. 4.3
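The smoothing used for Fig. 4.4 can be sketched in a few lines. This is a generic simple exponential smoother with α = 0.2; the rate values are made up for illustration. (With a Pandas Series, the equivalent recursion is Series.ewm(alpha=0.2, adjust=False).mean().)

```python
# Hedged sketch of simple exponential smoothing with smoothing factor
# alpha = 0.2: s_t = alpha*y_t + (1 - alpha)*s_{t-1}, starting at s_0 = y_0.
# The rate values below are illustrative, not the FRED series.

def exponential_smooth(y, alpha=0.2):
    smoothed = [y[0]]
    for value in y[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

rates = [4.1, 6.3, 2.2, 5.8, 3.0, 4.9]   # a volatile "real rate" series
print([round(s, 2) for s in exponential_smooth(rates)])
```

Notice that the smoothed series has a much smaller range than the raw series, which is exactly the effect visible when comparing Figs. 4.3 and 4.4.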
Fig. 4.5 This shows the real price of new cars. It is the CPI for new vehicles divided by the CPI: All Items. Source: Federal Reserve Economic Database (FRED)
around 6 years.”8 Since cars are durable goods, the interest rate is a prime factor in a purchase decision. The interest rate, which determines the monthly interest payment on the loan, is added to the dollar price of the car to get its full price. Economic theory suggests that the higher the interest rate, the fewer cars will be sold because the full price is higher. Also, the higher the price of the car sans the loan payments, the fewer cars will be sold. There are, however, two dollar prices, the nominal price and the real price, just as there are two interest rates. The nominal price is the one quoted for the purchase; it is the sticker or posted price. For a new car priced at $45,000, for example, the nominal price is $45,000. This price varies over time in part because of the inflation rate. The real price is this nominal price adjusted for inflation. Economic theory states that it is not the own-price of the good that matters, but the relative or real price. The relative real price is the nominal price relative to the price of all other goods that can be purchased. It is the nominal price divided by a price index such as the CPI. This relative real price gives the consumer a more realistic view of the cost of the product in comparison to all other goods that he/she can buy. The higher the relative real price, the lower the rate of consumption. I show the time series for the relative real price of new cars in Fig. 4.5. Economic theory also posits that real disposable income is a prime factor determining consumption. The higher the income, the more is purchased
8. See https://www.autotrader.com/car-shopping/buying-car-how-long-can-you-expect-car-last-240725#:∼:text=Automotive%20Averages,71.4%20months%20%E2%80%94%20around%206%20years). Last accessed July 26, 2022.
Fig. 4.6 This shows the real disposable (i.e., net of taxes) personal income. Source: Federal Reserve Economic Database (FRED). Notice the extreme volatility around 2020. This is the COVID period
because a consumer has more ability to buy. This income is disposable income, which is gross income net of taxes. I show the time series for real disposable income in Fig. 4.6. Another factor important for new car sales is their inventory. Consumers can, in most instances, see the inventory at the dealership. Cars are kept in a parking lot near the showroom. If the inventory is high (i.e., the parking lot is full), then cars are in high supply, so consumers have more bargaining power and are more willing to shop for a new car, which puts downward pressure on prices. If the inventory is low (i.e., the parking lot is empty), then cars are in short supply and prices will be bid up. I show the inventory of new cars in Fig. 4.7. Used cars cannot be ignored since they are a substitute for new cars. If the price of a new car rises, you should expect the sales of used cars to increase as consumers substitute from the now more expensive new cars to the relatively less expensive used cars. But this will drive up the price of those used cars. You can see the relationship in Fig. 4.8. Finally, overall economic activity impacts new car sales in part through its impact on expectations. For example, if the economy moves into a recession, auto sales will decline because people will anticipate becoming unemployed. In fact, auto sales are a leading economic indicator. “Next to real estate, auto sales are the most adversely affected by rising interest rates, with sales turning down historically 12–24 months prior to the end of an expansion.”9 There were two recessions during the period I used for this example: December 2007 to June 2009, and a very short one in April

9. Source: “Do Snails Pace Auto Sales Mean the Economy is Slowing?” at https://realestate.business.rutgers.edu/news/do-snails-pace-auto-sales-mean-economy-slowing. Last accessed July 31, 2022.
Fig. 4.7 This shows the new car inventory. Source: Federal Reserve Economic Database (FRED)
Fig. 4.8 This shows the relationship between new and used car prices. Source: Federal Reserve Economic Database (FRED)
Fig. 4.9 There appears to be a structural break in new car sales in January 2009. A Chow Test, shown in Fig. 4.11, confirms this. The break should be included in a regression model of sales
2020, which could be attributed to the COVID shutdown.10 See Romer and Romer (2020) for some technical discussions of business cycle dating. I show a plot of the domestic new car sales in Fig. 4.9. There appears to be a break in the sales pattern in January 2009. I tested this apparent break using the Chow Test, which tests for a structural break, or discontinuity, in a time series that causes the regression line to have two components or pieces. I illustrate such a break in Fig. 4.10. The break is at time t_0. See Chow (1960) for the original development of the test. Also see the Wikipedia article for the Chow Test at https://en.wikipedia.org/wiki/Chow_test. The test is conducted by specifying the time when a structural break is believed to have occurred. In Fig. 4.10, this is at time t_0. Without a break, the line would continue as expected along the dashed path; with the break, the line deviates from this path and moves along the solid line with a reduced slope. A single regression line would be incorrect because it assumes that the path of the data follows just one slope. The Chow Test involves conducting three regression runs. The first is a pooled regression, which is the complete solid-dashed line in Fig. 4.10; this includes all the periods. The second run is for the periods up to the point where a break is assumed to have occurred. The third is for the periods after the break is assumed to have occurred. For each run, the regression error sum of squares (SSE) is saved. These are used to calculate an F-statistic defined by the number of independent variables, k. For this example, I chose to use only a time trend, so k = 1.

10. See “US Business Cycle Expansions and Contractions” at https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions. Last accessed July 31, 2022.
Fig. 4.10 This illustrates a structural break in a time series. The break occurs at time .t0 . The solid line, pre- and post-.t0 , is actual data, while the dashed line segment is “what would have been” data. Notice that there are three, not one, solid line segments: the segment up to .t0 , the one after .t0 , and the entire solid line
The formula for the test is

F = [(SSE_p − (SSE_1 + SSE_2)) / k] / [(SSE_1 + SSE_2) / (n_1 + n_2 − 2k)] ∼ F_{k, n_1+n_2−2k}    (4.27)

where SSE_p is the pooled SSE, SSE_1 is the first regression SSE, and SSE_2 is the second regression SSE. The n_1 and n_2 are the respective numbers of observations. This statistic is distributed following an F-distribution with k and n_1 + n_2 − 2k degrees of freedom. The Null Hypothesis is that there is no structural break, and the Alternative Hypothesis is that there is one at time t_0. I show the Chow Test for car sales in Fig. 4.11. The p-value for the F-test is less than 0.05, which indicates that the Null Hypothesis should be rejected: there is a structural break at January 2009. This break will be accounted for by defining a dummy variable that is 0 before the break and 1 after.
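As a hedged, pure-Python sketch of the three-regression procedure in Eq. (4.27) with a time trend as the only regressor (k = 1): the function names and the synthetic series below are illustrative; the book's own functions are listed in its Appendix.

```python
# Hedged sketch of the Chow Test in Eq. (4.27) for k = 1 (a time-trend
# regressor plus an intercept), in pure Python.

def fit_trend_sse(y):
    """OLS of y on an intercept and a time trend; returns the error sum of squares."""
    n = len(y)
    t = list(range(n))
    tbar, ybar = sum(t) / n, sum(y) / n
    slope = (sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y))
             / sum((ti - tbar) ** 2 for ti in t))
    intercept = ybar - slope * tbar
    return sum((yi - (intercept + slope * ti)) ** 2 for ti, yi in zip(t, y))

def chow_f(y, break_at, k=1):
    """F-statistic of Eq. (4.27) for a hypothesized break at index break_at."""
    sse_p = fit_trend_sse(y)             # pooled regression, all periods
    sse_1 = fit_trend_sse(y[:break_at])  # pre-break regression
    sse_2 = fit_trend_sse(y[break_at:])  # post-break regression
    n1, n2 = break_at, len(y) - break_at
    return ((sse_p - (sse_1 + sse_2)) / k) / ((sse_1 + sse_2) / (n1 + n2 - 2 * k))

# A series whose slope drops sharply at t = 10 yields a very large F-statistic.
y = [t + (0.01 if t % 2 else -0.01) for t in range(10)] \
    + [9 + 0.1 * t + (0.01 if t % 2 else -0.01) for t in range(1, 11)]
print(chow_f(y, 10))
```

A large F relative to the F_{k, n_1+n_2−2k} critical value leads to rejecting the no-break Null, exactly the conclusion drawn for January 2009 in Fig. 4.11.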
4.5.3 Linear Model for New Car Sales

I specified a linear model for new car sales as a function of the inflation rate of new cars as measured by the CPI of new vehicles, the smoothed real interest rate on an annual basis, the level of the car inventory lagged one period, the dummy variable for the break in car sales, and a recession dummy for economic activity. The Classical Assumptions that I discussed in Chap. 3 are assumed to hold. I show some code in Fig. 4.12 for adding the dummy variables to the training data set.
Fig. 4.11 These are the Chow Test results for a breakpoint of January 2009. The functions I used are listed in the Appendix for reference
Fig. 4.12 This is the code to add the structural break and recession dummies to the training data set
The model definition is shown on the left side of Fig. 4.13. I used a (natural) log transformation because this is, first, standard in demand modeling, but second, and more importantly, the estimated parameters for the logged variables are the respective elasticities. See Paczkowski (2018) for a simple proof of this. I show the regression results on the right side of Fig. 4.13. The basic statistics, such as the R², F-statistic, and the p-values, look very good. The signs on the estimated parameters make intuitive sense. For example, when the price of new cars increases, the sales of new cars decrease. The estimated own-price elasticity is −1.685, which indicates that the demand for new cars is price elastic. The estimated cross-price elasticity for used cars is 1.8858, which indicates that as the price of used cars increases, the sales of new cars increase: they are substitutes, so when the price of one increases, the sales of the other increase. Although the results in Fig. 4.13 look acceptable, there is one statistic that is not. This is the Durbin-Watson statistic, which tests the Independence Classical Assumption. Its value of 0.679 is too low; a value of 2.0 is ideal. This low value indicates positive autocorrelation in the disturbance term, which must be corrected. I show the correction in Fig. 4.14. For this, I found that an AR(2) model worked best. Since cars are durable goods, I need a lagged dependent variable to reflect a delay in consumer purchases of a new car. I included one in the model specification and show the results in Fig. 4.15. The Durbin-Watson statistic is 1.949, which is very close to 2.0. Unfortunately, this model violates one of the conditions for the application of the Durbin-Watson Test: there cannot be a lagged dependent variable in the model. The appropriate
Fig. 4.13 This is the code to run a regression model and the regression results
Fig. 4.14 This is the code to run an autoregression correction model and the regression results. The Durbin-Watson statistic is now about 2.0
Fig. 4.15 This is the code to run a regression model with a lagged dependent variable
test statistic is Durbin’s h-statistic which is a transformation of the Durbin-Watson statistic.11 I show the results of this test in Fig. 4.16. The Null Hypothesis is that there is no autocorrelation in the disturbance term. The Null Hypothesis is not rejected.
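A hedged sketch of both statistics: the Durbin-Watson statistic computed from a residual series, and Durbin's h computed from DW, the sample size, and the estimated variance of the lagged-dependent-variable coefficient. All numeric inputs below are made up for illustration; they are not the case study's estimates.

```python
import math

# Hedged sketch: the Durbin-Watson statistic from a residual series, and
# Durbin's h-statistic derived from it when a lagged dependent variable is
# present. Residuals, sample size, and coefficient variance are illustrative.

def durbin_watson(resid):
    """DW = sum of squared successive residual changes over the residual sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return num / sum(e ** 2 for e in resid)

def durbin_h(dw, n, var_lag_coef):
    """Durbin's h = (1 - DW/2) * sqrt(n / (1 - n * Var(lagged-Y coefficient)))."""
    return (1 - dw / 2) * math.sqrt(n / (1 - n * var_lag_coef))

resid = [0.5, -0.4, 0.3, -0.5, 0.4, -0.3, 0.5, -0.4]  # alternating signs: DW near 4
print(round(durbin_watson(resid), 3))

# h is approximately standard normal under the Null of no autocorrelation.
print(round(durbin_h(1.949, 180, 0.0002), 3))
```

Note that h is only defined when n × Var(lagged-Y coefficient) < 1; when |h| is below the standard normal critical value (1.96 at the 5% level), the no-autocorrelation Null is not rejected.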
4.6 Stochastic (Box-Jenkins) Time Series Models

I will consider a very broad, general class of models sometimes called stochastic time series models, or Box-Jenkins models, or just time series models. In this class, the random element plays a dominant role rather than being just an add-on disturbance term to a strictly deterministic model. A model is needed just as in the econometric case. The difference here is that a specific model can be chosen from the class of models that is almost hierarchical: you can start with a very simple one and progress to a complex one by expanding or rewriting the terms. The initial model results from a process known as identification: you select a model based on characteristics or signatures of key data visualizations. This does not mean that the one you select is the best or final model. It is only a candidate to start working with and that will be modified. Most often, several candidates are identified.

11. See https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic#Durbin_h-statistic. Last accessed August 3, 2022.
Fig. 4.16 This is Durbin’s h-Test as the substitute for the Durbin-Watson Test when a lagged dependent variable is used. Notice that the Null Hypothesis of no autocorrelation is not rejected. The function is listed in the Appendix
It is important to note that at least one candidate model is needed. This holds for an econometric approach as well as for a time series approach. You need a starting point, an initial model, that you can subsequently modify based on statistical tests and predictive accuracy. Econometric models have the advantage of an economic theoretic structure for the initial model. There is no guide for what the initial model should look like for a time series as there is for an econometric model since the time series model is primarily based on one series. Aside from how a candidate model is specified, the estimation process is the same. Each candidate is estimated as a second stage of model development using a training data set and then tested, as a third stage, using a testing data set. A final model is selected and used for forecasting. See Box et al. (1994) for a description of the four time series modeling stages (identification, estimation, testing, and forecasting) and Hill et al. (2008) for the econometric stages.
4.6.1 Model Identification

The key tools for candidate stochastic model identification are the autocorrelation function and the partial autocorrelation function. Autocorrelations are the correlations of each value of a time series against other values in the same series at different lags, that is, a one-period lag, a two-period lag, and so on for a k-period lag. If ρ_k is the autocorrelation at lag k, then
ρ_k = COV(Y_t, Y_{t−k}) / V(Y_t)    (4.28)
    = γ_k / γ_0    (4.29)

where γ_k = COV(Y_t, Y_{t−k}) is the covariance between a value of the time series, Y, at time t and time t − k, and γ_0 = V(Y_t) is the variance of the series (i.e., the covariance at lag 0). The empirical version of the ACF is

r_k = Σ_{t=k+1}^{T} (Y_t − Ȳ)(Y_{t−k} − Ȳ) / Σ_{t=1}^{T} (Y_t − Ȳ)²    (4.30)

Since

γ_k = COV(Y_t, Y_{t−k}) = COV(Y_{t−k}, Y_t) = COV(Y_{t′}, Y_{t′+k})    (4.31)
    = γ_{−k}    (4.32)

where t′ = t − k, it follows that ρ_k = ρ_{−k}, and so only the positive half of the ACF is needed. The autocorrelations as a function of the k lags are referred to as an autocorrelation function (ACF) or, sometimes, a correlogram. The ACF is usually plotted against the lags with 95% confidence bounds. While the basic concept of (auto)correlations is familiar to you from a basic statistics course, the partial autocorrelation function (PACF) may be new. This function adjusts or controls for the effects of intermediate periods in the lag structure when calculating the ACF. The calculation of the partial autocorrelations is more complex than for the autocorrelations. See Box et al. (1994) and especially Wei (2006) for discussions. As I stated above, the ACF and PACF are used to identify candidate models for a time series. There is a hierarchical family of models, with the simplest being the autoregressive of order 1 (AR(1)) model. This is written as

Y_t = φY_{t−1} + a_t    (4.33)

where |φ| < 1 and a_t is white noise. A random process {a_t : t = 1, 2, . . .} is white noise if E(a_t) = 0 for all t and

γ_k = COV(a_t, a_{t+k}) = σ² if k = 0; 0 otherwise    (4.34)
If φ = 1, then this is a random walk model as a special case. A general version of the autoregressive model is AR(p) for p lags. Other models are variations on this basic one. Each model has a signature given by the ACF and PACF. I state the
Table 4.3 This table provides the signatures for the AR(p) model. More complex models have signatures that are variations and extensions of these

Process          Autocorrelations       Partial autocorrelations
Autoregressive   Decay exponentially    Spikes at lags 1 to p, then cuts off
Fig. 4.17 This illustrates the two key graphs for identifying a time series model. Panel (a) is the ACF and Panel (b) is the PACF. Notice how the autocorrelations in Panel (a) decay exponentially and become statistically zero at lag 4. The partial autocorrelations in Panel (b) have significant spikes at lags 1 and 2 and are zero thereafter. Also, note that the correlations in both panels are 1.0 at lag 0 since the whole series is correlated with itself: ρ_0 = γ_0/γ_0 = 1
signature for the ACF and PACF for the AR(p) in Table 4.3. See Box et al. (1994) and especially (Wei, 2006) for discussions. I illustrate the ACF and PACF for some illustrative annual sales data in Fig. 4.17. Each plot displays the correlations for 10 lags. Since the lags are discrete, spikes or needles are shown for each of the 10 lags. The shaded areas are 95% confidence intervals that allow you to judge the statistical significance of each correlation. For any spike inside the shaded area, the corresponding correlation is interpreted as statistically insignificant from zero. Usually, once a spike falls inside the shaded area, then that correlation and all succeeding correlations are interpreted as statistically insignificant, even if an occasional spike is outside the shaded area; such an aberrant spike may represent a correlation due to random noise and so can be ignored.
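To make Eq. (4.30) concrete, here is a hedged, pure-Python sketch of the empirical ACF; in practice you would use statsmodels' acf/pacf functions (or the plot_acf/plot_pacf graphics that produce figures like Fig. 4.17). The series is made up for illustration.

```python
# Hedged sketch of the empirical ACF of Eq. (4.30) in pure Python.

def acf(y, k):
    """Empirical autocorrelation r_k of series y at lag k, per Eq. (4.30)."""
    n = len(y)
    ybar = sum(y) / n
    num = sum((y[t] - ybar) * (y[t - k] - ybar) for t in range(k, n))
    return num / sum((yt - ybar) ** 2 for yt in y)

# A smoothly decaying series: the autocorrelations decay with the lag,
# the autoregressive signature in Table 4.3, with r_0 = 1 at lag 0.
y = [0.8 ** t for t in range(50)]
print([round(acf(y, k), 3) for k in range(4)])
```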
4.6.2 Brief Introduction to Stationarity

An important simplifying assumption in time series analysis is stationarity, in which the generating process for the time series is said to be in “statistical equilibrium.”
Fig. 4.18 These are three possible time series patterns. Panel (a) shows a stationary series that is completely random around a mean of zero. The other two panels are nonstationary. The stationary pattern in Panel (a) is the most desirable. (a) Random. (b) Wandering. (c) Trending
Simply stated, a time series is stationary if its mean and variance are unaffected by when you look at the series; it fundamentally looks the same at any point in time: it neither explodes, nor trends, nor wanders without returning to its mean. See Hill et al. (2008). Otherwise, the series is nonstationary. I illustrate a stationary and two possible nonstationary series in Fig. 4.18. Recall my discussion of the moving average window in Chap. 3. You should recognize this stationarity concept as a window of size m that just slides horizontally through a time series graph of the data. If the window can move sideways and still encompass the data points, then the series is stationary. Otherwise, it is nonstationary. The AR(1) process Y_t = ρY_{t−1} + a_t is stationary when |ρ| < 1. But when ρ = 1, it becomes a nonstationary random walk process. Because of this, you should test
whether ρ is equal to one or significantly less than one. Two tests, known as unit root tests for stationarity, are the Dickey-Fuller Test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test. The Null Hypothesis for the Dickey-Fuller Test is a unit root (i.e., ρ = 1). If the Null Hypothesis is not rejected, then there is evidence that the series is nonstationary. The Null Hypothesis for the Kwiatkowski-Phillips-Schmidt-Shin Test is level or trend stationarity. This is the opposite of the ADF test.

Dickey-Fuller Test Consider the AR(1) model: Y_t = ρY_{t−1} + a_t. If you subtract Y_{t−1} from both sides, you get

Y_t − Y_{t−1} = ρY_{t−1} − Y_{t−1} + a_t    (4.35)
ΔY_t = (ρ − 1)Y_{t−1} + a_t    (4.36)
     = γY_{t−1} + a_t    (4.37)

where γ = ρ − 1. Clearly, if ρ = 1, then γ = 0. So, you test for non-stationarity by testing the Null Hypothesis that ρ = 1 or γ = 0 against the alternative that |ρ| < 1, or simply ρ < 1. The hypotheses are then

H_0: ρ = 1 ⟺ H_0: γ = 0    (4.38)
H_A: ρ < 1 ⟺ H_A: γ < 0    (4.39)

A variant of the Dickey-Fuller Test includes a constant term, δ, which is a drift factor as in the random walk with drift model in Chap. 3:

ΔY_t = δ + (ρ − 1)Y_{t−1} + a_t    (4.40)

This is the constant mean model if ρ = 1. A third variant includes a constant and a trend:

ΔY_t = δ + (ρ − 1)Y_{t−1} + λt + a_t    (4.41)
which is the linear trend model if ρ = 1. The Null and Alternative Hypotheses for the three versions are the same. An important extension allows for the possibility that the disturbance is autocorrelated. This is the problem I discussed earlier for the econometric formulation. This extension is the Augmented Dickey-Fuller Test (ADF). In practice, you should always use the Augmented Dickey-Fuller Test. To apply it, or any of the variants, first plot the time series and examine the pattern around the sample average. If the series fluctuates around a zero sample average, use the basic ADF. If it fluctuates around a nonzero sample average, use the ADF with a constant. Otherwise, if it fluctuates around a linear trend, use the ADF with a constant and trend.12

12. Based on Hill, R. C. et al. Principles of Econometrics. (John Wiley & Sons, Inc., 2008).
Kwiatkowski-Phillips-Schmidt-Shin Test An alternative to the ADF test is the Kwiatkowski-Phillips-Schmidt-Shin Test (KPSS). The Null Hypothesis is that a time series is stationary around a deterministic trend (i.e., trend-stationary); the Alternative Hypothesis is that there is a unit root. My recommendation is to use the ADF test.
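In practice you would call adfuller (and kpss) from statsmodels.tsa.stattools, which also supply the proper critical values. As a purely illustrative, hedged sketch of the idea behind Eq. (4.37), you can regress ΔY_t on Y_{t−1} and inspect γ̂ = ρ̂ − 1; note the full test compares the t-ratio of γ̂ to Dickey-Fuller critical values, not the usual t-table values.

```python
import random

# Illustration-only sketch of the Dickey-Fuller regression in Eq. (4.37):
# regress the first difference of Y on the lagged level of Y.

def df_gamma(y):
    """No-constant OLS slope of (Y_t - Y_{t-1}) on Y_{t-1}: gamma-hat = rho-hat - 1."""
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]
    lag = y[:-1]
    return sum(l * d for l, d in zip(lag, dy)) / sum(l ** 2 for l in lag)

random.seed(42)
shocks = [random.gauss(0, 1) for _ in range(500)]

y_stat = [0.0]                 # stationary AR(1) with rho = 0.5
for a in shocks:
    y_stat.append(0.5 * y_stat[-1] + a)

y_rw = [0.0]                   # random walk: rho = 1, a unit root
for a in shocks:
    y_rw.append(y_rw[-1] + a)

# gamma-hat is well below zero for the stationary series, near zero for the walk.
print(round(df_gamma(y_stat), 3), round(df_gamma(y_rw), 3))
```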
4.6.3 Correcting for Non-stationarity

What do you do if a time series graph or a statistical test indicates non-stationarity? The easiest correction to the data to induce stationarity is to first-difference the series. The first difference of a time series, Y_t, is defined as

ΔY_t = Y_t − Y_{t−1}    (4.42)

Using the backshift operator, this can be expressed as

ΔY_t = Y_t − BY_t    (4.43)
     = (1 − B)Y_t    (4.44)

You can see that Δ = 1 − B. The Δ is called the first difference operator. In most instances, you only need a first difference to induce stationarity. Occasionally, a second difference is needed, and this is defined as

Δ²Y_t = Δ(ΔY_t)    (4.45)
      = Δ(Y_t − Y_{t−1})    (4.46)
      = (Y_t − Y_{t−1}) − (Y_{t−1} − Y_{t−2})    (4.47)
      = Y_t − 2Y_{t−1} + Y_{t−2}    (4.48)
If your data are in a Pandas DataFrame, then you can use the diff method to difference a Series. The default is a one-period difference. You can specify other differencing with the periods argument, so diff(periods = 2) would return a two-period difference. I illustrate this in Fig. 4.19.
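A hedged pure-Python sketch of Eqs. (4.42) and (4.48); with a Pandas Series you would simply call y.diff() or y.diff(periods=2). (Note that diff(periods=2) is the two-period difference Y_t − Y_{t−2}, not the second difference, which is diff().diff().)

```python
# Hedged pure-Python sketch of first and second differencing,
# Eqs. (4.42) and (4.48).

def first_difference(y):
    """Delta Y_t = Y_t - Y_{t-1}, per Eq. (4.42)."""
    return [y[t] - y[t - 1] for t in range(1, len(y))]

def second_difference(y):
    """Delta^2 Y_t = Y_t - 2*Y_{t-1} + Y_{t-2}, per Eq. (4.48)."""
    return [y[t] - 2 * y[t - 1] + y[t - 2] for t in range(2, len(y))]

# A linear trend: its first difference is a constant (stationary), and
# its second difference is identically zero.
y = [3 + 2 * t for t in range(6)]
print(first_difference(y))   # [2, 2, 2, 2, 2]
print(second_difference(y))  # [0, 0, 0, 0]
```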
4.6.4 Predicting with the AR(1) Model

To generate a forecast using any of the estimated stochastic models, you follow four steps:
Fig. 4.19 This illustrates how to calculate the first difference of a time series, using periods = 1, which is the default. The domestic car sales variable was used. Notice how the first difference is stationary around zero
1. Compute a_{T−k}, k ≥ 0, from the model fit.
2. Substitute observed or expected values for all terms in the model to get Y_T(h), h ≥ 1. The expected values of all future a_{T+h} values are 0. The expected values of future Y_{T+h} values are given by the predictions Y_T(h), h = 1, 2, . . ..
3. Compute Y_T(1), Y_T(2), . . . , Y_T(h).
4. Calculate variance formulas to get confidence intervals for the predictions.
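The steps above can be sketched for a fitted AR(1). This is a hedged sketch under assumed, made-up estimates of φ and the disturbance variance σ² (not values from the case study): the h-step prediction is Y_T(h) = φ^h Y_T, with forecast variance σ²(1 + φ² + . . . + φ^{2(h−1)}).

```python
import math

# Hedged sketch of AR(1) forecasting; phi and sigma2 are illustrative
# assumed estimates, not fitted values from the book.

def ar1_forecasts(y_T, phi, sigma2, horizon, z=1.96):
    """h-step predictions Y_T(h) = phi**h * y_T with 95% confidence intervals.

    Forecast variance at horizon h: sigma2 * sum(phi**(2*j) for j in 0..h-1).
    """
    out = []
    for h in range(1, horizon + 1):
        point = phi ** h * y_T
        var = sigma2 * sum(phi ** (2 * j) for j in range(h))
        half = z * math.sqrt(var)
        out.append((point, point - half, point + half))
    return out

for point, lo, hi in ar1_forecasts(y_T=10.0, phi=0.8, sigma2=1.0, horizon=3):
    print(round(point, 3), round(lo, 3), round(hi, 3))
```

Notice that the intervals widen with the horizon, reflecting the accumulating uncertainty in step 4.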
4.7 Advanced Time Series Models

Advanced models are extensions of the basic AR(1) model and thus form a family. The AR(1) is just one member. The other family members are

Moving Average of order q (MA(q)) This model is Y_t = a_t + θ_1 a_{t−1} + . . . + θ_q a_{t−q}, so it is a function of the contemporaneous and past white noises. The name “MA” is misleading since the weights do not necessarily sum to 1.0. Do not confuse this with the moving average method to smooth a time series. Note that the dual of an MA(q) process is an AR process of infinite order, or AR(∞). This model comes from the AR(1) model. To see how, consider the model:

Y_t = ρY_{t−1} + a_t    (4.49)
    = ρBY_t + a_t    (4.50)
    = (1 − ρB)^{−1} a_t    (4.51)

I noted above when I introduced the backshift operator that (1 − ρB)^{−1} = 1 + ρB + ρ²B² + ρ³B³ + . . .. Using this gives you

Y_t = (1 − ρB)^{−1} a_t    (4.52)
    = (1 + ρB + ρ²B² + ρ³B³ + . . .)a_t    (4.53)
    = a_t + ρa_{t−1} + ρ²a_{t−2} + . . .    (4.54)
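This expansion can be checked numerically; a hedged sketch with made-up shock values: the AR(1) recursion and the ρ-weighted sum of shocks in Eq. (4.54) produce identical series.

```python
# Hedged numerical check of the MA(infinity) expansion in Eq. (4.54):
# an AR(1) recursion and the sum Y_t = a_t + rho*a_{t-1} + rho^2*a_{t-2} + ...
# give the same series. The shock values are made up for illustration.

rho = 0.6
shocks = [1.0, -0.5, 0.25, 0.8, -0.3]

y = []                       # AR(1) recursion with Y before the sample = 0
prev = 0.0
for a in shocks:
    prev = rho * prev + a
    y.append(prev)

# MA representation: each Y_t is a rho-weighted sum of current and past shocks.
y_ma = [sum(rho ** j * shocks[t - j] for j in range(t + 1)) for t in range(len(shocks))]

print(all(abs(u - v) < 1e-12 for u, v in zip(y, y_ma)))
```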
Truncating the infinite sum gives the more practical result I showed above.

Autoregressive Moving Average of Order p and q (ARMA(p, q)) This model is a combination of the AR(p) and MA(q) models and has the form Y_t = φ_1 Y_{t−1} + . . . + φ_p Y_{t−p} + a_t + θ_1 a_{t−1} + . . . + θ_q a_{t−q}.

Autoregressive Integrated Moving Average of Order p, d, q (ARIMA(p, d, q)) This model is similar to the ARMA(p, q) but with the added component of integration (the I). It is applicable when the time series is nonstationary, which can normally be “fixed” by taking the first difference of the series. The inverse of differencing is integrating, which puts the differenced series back on the original (undifferenced) scale. I show an example in Fig. 4.20 of the first differencing
Fig. 4.20 This illustrates the effect of first differencing a nonstationary time series. Panel (a) is a repeat of the series in Fig. 4.18, Panel (b). Panel (b) of this figure is the first difference of the series in Panel (a). Notice how this resembles the “most desirable” graph in Fig. 4.18, Panel (a). (a) Nonstationary. (b) First Differenced
of the series I previously showed in Fig. 4.18, Panel (b). Notice how this first differenced series looks like the stationary series in Fig. 4.18, Panel (a).

Seasonal Autoregressive Integrated Moving Average of Order p, d, q, s (SARIMA(p, d, q, s)) This is a further extension of an ARIMA(p, d, q) model to account for seasonality in the data. This is, of course, applicable only for data measured at a subannual level (e.g., weekly, monthly, quarterly). This model formulation is far more complex. See Box et al. (1994) and especially Wei (2006) for discussions of these models and other variations.
4.8 Autoregressive Distributed Lag Models

The multiple regression framework is very powerful and flexible. It allows you to incorporate several different types of independent variables. The general model is called an autoregressive distributed lag model (ARDL), which is a composite of the models I described above. The autoregressive component is an AR(p) model for the dependent variable, but, of course, it is included on the right-hand side. The distributed lag component is the lag structure of the independent variables, i.e., the Xs. The model can also contain a component for a time trend and seasonality if needed. I
will not discuss seasonality in this book because the topic is quite complicated and is worthy of a book unto itself. The general model structure is13

$$Y_t = \alpha + \underbrace{\sum_{i=1}^{k} \delta_i t^i}_{\text{Trend}} + \underbrace{\sum_{i=0}^{s-1} \gamma_i S_i}_{\text{Seasonal}} + \underbrace{\sum_{p=1}^{P} \phi_p Y_{t-p}}_{\text{Autoregressive}} + \underbrace{\sum_{k=1}^{M} \sum_{j=0}^{Q_k} \beta_{k,j} X_{k,t-j}}_{\text{Distributed Lag}} + \underbrace{Z_t}_{\text{Fixed}} + \epsilon_t \qquad (4.55)$$

where:
• $\alpha$ is the constant;
• $S_i$ are seasonal dummies;
• $X_{k,t-j}$ are the exogenous regressors;
• $Z_t$ are other fixed regressors that are not part of the distributed lag specification; and
• $\epsilon_t$ is the white noise disturbance.
This provides a richer array of model possibilities that allow you to extract more Rich Information from your data. This is an advantage. But there is also a cost. The cost is the added complexity introduced by the trend and the lag structure of the dependent and exogenous regressors. The more possibilities you have (i.e., the more choices), the more time you have to spend experimenting with the combinations in search of the optimal setting for the lags. In addition, having more models to compare could add a level of confusion and perhaps lead to inaction simply because you are befuddled. This is an example of the Paradox of Choice: the more choice objects you have, the higher the probability that nothing will be chosen; you just cannot choose. See Schwartz (2004) for an interesting discussion of the paradox.

I show an example ARDL setup in Fig. 4.21. The base model is

$$\begin{aligned} \ln(domCarSales_t) = \beta_0 &+ \phi_1 \ln(domCarSales_{t-1}) \\ &+ \beta_1 \ln(realPriceNewCars_t) \\ &+ \beta_2 \ln(realPriceUsedCars_{t-1}) \\ &+ \beta_3\, smoothRealInterestAnn_{t-1} \\ &+ \beta_4 \ln(autoInventory_{t-1}) \\ &+ \beta_5\, Dummy_t + \beta_6\, Recession_t \end{aligned} \qquad (4.56)$$
13 This specification is direct, with minor modification, from Python's statsmodels documentation. See https://www.statsmodels.org/stable/examples/notebooks/generated/autoregressive_distributed_lag.html. Last accessed August 4, 2022.
Fig. 4.21 This is an example of a general setup for an ARDL model. The example is for domestic new car sales. Notice that a formula is not used in this setup. Instead, the exogenous and endogenous variables are explicitly specified. In addition, the lag structure is specified for each. I used a Python dictionary for the endogenous variable lag specifications. I show the estimation results in Fig. 4.22
I used the (natural) log transformation because the estimated coefficients are interpreted as elasticities. See Paczkowski (2018) for an explanation of this transformation. In this setup, I do not use a formula to define the model as I did in the OLS examples. Instead, the setup requires an explicit specification of the endogenous (i.e., Y) variable and the exogenous (i.e., X) variables, so I created these upfront. The lag structure for both types of variables must also be specified. You need one for the autoregressive part of the model; this is the set of lags for the lagged dependent variable. In this example, I specified a one-period lag as a list (i.e., [1]), so the autoregressive component is $Y_{t-1}$. The lag structure for the exogenous variables is specified as a dictionary with one key:value pair for each exogenous variable. I then instantiated the model, fitted it, and displayed the results as before. I show the fitted model in Fig. 4.22. The lag structure is indicated in Fig. 4.22 as L0 ≡ Contemporaneous, L1 ≡ One Period Lag, and so on.
Fig. 4.22 These are the regression results for the domestic new car sales model that I showed in Fig. 4.21
The summary statistics in Fig. 4.22, except for one, are familiar from other regression results as detailed in Paczkowski (2022b). The new one is the Hannan-Quinn Information Criterion (HQIC) defined as

$$HQIC = -2\,llf + 2 \ln(\ln(nobs))(1 + df_{model}) \qquad (4.57)$$

where llf is the log-likelihood function value, nobs is the number of observations, and $df_{model}$ is the degrees of freedom for the model. For the results in Fig. 4.22:
• llf = 297.747;
• nobs = 201; and
• $df_{model} = 10$, the number of parameters estimated.
This is used as an alternative to the AIC and BIC to select a model. See the Wikipedia article on HQIC for some comments about its use.14
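The HQIC arithmetic in (4.57) is easy to verify. The following snippet plugs in the values reported above (llf = 297.747, nobs = 201, df_model = 10); only the formula itself is being illustrated.

```python
# Direct computation of the HQIC in (4.57).
import math

def hqic(llf: float, nobs: int, df_model: int) -> float:
    """HQIC = -2*llf + 2*ln(ln(nobs))*(1 + df_model)."""
    return -2.0 * llf + 2.0 * math.log(math.log(nobs)) * (1 + df_model)

print(round(hqic(297.747, 201, 10), 3))   # -558.791
```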
4.8.1 Short-Run and Long-Run Effects

There is an advantage to using an autoregressive distributed lag model: you can estimate the long-run effects of a change in an explanatory variable. This is a connection to Fig. 2.7. To see how this is done, consider a simple model with a one-period lagged dependent variable and a contemporaneous and one-period lagged independent variable:

$$Y_t = \alpha + \phi Y_{t-1} + \beta_1 X_t + \beta_2 X_{t-1} + \epsilon_t \qquad (4.58)$$
where the disturbance term is white noise. Consider a change in the X variable. There are two possibilities:

1. X could temporarily change in period t and then return to its original value in period t + 1; or
2. X could permanently change in period t.

I will only consider the first change. Assume that X changes only once in period t and then returns to its previous value in period t + 1. When it first changes, it has an immediate effect on Y in that period; this is what the model says will happen. The magnitude of the effect is given by the parameter $\beta_1$:

$$\frac{\partial Y_t}{\partial X_t} = \beta_1 \qquad (4.59)$$
Now consider period t + 1. The X returns to its past value but the effect of the change in period t lingers on. You can see this by advancing the model in (4.58) one period:

$$Y_{t+1} = \alpha + \phi Y_t + \beta_1 X_{t+1} + \beta_2 X_t + \epsilon_{t+1} \qquad (4.60)$$
The effect of the change in X in period t is $\beta_2$. However, Y changed in period t and this change has an additional effect in t + 1 in the amount of $\phi$, adjusted by the amount Y changed in t, which is $\beta_1$. The total effect in t + 1 is $\phi \beta_1 + \beta_2$. You can see this from
14 See https://en.wikipedia.org/wiki/Hannan%E2%80%93Quinn_information_criterion. Last accessed August 17, 2022.
$$\frac{\partial Y_{t+1}}{\partial X_t} = \phi \frac{\partial Y_t}{\partial X_t} + \beta_2 \qquad (4.61)$$

$$= \phi \beta_1 + \beta_2 \qquad (4.62)$$
Now progress to period t + 2. The model is

$$Y_{t+2} = \alpha + \phi Y_{t+1} + \beta_1 X_{t+2} + \beta_2 X_{t+1} + \epsilon_{t+2} \qquad (4.63)$$
and the total effect is

$$\frac{\partial Y_{t+2}}{\partial X_t} = \phi \frac{\partial Y_{t+1}}{\partial X_t} \qquad (4.64)$$

$$= \phi(\phi \beta_1 + \beta_2) \qquad (4.65)$$
You can continue this as long as you want. But logically, there should be a restriction on how the effects develop in time. The restriction is that the effect of a temporary change in period t should dampen out and eventually go to zero. Otherwise, the one-time change will eventually cause an explosion in the Y series: it will continually get bigger. The restriction is $|\phi| < 1$. You can see this if you rewrite the model, (4.58), using the backshift operator. The model is then

$$Y_t = \alpha + \phi B Y_t + \beta_1 X_t + \beta_2 B X_t + \epsilon_t \qquad (4.66)$$

Or, collecting and rearranging terms,

$$Y_t = \frac{\alpha}{1 - \phi} + \frac{\beta_1 + \beta_2 B}{1 - \phi B} X_t \qquad (4.67)$$
Then,

$$\frac{\partial Y_t}{\partial X_t} = \frac{\beta_1 + \beta_2}{1 - \phi} \qquad (4.68)$$

You must have $|\phi| < 1$.
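The sequence of effects just derived can be traced numerically. The parameter values below are arbitrary illustrations, not estimates, chosen only to show the effects dying out when the autoregressive coefficient is inside the unit interval.

```python
# Tracing the effect of a one-time unit change in X at period t for
# Y_t = a + phi*Y_{t-1} + b1*X_t + b2*X_{t-1} + e_t, with |phi| < 1.
phi, b1, b2 = 0.5, 2.0, 1.0

effects = [b1]                          # period t:   beta_1,      per (4.59)
effects.append(phi * effects[-1] + b2)  # period t+1: phi*b1 + b2, per (4.62)
for _ in range(20):                     # later periods: multiply by phi, per (4.65)
    effects.append(phi * effects[-1])

print(effects[:4])       # [2.0, 2.0, 1.0, 0.5] -- dying out since |phi| < 1

# Long-run effect of a permanent unit change, per (4.68):
print((b1 + b2) / (1 - phi))   # 6.0
```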
4.9 Appendix

4.9.1 Chow Test Functions

I used a Chow Test in this chapter. The functions are shown in Fig. 4.23.
Fig. 4.23 These are the functions to run a Chow Test
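The code in Fig. 4.23 is not reproduced in this text. As a substitute illustration, here is a minimal, generic Chow test in the standard F-test form, assuming NumPy and SciPy; it is my sketch, not the author's functions.

```python
# A generic Chow test for a structural break at a known index:
# F = [(SSR_pooled - SSR_1 - SSR_2)/k] / [(SSR_1 + SSR_2)/(n - 2k)].
import numpy as np
from scipy import stats

def chow_test(y, X, break_idx):
    """Return the Chow F statistic and p-value for a break at break_idx."""
    def ssr(y_, X_):
        beta, *_ = np.linalg.lstsq(X_, y_, rcond=None)
        resid = y_ - X_ @ beta
        return resid @ resid

    n, k = X.shape
    ssr_pooled = ssr(y, X)
    ssr_1 = ssr(y[:break_idx], X[:break_idx])
    ssr_2 = ssr(y[break_idx:], X[break_idx:])
    F = ((ssr_pooled - ssr_1 - ssr_2) / k) / ((ssr_1 + ssr_2) / (n - 2 * k))
    p_value = stats.f.sf(F, k, n - 2 * k)
    return F, p_value

# Simulated example: the intercept jumps at observation 50.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)
y[50:] += 5.0                      # structural break
F, p = chow_test(y, X, 50)
print(p < 0.05)                    # the break is detected
```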
Chapter 5
Information Extraction: Non-Time Series Methods
I focused on time series data for predicting an outcome in Chaps. 3 and 4. This is sensible for two reasons:

1. Most business data are time series. For example,

• Sales;
• Prices;
• Revenue;
• Order backlog per month
are just a few that require a forecast. The focus of the forecast, the level of detail, and the audience (C-Level or line manager), given their respective scale-views, vary even for the same business quantity. For example, the future trend of sales is of interest at all levels, all scale-views, but the level of detail will vary. The CEO and her C-Level Team may just want to know the total sales volume by product category for the enterprise. This would gloss over individual items and even component parts sold to support a product line. A printer manufacturer, for example, would sell not only several series of printers but also ink cartridges and spare or replacement parts (e.g., power cords and printer heads). The CEO may want to know just the predicted sales of printers and parts, but no detail on individual printers and certainly not every part. The product manager for a specific printer line, however, with a narrow scale-view for that line, would want to know the predicted sales for the individual printer and also each part. The manufacturing division manager would also want to know this level of detail so that he could order the production of parts if necessary. But the information is not just at one point in time. It is the trend in these series that matters. Sales trending downward for several quarters is more important to know than sales at one point in time. A trend requires action: parts ordering (an operational view); pricing, advertising, and promotion, or a redesign of the product itself (a tactical view); or product line termination (a strategic view).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 W. R. Paczkowski, Predictive and Simulation Analytics, https://doi.org/10.1007/978-3-031-31887-0_5
2. Decisions are for the future (i.e., time-oriented) with lingering, and perhaps growing, effects as time passes. This is often overlooked. Decisions do not have an effect at only one point in time. Not only do decisions have implications and ramifications for other units of a business, the complex system, as I noted in Chap. 2, but they also impact the business in time. Consequently, decision makers at all scale-view levels need to know about key decision factors and KPMs in time.

They could also ask or want to know about these KPMs across spatial dimensions of the business, where a spatial dimension does not have to refer to geography (i.e., space). I use the word "spatial" to refer to non-time-based measures and concepts. This is also referred to as cross-sectional data in econometrics. I will use the two terms interchangeably. For example, spatial would encompass geographic markets, but also customer segments. But even for these spatial dimensions, managers still need to know how they change or will be affected over time; there will still be a time dimension. The data even for the spatial dimensions of a business are still, ultimately, time-oriented. Consequently, the data are what econometricians refer to as panel data: a combination of time series and cross-sectional data. See Paczkowski (2022b) for a discussion of panel data in terms of a Data Cube representation of data.

Since the data are ultimately time series, the information for decisions is time-based. There is information decision makers use, however, that is cross-sectionally based and not time-based. Cross-sectional data are at one point in time across units, say customers, their preferences, and their satisfaction with services and products. Survey data is the main example. See Paczkowski (2022c) for the importance of survey data and how they can be analyzed.
This type of data provides information, after the appropriate extraction, on a wide array of issues that have to be addressed, whether the people and internal organizations have a narrow scale-view or a wide one. The salient point about these data is that they do not have a time dimension. The data are for one point in time. A survey could be repeated, perhaps each quarter or annually, to provide a time series. Such surveys are referred to as tracking studies. The majority of surveys, however, are not done for tracking. A major reason for this is the high cost of conducting them. A common mistake when a decision maker commissions a survey is to use the survey's results in isolation from other data analysis methods, i.e., time series data and models. This is a mistake because the insights from surveys can provide additional input into a simulation of a business process. This also depends on the scale-view of the decision maker and, therefore, the scale-view of the simulation. A simulation for a product manager with a narrow scale-view will differ from that of a CEO. Survey results pertinent to the product manager will not be relevant for the CEO. As an example, a product manager with a narrow scale-view might be interested in his customers' purchase intent for a product. Will they buy it? The answer can only come from survey data and analysis. The issue is how the data are collected and analyzed to yield Rich Information. As another example, a survey of shareholders
and their assessment of a merger or acquisition could be used in a merger simulation, which would be of interest to the CEO (and the BOD) but not a product manager. I will cover different types of surveys in the following section.
5.1 Types of Surveys

There is no limit to the use of surveys, not only in the business domain (my focus in this book) but also in the public domain. In the business domain, they can be used at multiple scale-views to aid the decision makers in those views. As an example, for a product manager with a narrow scale-view, surveys allow them to collect data on:

• customer satisfaction;
• customer attitudes, interests, and opinions (AIO);
• product ratings

and many others too numerous to list. See Paczkowski (2022c) for examples. This list is for research topics in the market research function, that part of a business complex system responsible for collecting, tracking, analyzing, and reporting on market conditions; customer behaviors, compositions, and changes; and the company's position in the market(s) it serves, to list a few. This function is probably the most important and dominant developer and maintainer of surveys in a business. Market research is both a functional area in a business and a set of analytical methodologies, so the term "market research" can become confusing; the context matters. This is analogous to the CFO, a designation in most businesses for a functional area (the Chief Financial Organization) and for the executive, C-Level officer in charge of and responsible for all financial matters: the Chief Financial Officer. As a functional area, the market research organization (MRO), typically headed by an executive at the VP level, is responsible for all research dealing with the markets the business is currently in as well as those it may enter. These responsibilities cover operational, tactical, and strategic issues. As a set of methodologies, classical statistics and multivariate data analysis are the primary tools. See, for example, Iacobucci and Churchill (2015) for a detailed discussion of market research methodologies.
The items on my partial list above allow market research analysts to segment the market, propose pricing strategies, advise the R&D department about potential new product ideas based on customer input, measure advertising campaign effectiveness, and much more. Consequently, the MRO has a major impact on and is a major influencer in a business. It is one of the many parts of the complex business system because it interacts with and feeds (i.e., supplies information to) many of the other parts, not to overlook a direct interaction with the decision makers. I illustrate these interconnections in Fig. 5.1. The function of the MRO is not limited to a narrow scale-view. It also has direct input into the strategic function of a business. This is the level at which organization
Fig. 5.1 This illustrates a possible set of interconnections between a market research organization and other major departments and functions in a major corporation. This is a narrow scale-view of an organization
Fig. 5.2 This is a wide scale-view of the same market research organization in Fig. 5.1
structure is handled, decisions about new market entry are made, shareholder values and issues are dealt with, and many more that are outside the purview of line managers. I illustrate these interconnections in Fig. 5.2.
In both functions, surveys are a key data-collection tool. I will focus on the tactical scale-view for a product management function. There are two types of data (for my purpose) collected by surveys at this level: data regarding customer choice of a product to buy and data on purchase intentions for a product. Customer product choice studies are often referred to as discrete choice studies because the products are discrete units; it is Product A, Product B, and so on. There is no such thing as half a product: there is no such thing as half a car. Even when you pay by some continuous measure, say weight, you are still buying the whole product; the product per se is not divisible. The amount purchased is a separate issue. A discrete choice study is concerned with the choice of one discrete object from a set of objects. An example discrete choice question is: “Which of these two products, A and B, will you buy under the following conditions?” Purchase intent studies are concerned with the likelihood that a customer will buy a product. The likelihood in this context is sometimes loosely interpreted as the probability of purchasing the product. The likelihoods are also interpreted as chances: “nine out of ten chances to buy,” “eight out of ten chances to buy,” and so on. This is not a choice question. An example purchase intent survey question is: “How likely are you to buy this product if its price is $X?”
5.2 Discrete Choice Analysis

Discrete choice analysis is very popular in the market research domain. The objective is to identify customers' preference for one product over another where the products are described by attributes (such as price, weight, and size), each at discrete levels. In the choice literature, which deals with more general cases than just products, the items to choose from are called alternatives. Combinations of the attributes define or describe a product or one alternative. The attributes typically have discrete levels and the number of levels is minimal (but at least two), just enough to reflect realistic possibilities for the products. For example, price might have three levels: $1.25, $1.75, and $2.00. A fourth level of $100.25 is unrealistic given the other three and thus should not be included. An example question might be: "Which of these three products, or none of them, would you buy based on $ATT_1, ATT_2, \ldots, ATT_n$?" where there are n attributes ($ATT_i, i = 1, 2, \ldots, n$) describing each product. Notice that a "None" alternative, also referred to as the "no choice" alternative, is included. This option is usually used because consumers, in real-life situations, do walk away from buying if they decide that the products they see are not what they want. Excluding a "None" option forces a choice that may be unrealistic. Consumers are presented with each product, including a "None" option, not one at a time but in S sets called choice sets. A set s, $s = 1, 2, \ldots, S$, is defined by combinations of the attribute levels. The consumers are asked to evaluate the products in each choice set and to choose the one they would buy, or None of them. This is called a choice experiment with S choice sets.
Since there are usually several attributes, each at several (discrete) levels, it is possible that the number of combinations can be very large. This implies that consumers could be presented with a very large number of choice sets. For example, if there are four attributes, each at three levels, then one product could be defined in $3^4 = 81$ ways. But this is one product! If there are three products, there is a total of 243 ($= 81 \times 3$) possible arrangements. This is far more than any consumer could tolerate. Consequently, experimental design principles are used to create a subset of all possible combinations. The subset allows you to efficiently estimate the parameters of a model. The subset is called a design and consists of several choice sets. Not all designs are optimal in a statistical sense, but software exists that implements methods to create an optimal design. Some design concepts are reviewed in Paczkowski (2018) and Paczkowski (2022c). Also see Cochrane and Cox (1957), Hinkelmann and Kempthorne (1994), and Box et al. (1978) for classical experimental design concepts and methods. Finally, see Kuhfeld (2008) and especially Louviere et al. (2000) for discrete choice designs.

Once a choice design is created, the S choice sets are presented one at a time to each consumer. For each presentation, the consumer selects the product in the set they believe would give them the most satisfaction if they actually bought it. In economic terms, they would select the product that maximizes their utility. Let $U_{ijs}$ be the utility consumer i receives from product j in choice set s. Product j is chosen over another product, say k, in the same choice set if

$$U_{ijs} > U_{iks} \qquad (5.1)$$
But this is for consumer i. You can assume there is an average market utility for each product, averaged over all consumers in the market. I will refer to the market as the addressable market because it is the one applicable to the product and, therefore, the one the business must address with its product. This average utility is called the systematic utility, represented as $V_j$ for product j. Consumer i's utility differs from this average by some random factor, $\epsilon_{ijs}$, the randomness reflecting the consumer's idiosyncrasies. These are unobservable by you, the analyst, but are known (perhaps) to the consumer. The individual's utility for product j is then $U_{ijs} = V_j + \epsilon_{ijs}$. Because of this random component added to the utility, the resulting model is referred to as a Random Utility Model (RUM). Since $V_j$ is an average, you can assume that it is a linear function of the attributes of the product and write this quantity as $V_j = X_j \beta$. The $\beta$ terms are weights, called part-worths, because they indicate, for each attribute, the weight or value or level of importance (all interchangeable terms) of each attribute in the addressable market. These weights are comparable to those in an OLS model so, therefore, they are unknown parameters to be estimated from data. In my formulation, these weights are constant for each consumer in the addressable market. This may not be so, in which case the weights vary by consumer or groups of consumers. This constancy is a condition.
The data come from the choice study survey. Once the data are collected and processed, a model is estimated. Estimation is based on individual-level data from the sample. The model reflects the discrete nature of the choice exercise: to select one, and only one, product from each choice set. This means that the dependent or target variable is not continuous but discrete, measured as a series of Yes and No responses for each product in a choice set and for each choice set presented to the consumers. For my example above of three products, the target has four possible responses: three products plus the None option, with three No answers and only one Yes answer. These are coded as 0 for No and 1 for Yes. So, the target is a dummy variable. The model estimated is not for the 0/1 values, but for the probability of seeing a 0 or 1 target value. If $X$ is the vector of attributes, then the probability of product j, $j = 1, 2, \ldots, J$, being selected from a choice set is

$$Pr_i(j) = \frac{e^{V_j}}{\sum_{k=1}^{J} e^{V_k}} \qquad (5.2)$$
where $V_j = X_j \beta$ is the systematic utility. This results from the use of a particular probability distribution assumption for the random disturbance term, $\epsilon_{ij}$, added to the systematic utilities. This random factor and how it results in (5.2) is beyond the scope of this book. See Paczkowski (2018), Paczkowski (2022b), and Train (2009) for discussions of this model. Also, see McFadden (1974) for the original development of this model in terms of utility maximization in econometrics. It is easy to show that the sum of the probabilities for all the products in a choice set is 1.0. Since the denominator in (5.2) is the same for all the choice probabilities for consumer i, you can write

$$\sum_{j=1}^{J_s} Pr_i(j) = \sum_{j=1}^{J_s} \frac{e^{V_j}}{\sum_{k=1}^{J_s} e^{V_k}} \qquad (5.3)$$

$$= \frac{1}{\sum_{k=1}^{J_s} e^{V_k}} \sum_{j=1}^{J_s} e^{V_j} \qquad (5.4)$$

$$= 1.0 \qquad (5.5)$$
for choice set s, $s = 1, 2, \ldots, S$. If the "None" option is included in the choice sets, then (5.2) is modified as

$$Pr_i(j) = \frac{e^{V_j}}{1 + \sum_{k=1}^{J_s} e^{V_k}} \qquad (5.6)$$

where the "1" in the denominator is $e^0$ for no attributes: "None" has no attributes.
Equation (5.2) has several interpretations. One is a take rate: the proportion of the market that will "take" or buy product $j, j = 1, 2, \ldots, J$, rather than one of the other products. Other interpretations are market share, share of preference, and share of wallet. I prefer the take rate interpretation and this is the one I use. This model has several properties. The first is the Equivalent Differences Property, which states that any constant, say $\delta$, added to the systematic utility of each product automatically cancels for all products and, therefore, has no effect. To see this, consider the basic choice model for a problem without the "None" option, (5.2). Notice that

$$Pr(j) = \frac{e^{V_j + \delta}}{\sum_{k=1}^{J_s} e^{V_k + \delta}} \qquad (5.7)$$

$$= \frac{e^{\delta} e^{V_j}}{e^{\delta} \sum_{k=1}^{J_s} e^{V_k}} \qquad (5.8)$$

$$= \frac{e^{V_j}}{\sum_{k=1}^{J_s} e^{V_k}} \qquad (5.9)$$
The last line is just (5.2). This property may seem like a restriction on the usefulness of the model. Income, employment status, and household income are just three consumer features that are examples of constants for the alternative products, although they could certainly vary by consumers. Yet, these factors may be important for explaining product selection. They can still be incorporated by the clever use of dummy variables. In this case, the constant added to an alternative is called an alternative-specific constant (ASC). See Paczkowski (2016) for a discussion of ASCs. Incidentally, the systematic utility cannot contain a constant term as in a linear model since it, too, cancels from the numerator and denominator. The second property is more insidious. This is the Independence of Irrelevant Alternatives (IIA) property. It says that the ratio of the probabilities for any two alternatives does not depend on any other alternative in the choice set. This ratio depends only on the difference in the systematic utilities of the two alternatives in the ratio. That is, $Pr(j)/Pr(k) = e^{V_j}/e^{V_k} = e^{V_j - V_k}$. All other alternatives in the choice set are irrelevant. This is problematic because there could be a third alternative that is indistinguishable for all practical purposes from the two forming the ratio. The probabilities will be distorted by this third alternative when they should not be, thus resulting in incorrect conclusions. This is summarized by the famous "red bus-blue bus" problem in which the probability ratio of a red bus to a car is distorted by the introduction into the choice set of a blue bus. The two buses only differ by color, which should be inconsequential. The distorted probabilities have implications for cars vs. buses in this classic example. There are tests for the IIA property and ways to deal with it.
See Paczkowski (2016, Chapter 6), Paczkowski (2018, Chapter 5), and Train (2009, Chapter 3) for discussions of this property.
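The IIA issue can be made concrete with a few lines of arithmetic. This is the classic red bus-blue bus setup with made-up, equal utilities.

```python
# Numeric sketch of the "red bus-blue bus" problem under the logit model.
import math

def logit_shares(V):
    denom = sum(math.exp(v) for v in V)
    return [math.exp(v) / denom for v in V]

# Car and red bus with equal utility: each gets a 1/2 share.
car, red_bus = logit_shares([1.0, 1.0])
print(round(car, 3))            # 0.5

# Add a blue bus identical to the red bus. IIA keeps the car/red-bus
# ratio at 1, so the car share drops to 1/3 -- intuitively it should
# stay near 1/2, with the two buses splitting the other half.
car2, red2, blue2 = logit_shares([1.0, 1.0, 1.0])
print(round(car2, 3))           # 0.333
print(round(car2 / red2, 3))    # 1.0 -- the ratio is unchanged
```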
5.2.1 Discrete Choice Model Extensions

The model in (5.2) can be modified to handle numerous special cases. One is the nested logit case. This is applicable when there is a correlation among the alternatives so that they are not independent. They are substitutes. The alternatives, however, can be grouped so that the choice problem can be viewed as first selecting a group and then selecting a specific product within that group. An example is selecting a restaurant for dinner. Restaurants have different cuisines, so a choice problem can be viewed as first selecting a cuisine and then a specific restaurant conditioned on that cuisine. The cuisines form the groups. The specific restaurants are nested in (or under) the cuisines. In terms of a probability statement, you have

$$Pr(Restaurant \;\&\; Cuisine) = Pr(Restaurant \mid Cuisine) \times Pr(Cuisine)$$
In this example, it makes logical sense (at least to me) to first select the cuisine. I illustrate this choice in Fig. 5.3. In many other applications, however, it is not clear which choice is made first. For example, Hensher et al. (2005) describes an example of choosing to rent or buy a living space and in what neighborhood. The nested problem could be framed as choosing to rent or buy and then choosing a neighborhood, or choosing a neighborhood and then whether to rent or buy. I illustrate these two choice possibilities in Fig. 5.4. The choice problem is more complex: Which choice comes first? See Hensher et al. (2005) for strategies for nested modeling.
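The decomposition above can be checked with a tiny numeric example; the probabilities below are illustrative numbers only, not from any study.

```python
# Pr(Restaurant & Cuisine) = Pr(Restaurant | Cuisine) * Pr(Cuisine)
pr_cuisine = {"A": 0.6, "B": 0.4}
pr_rest_given = {
    "A": {"A1": 0.7, "A2": 0.3},
    "B": {"B1": 0.5, "B2": 0.5},
}

# Joint probability of each (cuisine, restaurant) pair.
pr_joint = {
    (c, r): pr_rest_given[c][r] * pr_cuisine[c]
    for c in pr_cuisine
    for r in pr_rest_given[c]
}
print(round(pr_joint[("A", "A1")], 2))              # 0.42
assert abs(sum(pr_joint.values()) - 1.0) < 1e-12    # joint probabilities sum to 1
```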
Fig. 5.3 This is a simple example of a nested choice problem for choosing where to have dinner
Fig. 5.4 This is a complex example of a nested choice problem for choosing where to live. Which choice is made first? (a) Choose neighborhood first. (b) Choose rent or buy first
5.2.2 Types of Discrete Choice Studies

There are two basic types of discrete choice studies:

• Stated Preference; and
• Revealed Preference.

A stated preference choice study (SP) is survey-based. Survey respondents are presented with artificial products grouped or organized into choice sets. Each set represents a combination of products. The sets are presented in a randomized order to each respondent. A respondent is then asked to select the one product from a set they would buy (or none of them if a "None" option is available). In this case, the respondents state their preference for the products. The benefit of SP is that artificial, proposed products can be tested before development costs are incurred, costs which can be very high, only to discover post-product-launch that demand for the product is very low. I am not saying that choice
modeling before product launch will guarantee a successful product; just that it will help minimize the chance of introducing an unwanted product. Paczkowski (2020a, Chapter 4) discusses some reasons for the failures of new products and how choice modeling can be used to minimize these failures. The disadvantage of SP is the study design itself and the cost of conducting a sufficient number of surveys to collect the right amount of data.

A revealed preference choice study (RP) is not survey-based. Consumers are observed in an actual market situation and their choice in a product category is observed. The attributes and features of the products are known before the study is conducted. The early applications of RP were in transportation mode analysis. Travelers (e.g., people commuting to and from work each day) were observed in their selection of a bus, light-rail train, personal car, or carpool for their commute. Their preference for a commuting mode was revealed, not stated. The benefit of RP is that you have data on people's actual purchases. Sometimes you can combine these data with personal interviews to learn more about why a choice option was selected rather than any of the other options. For example, in a travel mode study, why was a train selected? The disadvantage of RP is the complexity of data collection. Although these two approaches were developed separately, albeit with the common core of utility maximization and the choice model in (5.2), they have been successfully combined into an SP-RP framework that combines the positive aspects of both types of studies. The drawback to this approach is, of course, the added complexity of a study and the cost associated with overall data collection. I am only concerned with stated preference choice problems and models. See Louviere et al. (2000) for a thorough treatment of SP modeling.
5.2.3 Discrete Choice Experimental Designs

Designing a discrete choice study is not simple. There is a large literature on experimental design concepts for choice models. See, for example, Hensher et al. (2005), Louviere et al. (2000), Raghavarao et al. (2011), and Kuhfeld (2008). The basic problem for these designs is the large number of choice sets that have to be developed, not to mention the conditions that have to be placed on the designs to allow for real-world product and choice situations. These include the number of products that may be in a choice set, product availability (some product definitions may not be available or are infeasible due to unrealistic factor combinations), and the presence of fixed factors for some products and not others, to mention just a few. Louviere et al. (2000) provides some examples and how to deal with them. These conditions are not present in the industrial or agricultural design problems where the fundamental design concepts were initially developed, although those problems are also not trivial. There is an extensive literature on these types of designs. The classic and highly readable book by Box et al. (1978) should certainly be studied before attempting the books on choice designs.
5 Information Extraction: Non-Time Series Methods
5.2.4 Discrete Choice Estimation Estimation of the parameters of a choice model such as (5.6) is more complicated than for an OLS model. OLS estimation is straightforward and easily programmed. It only involves basic arithmetic operations.1 The linearity of an OLS model is the reason for the simpler estimation equations. Model (5.6), however, is nonlinear, which complicates estimation. McFadden (1974) showed that maximum likelihood methods can be used for estimation. Maximum likelihood methods are another way to estimate the unknown parameters of a model, either linear or nonlinear. Usually, just the OLS method is taught in a basic statistics course, and for only one explanatory variable. This is also the way regression analysis is taught in an introductory econometrics course, even when multiple explanatory variables are introduced. When there are multiple variables, matrix algebra is needed to efficiently handle all the variables at once. Matrix algebra requires more advanced mathematics training which is usually beyond the requirements for a basic statistics or econometrics course. This algebra results in a set of equations that simplify to the ones for a single explanatory variable.2 The matrix algebra approach is a general case for a linear model estimation—but it is still OLS. For nonlinear model estimation, more advanced methods are needed. Maximum likelihood estimation is advanced. Maximum likelihood involves maximizing a likelihood function which is comparable to a probability density function (pdf ) for a continuous random variable or probability mass function (pmf ) for a discrete random variable. A pdf is interpreted as a function of the random variable conditioned on the parameters. The likelihood is interpreted as a function of the parameters conditioned on the observed data. Regardless of the interpretation, the functional form is the same. Nonetheless, the procedure is more advanced and requires more sophisticated mathematics and software to handle. 
See Cramer (1986) for the application of maximum likelihood methods in econometrics.
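To make the maximum likelihood idea concrete, here is a minimal sketch (not the book's code) that fits a binary logit model by Newton-Raphson with numpy; the simulated data and the data-generating coefficients (0.5 and −1.0) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
beta_true = np.array([0.5, -1.0])                      # hypothetical parameters
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))  # simulated 0/1 choices

beta = np.zeros(2)
for _ in range(25):                        # Newton-Raphson on the log-likelihood
    mu = 1 / (1 + np.exp(-X @ beta))       # fitted probabilities
    score = X.T @ (y - mu)                 # gradient of the log-likelihood
    hessian = -(X * (mu * (1 - mu))[:, None]).T @ X
    beta = beta - np.linalg.solve(hessian, score)
```

Packages such as statsmodels and choicemodels wrap exactly this kind of iteration, with better safeguards and with standard errors.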
5.2.5 Discrete Choice Example In this fictitious example, I will illustrate how to design, estimate, and interpret a discrete choice model. The intent is not to develop this topic in detail because it is quite extensive. See Paczkowski (2018), Paczkowski (2022b), and Paczkowski
1. At least for a simple model with one independent variable such as time. I showed the estimation equations for a simple model with time in Chap. 3. See Sect. 3.9.
2. Some introductory econometrics textbooks eschew the use of matrix algebra, except for the most basic of notation. See, for example, Hill et al. (2008, Chapter 5), a leading textbook in this area, that does not use matrix algebra.
(2022c) for more detailed examples. Also, see Train (2009) and Hensher et al. (2005) for extensive coverage of the whole choice topic and examples. This is a marketing problem for the car mat company of Chap. 4. The product manager for the mats wants to know the preference or take rate for the company's mat versus a leading competitor. This is a tactical scale-view. The product manager, assisted by the competitive assessment group, selected four mat attributes for the study:
1. Price at two levels: High and Low;
2. Shape at two levels: Square and Rectangular;
3. Size at two levels: Large and Small; and
4. Trimmable: Yes and No.
The manager realizes that consumers may not like either alternative in a choice set, so she agrees to have a no-choice option included in the study. For this option, the values for price, shape, size, and trimmability are zero since "no choice" does not have any attributes; the values must be zero. For this example, I chose to develop a choice experimental design using R. My preference is to do all analytical work in Python; however, there are times when other software packages are needed. Python offers a wide suite of packages to handle many data analytic problems, but occasionally another package such as R has a capability that is either not in Python or is just superior. In this case, for an experimental design for a choice study, R has a package, idefix, that is capable of handling complex designs for choice models. These designs differ from standard DOE designs such as full and fractional factorials, Latin Squares, Balanced Incomplete Designs, and many others. The difference is due to the nature of the alternatives in a choice set and the relationships between these alternatives that must be maintained. As one example, two alternatives cannot appear in the same choice set where one of the two is clearly superior to the other. The superior one will always be chosen so that nothing is learned from that choice set. For instance, everything in the two alternatives could be identical except for, say, price: one is priced high, the other low. The low-priced alternative will always be chosen. A special class of design strategies and calculations is needed for these choice problems. The R package idefix handles these. Python's pyDOE2 package can handle many standard designs. As a procedural note, I chose to write the R script in an R editor (i.e., RStudio) rather than use the Python package rpy2, which is an interface to R from Python. You can download and install rpy2 using pip install rpy2.
You can also install R either through Anaconda or through the R project homepage: https://www.r-project.org/. I illustrate an R script in Fig. 5.5 for a choice design based on idefix.3 The idefix library is imported and some parameters are set.4 The parameters are:
3. You must first have R installed on your computer and idefix installed in R.
4. I also set a random number seed as set.seed(42). I explain the reason for the seed in Chap. 9.
Alternatives: Number of alternatives, +1 for a "no choice" option.
Choice Sets: Number of choice sets to create: a 1/2 fraction in this example.
Attribute Levels: Number of levels for each attribute.
Coding: Type of coding: E = Effects Coding, D = Dummy Coding.5
Prior Means: Arbitrary weights: no choice and attributes, in that order.

A candidate set of choice designs is created based on the number of alternatives and their levels. For this problem, there are four attributes, each at two levels. There are 2^4 = 16 combinations. These are the candidate sets. Unfortunately, 16 choice sets would be too onerous for a consumer; fatigue or frustration could cause a consumer to withdraw from the study. A fraction of the 16 is needed. For this problem, a one-half fraction was selected to yield 2^{4-1} = 8 combinations. These are rearranged to produce the two alternatives I want for this problem. Incidentally, I also want to include the "no choice" option, referred to as no choice in this R package, effectively adding a third option to each choice set. The resulting design has 24 rows: eight choice sets optimally arranged so that in each one none of the alternatives is superior to the others in the set. The final design is written to a text file (a csv-formatted file) for importing into a Pandas DataFrame. I also wrote the key parameters to a text file for later use in a Python script. Once I have the design, I import it into Pandas, along with the parameters. The design, however, is not the final product. As it is, it can be replicated in a questionnaire for survey respondents to access. In that questionnaire, there may be some introductory material about the product concepts and how they are supposed to answer the choice questions: review the product alternatives shown to them and then select the one they would buy, or neither if nothing is appealing. These responses are put into a Pandas DataFrame and merged with the choice design.
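To illustrate the fraction arithmetic (this is just the counting, not the idefix algorithm), a 2^4 full factorial can be cut to a regular one-half fraction by keeping the runs whose ±1-coded factor levels multiply to +1:

```python
from itertools import product

full = list(product([-1, 1], repeat=4))   # 2**4 = 16 candidate runs
# Regular half fraction via the defining relation ABCD = +1.
half = [run for run in full if run[0] * run[1] * run[2] * run[3] == 1]
print(len(full), len(half))               # 16 8
```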
However, there is more involved than a simple merger of two DataFrames before choice probabilities (i.e., take rates) can be estimated. The merged data have to be in a special form, called long form, that includes indicators for the choice sets, alternatives (including the None), and which of the alternatives was selected. A long-form arrangement has the data organized in many rows, but few columns. Each row is an alternative with the selected alternative indicated. This is in contrast to a wide-form organization in which the alternatives are all in one row. Most estimation packages use the long-form arrangement, while most survey tools provide data in wide-form because that is how they collect data from respondents. Since this study is a stated preference choice study, a survey is used, hence the wide-form that must be reorganized to long-form. I show a comparison of the two layouts in Fig. 5.6.
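The wide-to-long reshaping can be sketched in pandas; the column names and values here are made up for illustration:

```python
import pandas as pd

# Two respondents, two alternatives per choice set (made-up data).
wide = pd.DataFrame({
    "rid": [1, 2],
    "price_1": [10, 10], "price_2": [12, 12],   # one attribute per alternative
    "choice": [1, 2],                           # which alternative was picked
})

long = (pd.wide_to_long(wide, stubnames="price", i="rid", j="alternative", sep="_")
          .reset_index())
long["chosen"] = (long["alternative"] == long["choice"]).astype(int)
```

Each wide row becomes one row per alternative, with a 0/1 chosen indicator — the layout most estimation packages expect.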
5. See Paczkowski (2018) for a detailed discussion of effects and dummy coding.
Fig. 5.5 This is an R script for generating a choice experimental design. See Traets et al. (2020) for complete documentation for the script components
Fig. 5.6 This shows the two data layouts: the wide-form on the left and the long-form on the right
For this example, I prepared a Python script that I show in Fig. 5.7 that imports the choice design (along with the parameters) I created in R and creates indicators for:
• a respondent ID (rid) which is unique for each respondent;
• a choice set ID (choiceSet) which is 0–7 with each digit repeating three times for each respondent;
• an alternative ID (alternative) which is 1–3 and repeats eight times for each respondent; and
• an observation ID (oid) to uniquely identify each choice set and ranges from 0 to 3999 for 4000 choice sets.
It is important to emphasize that, even though the parameters are constant in (5.2) and (5.6), the choice probabilities are at the individual level where the "individual" is an observation in the sample. That individual could be a consumer (which is how I am using "individual"), a firm, or any other entity. The observation is a choice set. Since this is a fictitious problem, I had to randomly generate the choices. I will explain in Chap. 9 how random choices are generated. For now, let it suffice that I randomly generated their choices as an integer between 1 and 3 which I then compared to the alternative indicator. If the choice and the alternative numbers matched, then the choice indicator was set to 1; otherwise, it was set to zero. This is a choice dummy variable. For this problem with three alternatives per choice set, each choice set has a single "1" and two "0" values for this choice dummy variable. I show the Python code for this data management in Fig. 5.7 and the data in Fig. 5.8. I show only the first ten rows of the DataFrame to illustrate what the data look like. The entire DataFrame has 12,000 rows. This large number is due to the choice design replicated 500 times for 500 respondents: 8 choice sets of 3 alternatives each (including the None option) times 500, or 3 × 8 × 500 = 12,000. Once the data are arranged in a proper long-form format, the choice model can be estimated.
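A simplified stand-in for the indicator logic in Fig. 5.7, using numpy (the random picks here are for illustration only, not the book's values):

```python
import numpy as np

rng = np.random.default_rng(42)
n_resp, n_sets, n_alts = 500, 8, 3

rid = np.repeat(np.arange(n_resp), n_sets * n_alts)          # respondent ID
choice_set = np.tile(np.repeat(np.arange(n_sets), n_alts), n_resp)
alternative = np.tile(np.arange(1, n_alts + 1), n_resp * n_sets)
oid = np.repeat(np.arange(n_resp * n_sets), n_alts)          # one ID per choice set

picked = rng.integers(1, n_alts + 1, size=n_resp * n_sets)   # random pick per set
choice = (alternative == np.repeat(picked, n_alts)).astype(int)
```

Each choice set gets exactly one "1" in the choice dummy, and the full array has 3 × 8 × 500 = 12,000 rows.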
The Python package choicemodels handles the maximum likelihood estimation of discrete choice models such as (5.6) and its extensions such as a nested logit model. choicemodels can be installed on your computer using pip install choicemodels or, through Anaconda, conda install choicemodels --channel conda-forge. An alternative is pyLogit, which choicemodels relies on, so pyLogit must also be installed. pyLogit can be installed using pip install pylogit or conda install -c conda-forge pylogit. pyLogit is more extensive but more complicated to use than choicemodels. I chose to use choicemodels. I show the estimated choice model in Fig. 5.9. First, notice that I used the four steps for estimating a model:
1. Model formula statement;
2. Model instantiation;
3. Model fitting; and
4. Results displaying.
Fig. 5.7 This is a Python script to manage the discrete choice data to create the long-form data layout. The result is displayed in Fig. 5.8
The model formula is -1 + noChoice + A + B + C + D. The "-1" indicates that the constant is to be omitted. Normally, it is automatically included; it is a default. I noted that any constant variable cancels in a discrete choice model since it is the same in all the terms of the model, so it has no effect. However, the estimation procedure is general so it must be told to omit this constant. There is a constant, however, that must be included: the ASC for the "None" option. This is noChoice in the DataFrame and must be included. The remaining variables are clear. The instantiation uses the choicemodels' MultinomialLogit estimation routine which implements (5.6). This has arguments that are the DataFrame (df_responses)
Fig. 5.8 This is the discrete choice data that result from the Python script in Fig. 5.7. The data are in a long-form arranged DataFrame. The entire DataFrame has 12,000 rows. This DataFrame was then used in the choice estimation I show in Fig. 5.9
from Fig. 5.7, the observation ID which identifies each choice set individually, the alternative ID in each choice set, the choice dummy variable, and the model formula. A maximum likelihood method was used for estimation. Note that 4000 observations were used. This is the number of choice sets times the survey sample size of 500 respondents (4000 = 8 × 500). This is based on the oid variable. The log-likelihood function is maximized at a value of −3339.506. The value for this function if the coefficients are all zero is the LL-Null value of −4394.449. The pseudo-R² (referred to as McFadden's pseudo-R-squared) is calculated as 1 − (log-likelihood / null log-likelihood). For this problem, this is 1 − (−3339.506/−4394.449) = 0.240. This is comparable to the R² in linear regression. For this model, only 24% of the variation is explained, which is not very good. The pseudo-R²-bar, comparable to the adjusted R² in linear regression, is 1 − ((log-likelihood − #parameters) / null log-likelihood). This is 1 − ((−3339.506 − 5)/−4394.449) = 0.239. The AIC is a measure of the "badness-of-fit" and is calculated as −2 × log-likelihood + 2 × #parameters. The BIC is the Bayesian version: −2 × log-likelihood + log(#observations) × #parameters. See Paczkowski (2022b) for a discussion of the AIC and BIC. The estimated parameter results in Fig. 5.9 are all statistically significant except for the attribute C, which is marginally significant. These parameters are used to calculate the choice probabilities using (5.6). I prefer and recommend that these probabilities be simulated as part of a larger complex system simulation. I do not believe that the individual probabilities outside of the context of a system view based on a relevant scale-view are meaningful or useful. Therefore, I will delay discussing predicting probabilities until Chap. 10.
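These fit statistics are easy to reproduce by hand from the two log-likelihood values reported in Fig. 5.9:

```python
import math

ll, ll_null = -3339.506, -4394.449
k, n_obs = 5, 4000                       # parameters and choice-set observations

pseudo_r2 = 1 - ll / ll_null                      # McFadden's pseudo-R-squared
pseudo_r2_bar = 1 - (ll - k) / ll_null            # adjusted version
aic = -2 * ll + 2 * k                             # badness-of-fit
bic = -2 * ll + math.log(n_obs) * k               # Bayesian version
print(round(pseudo_r2, 3), round(pseudo_r2_bar, 3))   # 0.24 0.239
```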
Fig. 5.9 This is the discrete choice estimation setup and results using choicemodels. The data are from the full DataFrame created by the Python script in Fig. 5.7
5.3 Purchase Intent Analysis Purchase intent is just that: intention to buy, perhaps based on price. An example survey question might be: “Would you buy X at $Y”? The key differentiator from the discrete choice framework is the word “choice.” Discrete choice involves comparing two or more items (e.g., products), while purchase intent is just the intention to buy without regard to other items. Purchase intent, however, could be given a choice interpretation in the sense that a consumer chooses (i.e., makes a decision) to declare his/her intention to buy or not to buy the product. This is a choice. I prefer, however, to refer to the outcome of this decision as a declaration of the chances that they will buy the product. As I will show below, the main output is a probability, but it is a binary probability. I will refer to this probability as a take rate.
Purchase intent could be used as a weighting factor or calibrator for discrete choice. For example, a discrete choice study survey respondent could say they would buy product X rather than products Y or Z. This is useful to know, but it does not indicate the strength of their intention to buy X—how likely are they to actually buy X? Purchase intent could be used to gauge the strength of commitment to buy and thus weight the probability of buying X.
5.3.1 Purchase Intent Survey Question A survey question for purchase intent is quite simple. A customer is simply asked if he/she will buy the product if the price is $Y. Some other attributes could be included, but these are usually used to describe the product. The scale used is often a 10-point Likert scale: 1 = Definitely Will Not Buy; 10 = Definitely Will Buy. This scale, however, is not used per se in statistical analysis because it is ordinal, and not continuous, and an ordinal regression model is slightly more complicated. Instead, the scale is typically recoded into top 3 box (T3B) with the following mapping:

\[ \ge 8 \rightarrow 1 : \text{will buy} \qquad (5.10) \]
\[ < 8 \rightarrow 0 : \text{will not buy} \qquad (5.11) \]
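In code, the T3B recoding is a one-liner; the ratings below are made up:

```python
ratings = [3, 8, 10, 7, 9, 1]                 # hypothetical 10-point responses
t3b = [1 if r >= 8 else 0 for r in ratings]   # top-3-box: 8, 9, 10 -> 1
print(t3b)                                    # [0, 1, 1, 0, 1, 0]
```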
This is now a binary measure that is easier to handle. There are some criticisms of this recoding, primarily the loss of information from the scale coded as 1–7. This is correct, but I believe that the loss is more than outweighed by the simpler model structure. Regardless, this T3B transformation is often used in practice. With this recoding, the model has a choice connotation. You are most likely familiar with OLS modeling, at least conceptually, from a basic statistics course as well as my high-level overview in Chaps. 3 and 4. In Chap. 3, I described a link function and noted that OLS used the Identity Link. But, as I mentioned in Chap. 3, there are other links, one of which is the Logistic Link. This link function is used to analyze the T3B data via logistic regression, which is a member of the GLM or regression family. The OLS model is inappropriate because it has several inherent problems when the target variable is binary. One problem is the non-normality of the target: it is assumed to be normally distributed in the OLS framework although normality is inconsequential for estimation; it is needed only for hypothesis testing. An appeal to the Central Limit Theorem, however, can minimize or eliminate any concerns with this problem. A second problem is the heteroskedastic nature of the disturbance term. In the basic OLS framework, this term is assumed to be homoskedastic. This problem, however, like the first one, is inconsequential because there are well-developed methods for dealing with heteroskedasticity. A third problem, however, is more serious and cannot be ignored or eliminated. This problem is concerned with the predictions from the model, and since our concern is with predictions, this problem is very important.
Remember that the target is transformed to be binary: just 0 and 1. An OLS model can produce a prediction of any value, not only between 0 and 1 but also outside the range of 0 and 1. If a prediction is between 0 and 1, then a simple rule can be set to convert that answer to 0 or 1. For example, the rule could be that any predicted value greater than 0.5 is converted to 1; 0 otherwise. If, however, the predicted value is outside the bounds of 0 and 1, then no rule can be used because the model predicts something that is not possible or interpretable. For example, what happens if the model predicts 2? What is 2? Similarly, what if the model predicts −2? What is −2? In fact, what is any negative number in this context? This last problem can be handled by using a function that always returns a value between 0 and 1. The logit link, which is based on the logistic distribution, is such a function; I review it in the next subsection.
5.3.2 The Logistic Regression

The logistic regression model is based on the logistic cumulative distribution function (CDF). A CDF is a function that maps a random variable to the open interval (0, 1). It is defined as Pr(X ≤ x) where x is a specific value of the random variable. This means that any random variable can be mapped to an interval that is easily interpreted as a probability. A CDF exists for a wide range of probability distributions. One, in particular, has a functional form that is very conducive for statistical analysis because of its almost simplistic form. This is the logistic distribution with CDF:

\[ \Pr_i(X = x) = \frac{e^{Z_i}}{1 + e^{Z_i}} \qquad (5.12) \]
\[ Z_i = \beta_0 + \beta_1 X_i \qquad (5.13) \]
See Paczkowski (2022c) for a discussion of this distribution. You can now use (5.12) as an empirical model. Notice that this model is a special case of the discrete choice model. It should be a special case since the purchase intent problem could be interpreted as a choice problem: to buy or not buy. The "1," as before, represents the "None" option; i.e., do not buy. With this interpretation, the model is

\[ \Pr_i(\text{Buy}) = \frac{e^{Z_i}}{1 + e^{Z_i}} \qquad (5.14) \]

and

\[ \Pr_i(\text{Not Buy}) = \frac{1}{1 + e^{Z_i}}. \qquad (5.15) \]
The two probabilities sum to 1.0, as they should for probabilities, which is easily seen. The variable Z is a linear combination of features of the product: price, size, weight, and so on.
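A quick numerical check of (5.14) and (5.15): for any value of Z, the two probabilities lie in (0, 1) and sum to one:

```python
import math

def pr_buy(z):
    """Logistic CDF: maps any real z into (0, 1)."""
    return math.exp(z) / (1 + math.exp(z))

def pr_not_buy(z):
    return 1 / (1 + math.exp(z))

# Even fairly extreme linear predictors stay strictly inside (0, 1).
probs = [(pr_buy(z), pr_not_buy(z)) for z in (-30, -2.0, 0.0, 2.0, 30)]
```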
5.3.3 Purchase Intent Study Design

The design of a purchase intent study depends on how you want to collect your data. You have two options:
1. surveys; and
2. click-stream observations.
A survey-based approach for collecting purchase intent data is simpler than that for a discrete choice study. In the discrete choice case, choice sets have to be created to accommodate the complexities of the products and the different possible choice situations. In addition, the attributes and levels for the product features have to be specified, which may not be easy to do. My experience has been that product managers often do not fully understand the products they want to be tested. And they may not understand competitive products which have to be included in the choice sets. Finally, the levels of the competitive products have to be varied for the choice design development, but product managers may not, and often do not, have sufficient competitive intelligence to specify what levels will change, if any, and how. Their input into the choice design is critical, but may be limited and tenuous at best. This just makes the SP discrete choice analysis more challenging, to say the least. A purchase intent study, however, although limited in scope compared to a discrete choice study, either SP or RP, is much simpler to design. You only need a properly worded question for intent. A good example question is the one I provided in Sect. 5.3.1. And the scale I provided there usually suffices. Conditions for purchase can, of course, be stated before the purchase intent question is asked. The sale price is one obvious condition.
5.3.4 Purchase Intent Estimation Purchase intent probabilities are estimated using the maximum likelihood method because of the nonlinearities of (5.12). The Python statsmodels package has a module, Logit, that can be used to estimate a binary logit purchase intent model.
5.3.5 Purchase Intent Example This example is based on fictitious data of consumers arriving at an online store to buy a car floor mat. They see a price and a wait time before the product will ship if they place their order at that moment. Based on this information, they either choose to place their order or leave without placing one. The website keeps track of visitors, the price they were shown, and the wait time they were shown. In addition, a variable records if an order was placed or the customer left the site. This variable is a dummy variable: 1 if the order was placed; 0 otherwise. I show in Fig. 5.10 how I generated fictitious data for this example. I explain the functions to generate the random numbers in Chap. 9. You can immediately see that the data arrangement is simpler than the one for the discrete choice example. Each observation is just one person as opposed to a choice set. In addition, there are only two alternatives for a choice: buy or do not buy. The dummy variable reflects this. Estimation is still done at the individual level using maximum likelihood. I show the estimation setup in Fig. 5.11 and the results in Fig. 5.12. For this example, I used statsmodels. The interpretation of the results is the same as for the discrete choice example.
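A sketch of one way such data could be generated (the coefficients below are hypothetical, not the ones used in Fig. 5.10):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
price = rng.uniform(20, 40, n)            # price shown to each visitor
wait = rng.integers(1, 15, n)             # shipping wait in days
z = 8.0 - 0.2 * price - 0.3 * wait        # hypothetical linear predictor
p = 1 / (1 + np.exp(-z))                  # logistic probability of ordering
ordered = rng.binomial(1, p)              # 1 = placed order; 0 = left the site
```

Higher prices and longer waits push the probability down, so the dummy variable carries the behavioral signal the logit model will recover.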
Fig. 5.10 This is a Python script to generate the fictitious purchase intent data
Fig. 5.11 This is a Python script to set up the estimation of the purchase intent data
There is one additional piece of information I requested after the estimation: elasticities. These are based on the average values from the sample data as well as the estimated parameters. Averages were used because the estimated probabilities are at the individual level, which is not of any practical use. The elasticities show the change in the average probability for a change in the average variable. In this example, the elasticities are what you should expect. The price elasticity is −1.8, which is highly elastic. This makes sense since there are alternatives and mats are not essential. The wait elasticity is −2.5, which is negative and highly elastic. This indicates that people are impatient and will leave without placing an order if they have to wait too long for delivery. So, the longer the wait, the lower the intent to buy.
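For a logit model, the elasticity of the buy probability with respect to a variable x, evaluated at the sample means, is the standard β · x̄ · (1 − p̄) formula. A sketch with made-up values (not those from Fig. 5.12):

```python
def logit_elasticity(beta, x_mean, p_mean):
    """Percent change in the buy probability for a 1% change in x,
    evaluated at the sample means (standard logit elasticity-at-means)."""
    return beta * x_mean * (1.0 - p_mean)

# Hypothetical: coefficient -0.3, mean of x = 10, mean take rate 0.5.
print(round(logit_elasticity(-0.3, 10.0, 0.5), 2))   # -1.5
```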
5.4 Choice Predictions Whether you use the discrete choice framework or the binary logit version, the data and estimation are at the individual level. The resulting probabilities are, therefore, at the individual level. As Cramer (1991, p. 83) noted, this is by the design of the basic theoretical choice model. For some scale-views, this level of the probabilities may be needed; for others, definitely not. As an example, from an operational perspective, price and product promotions can be offered in real time in an online ordering system if it is determined that a customer may leave the site without
Fig. 5.12 These are the results for the logit estimation based on the setup in Fig. 5.11
placing an order. The determination is based on the probability of that customer placing an order, so the needed probabilities are at the individual level. For a tactical or strategic scale-view, however, aggregate probabilities are needed. Aggregation is to the market level which is larger than the sample level used to estimate the probabilities. See Cramer (1991) for comments about aggregation levels. Aggregation is an important step for demand modeling because, ultimately, it is not the individual but the market that is of interest. The market summarizes the
joint behavior of large groups of people. Businesses can then target their marketing campaigns to these groups rather than a specific campaign for a specific individual. The study of individual behavior, as in the formulation of an underlying consumer demand model, is merely the vehicle for market analysis. Most times, you will need aggregate predictions of choice probabilities for a population, i.e., the market. The predictions are what you would expect to see in the population, on average. The population depends on your scale-view. For a tactical problem, say for a single product in a multiproduct business, the population is the addressable market for that product. For a strategic scale-view for that same enterprise, the population is the customer base for all its products. Also, the customer base could be broader if mergers are being considered. So the definition of the population varies by the scale-view. Market demand estimates are computationally simple to obtain. As a first step, you average the choice probabilities. If \( \Pr_i(j) \) is the choice probability for individual i for option \( j, j = 1, 2, \ldots, J \), and M is the size of the relevant market, then an aggregate average probability is

\[ \overline{\Pr}(j) = \frac{1}{M} \sum_{i=1}^{M} \Pr_i(j). \qquad (5.16) \]

See Cramer (1991, p. 85). These are estimated market shares. Note that \( \sum_{j=1}^{J} \overline{\Pr}(j) = 1 \), as should be expected. See the Appendix for this. Aggregate demand for product j is then

\[ \hat{D}_j = M \, \overline{\Pr}(j). \]
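The aggregation in (5.16) is just a column mean over individuals followed by a scale-up by M; a sketch with made-up individual probabilities:

```python
import numpy as np

# Rows: individuals in the market; columns: choice options (made-up numbers).
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.4, 0.4, 0.2]])
M = 1_000_000                      # size of the addressable market

shares = probs.mean(axis=0)        # aggregate average probabilities, eq. (5.16)
demand = M * shares                # expected units demanded per option
```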
The variance of this prediction is a function of the probability to buy and not buy; they are binomials. Consequently, the variance will reach a maximum at 0.25 (= 0.5 × 0.5). But it is heavily weighted toward zero if the market is even moderately large. See Cramer (1991, pp. 85–86) for a discussion of the variance. There is an implicit assumption regarding the aggregate estimator in (5.16): you have data on the M individuals in the addressable market. How do you obtain this data? One source is the business's databases. This will depend, however, on the size of the business, not just in terms of market share (the percentage of its market) but also in terms of the absolute number of customers it has and the nature of its data on these customers. This would be a case of Big Data. Paczkowski (2018, Part IV) has an extensive discussion of the nature of Big Data, how to build econometric models using Big Data, and how to use Big Data for pricing. If the Big Data database is extensive enough, then it could be used to represent the market. Assuming that the individual-level characteristics (e.g., age, gender, income) used in the choice model estimation are available in the Big Data database, you merely have to query it and extract all the individuals with these characteristics. You then have a "market" view based on the same characteristics as those used
Table 5.1 This is an example of a cross-tab of the joint distribution of age and income. The cells are the number of individuals with that combination of age and income.

\[ \frac{\sigma^2}{n} + \sigma^2 \sum_{i=1}^{n} w_i^2 \;\ge\; \frac{\sigma^2}{n} = V(\bar{X}). \]

So the variance of \( \bar{X} \) is smaller than the variance of any other estimator of the population mean. Notice that the proof depends on

\[ \sum_{i=1}^{n} k_i^2 = \sum_{i=1}^{n} \left( \frac{1}{n} + w_i \right)^2 = \sum_{i=1}^{n} \left( \frac{1}{n^2} + \frac{2 w_i}{n} + w_i^2 \right) = \frac{1}{n} + \sum_{i=1}^{n} w_i^2. \]

What happened to \( (2/n) \sum w_i \)? It cancels because of the unbiasedness condition. This is an analytical demonstration of the statement: "The mean of the means is the population mean." I show in Fig. 10.14 how you could verify this. The number of iterations (iter) is set to 1000, which means that I took 1000 separate draws
10 Examples of Stochastic Simulations: Monte Carlo Simulations
Fig. 10.14 This illustrates a Monte Carlo simulation of \( E(\bar{X}) = \mu \). The descriptive statistics, histogram, and KDE plot are based on 1000 sample draws, each of size n = 100
Fig. 10.15 These are the Jarque-Bera Test results for the 1000 runs from Fig. 10.14
from a Gaussian normal distribution. I set the population mean at μ = 10 and the standard deviation at σ = 1. The normal distribution was based on Numpy's random package's normal function that I described in Chap. 9. The population of values is the draws. The population size was set to N = 10,000. For each iteration in the simulation, a sample was drawn from the population; the sample is selected using the random package's sample function, which I also described in Chap. 9. The mean of each sample was calculated and stored (i.e., appended) in a list along with the iteration number. The descriptive statistics were calculated and a histogram was drawn with a Kernel Density Estimator (KDE) overlaid to show the distribution. See Paczkowski (2022b) for a discussion of the KDE graph and its uses. You can see from the descriptive statistics that the mean of the 1000 samples of size n = 100 is 10, the population mean.5 This is exactly what the analytic demonstration showed. Incidentally, notice that the descriptive statistics panel shows the standard deviation to be 0.10. I had set the population standard deviation at σ = 1. These are consistent because σ_X̄ = σ/√n. A simple calculation will show you that the 0.10 value is correct.6 The histogram and the accompanying KDE plot suggest normality for the 1000 sample means. I used the Jarque-Bera Test to check. See Hill et al. (2008) for a description of this test. I show the test for normality of the 1000 calculated sample means in Fig. 10.15. This test involves comparing the skewness and kurtosis of the data distribution to that of the assumed normal distribution. The test statistic is

$$JB = \frac{n}{6} \left( S^2 + \frac{(K-3)^2}{4} \right) \tag{10.24}$$
5 The mean shown is 9.99, which differs by rounding.
6 Be careful if you verify the 0.10 value. It is based on the square root of the fixed sample size, n = 100, not on the number of runs, which is 1000. Each of those 1000 runs is based on the n = 100.
where S is the skewness measure and K is the kurtosis measure. These are defined as:

Skewness A measure of symmetry. A distribution is symmetric if it is the same on both sides of the center. S = 0 indicates symmetry.

Kurtosis A measure of the thickness of both tails of a distribution relative to the normal distribution. A distribution with high kurtosis has thick tails. K = 3 for the normal distribution.

See NIST (2012, Section 1.3.5.11: Measures of Skewness and Kurtosis). The Null Hypothesis is that the data are distributed normally: there is no statistical difference between the normal distribution and the empirical distribution. You can see from Fig. 10.15 that the Null Hypothesis of Normality is not rejected.
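The simulation of Fig. 10.14 and the Jarque-Bera calculation of (10.24) can be sketched as follows. This is a minimal version, not the book's exact code: the seed is arbitrary, and Numpy's choice function stands in for the random package's sample function.

```python
import numpy as np

rng = np.random.default_rng(42)

# Population: N = 10,000 draws from a normal with mu = 10, sigma = 1
N, n, iterations = 10_000, 100, 1_000
population = rng.normal(loc=10, scale=1, size=N)

# Monte Carlo loop: draw 1,000 samples of size n = 100, store each mean
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(iterations)
])

# The mean of the means is close to mu = 10; the standard deviation of
# the means is close to sigma / sqrt(n) = 0.10
print(round(sample_means.mean(), 1))       # 10.0
print(round(sample_means.std(ddof=1), 1))  # 0.1

# Jarque-Bera statistic from Eq. (10.24): JB = (n/6)(S^2 + (K - 3)^2 / 4)
d = sample_means - sample_means.mean()
S = (d**3).mean() / (d**2).mean() ** 1.5   # skewness
K = (d**4).mean() / (d**2).mean() ** 2     # kurtosis
JB = len(sample_means) / 6 * (S**2 + (K - 3) ** 2 / 4)
# For normal data, JB follows a chi-square(2) distribution; values below
# the 5% critical value of 5.99 mean normality is not rejected
print(JB)
```

A small JB value here reproduces the conclusion of Fig. 10.15: the sample means behave like draws from a normal distribution.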
10.3.4 Use-Case 4: Central Limit Theorem

For another example of a Monte Carlo simulation, consider the Central Limit Theorem (CLT), which is a major theorem in statistics. It states that for a random variable from any distribution with finite variance, the standardized sample mean is approximately normally distributed, with the approximation improving as the sample size grows.

Theorem If X₁, X₂, …, Xₙ, … are random samples drawn from a population with overall mean μ and finite variance σ², and if X̄ₙ is the sample mean of the first n samples, then the limiting form of the distribution, $Z = \lim_{n\to\infty} \sqrt{n}\,(\bar{X}_n - \mu)/\sigma$, is a standard normal distribution.7

This is a very powerful theorem because it allows you to use the asymptotic properties of the normal distribution to simplify analytical results. You can use a Monte Carlo simulation to show that the CLT holds. For this example, I use the chi-square distribution with k = 5 degrees-of-freedom as the underlying distribution for sampling. It can be shown mathematically that if the random variable X is distributed following a chi-square distribution with k degrees-of-freedom, then E(X) = k and V(X) = 2 × k. I show this distribution in Fig. 10.16 and then the Monte Carlo simulation in Fig. 10.17. The simulation uses the Numpy random package's chisquare function (see Table 9.1) to draw the random variates. This requires the degrees-of-freedom and the sample size to draw, which is n = 1000 in this case. The mean of the n = 1000 values is calculated using the mean function. These means are then appended to a list. The histogram suggests normality but the
7 Source: Wikipedia: "Central limit theorem". https://en.wikipedia.org/wiki/Central_limit_theorem. Last accessed March 1, 2022.
Fig. 10.16 This shows the chi-square distribution with k = 5 degrees-of-freedom
Jarque-Bera Test was done to confirm this suspicion. The results, which I show in Fig. 10.18, confirm it; the Null Hypothesis is not rejected.
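The logic of the CLT simulation in Fig. 10.17 can be sketched as follows. This is a minimal version under my own assumptions (arbitrary seed), not the book's exact code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underlying distribution: chi-square with k = 5 degrees-of-freedom,
# so E(X) = 5 and V(X) = 2 * 5 = 10
k, n, iterations = 5, 1_000, 1_000

# For each iteration, draw n chi-square variates and append the mean
means = np.array([rng.chisquare(k, size=n).mean() for _ in range(iterations)])

# The CLT says the means are approximately normal, centered on k = 5,
# with standard deviation sqrt(V(X)/n) = sqrt(10/1000) = 0.1
print(round(means.mean(), 1))  # 5.0
print(means.std(ddof=1))       # close to 0.1
```

Even though each draw comes from a skewed chi-square distribution, the histogram of these 1000 means is approximately normal, which is what the Jarque-Bera test in Fig. 10.18 confirms.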
10.3.5 Use-Case 5: Integration

This use-case may seem odd, almost out of place in a general discussion of the use of Monte Carlo simulations to solve statistical, distributional problems. Monte Carlo integration, however, can be given a statistical interpretation. Before I discuss this interpretation, first recall from basic calculus that integration is fundamentally the reverse of differentiation. For example, suppose you have the function
Fig. 10.17 This is a Monte Carlo simulation of the Central Limit Theorem. The underlying distribution is the chi-square I show in Fig. 10.16
$$f(x) = 2x^2 - \frac{1}{3} x^3 \tag{10.25}$$
The first derivative is 4x − x². You can check this derivative using the Python package sympy, which does symbolic math. See the Appendix for this example. Suppose, instead, that you have the function

$$g(x) = 4x - x^2 \tag{10.26}$$
Fig. 10.18 These are the Jarque-Bera Test results for the n = 1000 samples from Fig. 10.16
and you want to know a function f(x) such that its derivative is g(x). That is, you want to find the antiderivative of g(x). You could get your answer by using the indefinite integration of g(x). This is defined as

$$f(x) = \int g(x)\,dx \tag{10.27}$$
$$= f(x) + C \tag{10.28}$$
where C is the constant of integration; it is needed because you do not know whether the original function you seek had a constant or not. This C factor is why this integration is called indefinite integration. Using the rules of integration, the integral of (10.26) is 2x² − (1/3)x³ + C. If C = 0, then clearly this is f(x) as desired. See the Appendix for the sympy version. Many calculus textbooks discuss indefinite integration methods. See, for example, Granville et al. (1941) and Thomas (1966), just to mention two classic textbooks. Another way to view integration is to use the integration methods established for the indefinite problem to find the area under the curve that results from f(x) for all values of x between a lower limit, a, and an upper limit, b; that is, for x ∈ [a, b]. This is a definite integral because it is "definitely" defined by the bounds of x. The definite integral is written as
$$I(x) = \int_a^b g(x)\,dx \tag{10.29}$$
Since the goal is to determine the area under the curve, the integral result must be evaluated at the two bounds, a and b. This is done by substituting the upper bound, b, for x in the integral solution, then substituting the lower bound, a, for x, and subtracting the two. For my example above, suppose x ∈ [0, 4]. I show a graph of the function in Fig. 10.19. The integral solution is
Fig. 10.19 This is a graph of the example integration function
$$I(x) = \int_0^4 (4x - x^2)\,dx \tag{10.30}$$
$$= \left[ 2x^2 - \frac{1}{3} x^3 \right]_0^4 \tag{10.31}$$
$$= (32 - 64/3) - 0 \tag{10.32}$$
$$= 32/3 \tag{10.33}$$
$$= 10.667. \tag{10.34}$$
See the Appendix for the sympy version. You can interpret the 10.667, the area under the curve for x ∈ [0, 4], as a theoretical value. I show in Fig. 10.20 how you could approximate or estimate the area using rectangles drawn for various x
Fig. 10.20 This illustrates how to approximate the area under a curve using Riemann’s rectangular approximation. Notice how the area is more closely approximated as the number of rectangles increases
values and the corresponding function values. The rectangles are drawn so that the left side of each meets the curve. The area of each rectangle underestimates the area under the curve for all rectangles to the left of the center line (the value of x that maximizes the function) and overestimates it to the right. The total area is the sum and is called a Riemann Sum. The rectangles could have been drawn so that the right edge meets the curve, in which case the rectangles to the left would overestimate and those on the right underestimate. They could also have been drawn so that the center of each rectangle meets the curve, in which case, for a concave function such as this one, they would all overestimate the area. Regardless of how the rectangles are drawn, as the number of rectangles becomes large (32 in this example), the over- and underestimates balance so that the true area is closely approximated. You can see that for 32 rectangles, the total area of the rectangles is 10.667, which matches the theoretical value. I illustrate a Monte Carlo integration for this problem in Fig. 10.21. You can see that the estimated area almost matches the theoretical area.
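Both approximation ideas can be sketched numerically. This is a minimal illustration, not the book's code behind Figs. 10.20 and 10.21; the Monte Carlo sample size and seed are arbitrary choices of mine.

```python
import numpy as np

def g(x):
    return 4 * x - x**2

a, b = 0.0, 4.0
exact = 32 / 3  # = 10.667 from Eq. (10.33)

# Left-endpoint Riemann sum with 32 rectangles, as in Fig. 10.20
n_rect = 32
width = (b - a) / n_rect
left_edges = a + width * np.arange(n_rect)
riemann = (g(left_edges) * width).sum()

# Monte Carlo integration: (b - a) times the average of g evaluated at
# uniform random points in [a, b] estimates the definite integral
rng = np.random.default_rng(1)
u = rng.uniform(a, b, size=100_000)
mc_estimate = (b - a) * g(u).mean()

# Both values are close to the exact area 32/3 = 10.667
print(round(riemann, 2), round(mc_estimate, 2))
```

The Monte Carlo line is the statistical interpretation mentioned above: the integral equals (b − a) times the expected value of g(X) for X uniformly distributed on [a, b], so the sample average of g at random points is an unbiased estimator of the area.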
Fig. 10.21 This is a setup for a Monte Carlo integration for the function f(x) = 4x − x²
10.4 Appendix

10.4.1 Using Symbolic Math

Python has a very useful, but underutilized, package that does symbolic math: sympy. This can be very useful for simple calculations, algebraic manipulations, calculus solutions (i.e., derivatives and integrals), simultaneous equations solutions, and linear algebra solutions. The sympy documentation states that this is a substitute for Mathematica and Maple. To use sympy, you must first import the package using import sympy as sym, where I used the alias sym. Next, you have to define any symbols (e.g., the traditional "x" and "y" of mathematics). You could also symbolically define your equation on a separate line, although this is not necessary; you could define it directly in whatever function call you use. I provide examples in Figs. 10.22, 10.23, and 10.24. These correspond to the derivative and integral examples in the text.

Fig. 10.22 This is an example of symbolic differentiation using sympy. I defined the function to differentiate and then used it in the diff method which finds the derivative. Notice that this result agrees with the text
Fig. 10.23 This is an example of symbolic indefinite integration using sympy. I used the integrate method. Compare the result to the one in the text
Fig. 10.24 This is an example of symbolic definite integration using sympy. I used the integrate method as in Fig. 10.23, but I specified the limits of the integral as 0 to 4 in the parentheses along with the “x” symbol. Compare the result to the one in the text
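The three sympy examples shown in Figs. 10.22–10.24 can be sketched as follows (a minimal version of the ideas in those figures, not the book's exact code):

```python
import sympy as sym

x = sym.symbols('x')

# Differentiation (cf. Fig. 10.22): d/dx [2x^2 - (1/3)x^3] = 4x - x^2
f = 2 * x**2 - sym.Rational(1, 3) * x**3
df = sym.diff(f, x)
print(df)  # 4*x - x**2, possibly printed in a different term order

# Indefinite integration (cf. Fig. 10.23); sympy omits the constant C
g = 4 * x - x**2
F = sym.integrate(g, x)
print(F)   # 2*x**2 - x**3/3, up to term order

# Definite integration (cf. Fig. 10.24) with the limits 0 to 4
area = sym.integrate(g, (x, 0, 4))
print(area)  # 32/3
```

Note that sympy returns the exact rational 32/3 rather than the decimal 10.667, and that the indefinite integral is reported without the constant of integration.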
Part IV
Melding the Two Analytics
This last part is the heart of the book. I will argue why the two approaches should be brought together and illustrate concepts for the operational, tactical, and strategic scale views.
Chapter 11
Melding Predictive and Simulation Analytics
I introduce the melding of Predictive and Simulation Analytics in this chapter. This is best done by examples. The perspective for the examples varies by the scale view of a decision-maker. Since the scale views can be many and varied, I chose to use the three I frequently mentioned in earlier chapters: operational, tactical, and strategic. These are only meant to illustrate my ideas.
11.1 A Framework

All analysis of data must be done within the context of a framework, whether it is a simple analysis involving means and standard errors or a complex analysis involving multiple regressions with time series data. A framework is sometimes referred to as a theory, but it does not have to be a theory per se. It could simply be a series of well-defined and logical steps you follow and adhere to in order to make your analyses successful. An analysis is successful if you extract from your data the Rich Information you need for a complex decision. No analysis for information extraction is ever done within a vacuum. I prefer to refer to this framework as an Analytical Framework. A theoretical framework is a subset. An Analytical Framework helps you organize your thoughts regarding what has to be done as well as what can be done with your data; guides you in identifying hypotheses worth testing; and specifies what results are likely to occur from your analyses so that you can judge their reasonableness. Without a framework, you could never answer the key questions about the results of any analysis you do:

• "Do they make sense?"
• "Are they Rich Information for decision-making?"

As an example of a theoretical framework, consider the economic concept of a demand curve. There is a very well-developed theory of consumer demand involving
Fig. 11.1 This is a schematic diagram of an Analytical Framework for melding Predictive Analytics and Simulation Analytics. It comprises a Predictive Module, a Simulation Module, and an Analytics Module within a Scale-View Dependent Block, with supporting input blocks for Analytical Data, Experimental Design, Analysis Criteria, Policy Factors, SMEs/KOLs, Modeling Criteria, and Hyperparameters, leading to Rich Information Reports
utility maximization subject to a budget constraint. This leads to a conclusion about the relationship between the consumption (i.e., purchasing) of a product and its own price, the cross-prices of complementary and substitute products, and income. The relationship between consumption and the own-price, for instance, is shown to be negative, the infamous downward-sloping demand curve. This consumer demand theory is the framework for all the empirical demand studies in economics as well as in business. In fact, it is the framework that was really behind a lot of the predictive analyses I covered in prior chapters. For instance, it is explicit in the choice model I described in Chap. 5. It is, thus, also an Analytical Framework.

Consider the Analytical Framework I show in Fig. 11.1 for melding Predictive Analytics and Simulation Analytics. This is a high-level depiction of how to bring together collections of analytical methodologies embodied in modules in which a series of analytical actions take place. There are three main modules and six supporting input blocks that provide needed input to the modules. The three modules are:

Predictive Module The predictive models
Simulation Module The simulations
Analytics Module The analysis of the simulation and predictive results

The six supporting input blocks are:

Analytical Data Provides any required input data
Experimental Design May be required for the simulation experiments
Analysis Criteria Benchmarks, precision requirements, correlations, etc., for the decisions
Policy Factors Key levers that decision-makers want to manipulate
SMEs/KOLs Subject-matter Experts and Key Opinion Leaders advice and recommendations
Modeling Criteria Modeling specifications

The Analytical Data block encompasses the data streams, conventional and real time, that I discussed in Chap. 6. You should refer to Fig. 6.4 for the complete dataflow. The Experimental Design block accepts a Policy Factors block if policy factors are to be analyzed. The Analysis Criteria are all the externally specified quantities needed to judge the adequacy and appropriateness of both the predictions and simulations working together.

In this scheme, estimated or trained models from the Predictive Module are inputs into the Simulation Module so that the simulations execute those models. However, the Simulation Module could feed back into the Predictive Module to update it. I represent this feedback with a dashed line. The Simulation Module also indirectly feeds back to the Predictive Module through any hyperparameters in the predictive models. Not all predictive models have hyperparameters, OLS and logistic regression being two examples. But others, such as Decision Trees, do. Recall that hyperparameters are not estimated using data but are, instead, specified by you before any modeling is done. The hyperparameters are critical for many models that we looked at. If they are misspecified, then the prediction models will be incorrect, leading to incorrect predictions and decisions. This is worse than providing Poor Information. Unfortunately, you have no way of knowing beforehand how to set them; it might just come down to trial and error.
Simulations can help with this issue, hence the feedback. The Policy Factors block could be a source for the hyperparameters. The Analytics Module is where simulation results are distilled for their Rich Information. This Rich Information is then passed to a reporting system that may include dashboards. Finally, there is an input block for SMEs and Key Opinion Leaders (KOLs). The SMEs can provide valuable insight and leadership in the model design through the hyperparameters. The KOLs, as industry thought-leaders, are typically outside the business. They can provide similar input but from a broader, outside perspective.
Both sets of experts are vital for the hyperparameters since these are not estimated parameters but prespecified. The Predictive and Simulation Modules are actually encased in a scale view, meaning that they are only relevant and have meaning within the context of a particular scale view. There is not one and only one predictive model or simulator in their respective modules but many that have meaning only for a specific view. There are three scale views I focus on:

1. Operational
2. Tactical
3. Strategic

Before I discuss these in the context of the Scale-View Dependent Block, let me discuss the overall melding process.
11.2 Melding Process Overview

Figure 11.1 is a schematic diagram of three major modules and six supporting blocks for a predictive-simulation perspective for providing Rich Information for decision-making in a complex system. The modules are where detailed procedures and methods are handled, the specific procedures and methods dependent on the scale view of a decision-maker. The blocks are inputs to the modules. The complex system defines the scale view. There is not one system size or complexity measure; it is a generalized concept without a one-size-fits-all definition. A system such as AT&T, IBM, or Microsoft is complex just as a midsized manufacturing company is complex. Complexity means having a number of interacting and interconnected parts. Regardless of how you define a complex system, there are still different views depending on a decision-maker's level and responsibilities. These are the scale views I discussed earlier. For each scale view, there are two functions I am concerned with in this book. The first is a predictive model or modeling function that is in the Predictive Module. This guides the managers, regardless of their level and scale view. The guidance is for future business procedures and actions. The operative word is "future." All decisions, regardless of the scale view, are future oriented. They result in returns (e.g., earnings, market share, and shareholder value) in the future; the past is, of course, gone, but the adage "What is past is prologue" holds.1 In our case, the past is reflected in data and the "prologue" is what will happen. My focus in Part II of this book was to provide the highlights of predictive models, their logic, forms, and key summary measures. That part should stand on its own as an independent introduction to predictive modeling. More detailed development and presentations
1 From William Shakespeare, The Tempest, Act 2, Scene I.
are available beyond what I showed there. You are encouraged to look at the references if you need or want more understanding of predictive analytics. Most importantly, the different predictive models come into play for a business problem depending on the scale view of the manager needing the prediction, needing the information for a decision based on the prediction. An operations manager, one concerned with daily processes and functions, requires a prediction that is different from that required by a strategic manager. But this does not mean that the process of predicting is different. The focus and requirements are different; the predictive process is the same regardless of the needs of the end user of the predictions. Data must be collected and analyzed before a model is estimated or trained; a model must be specified based on economic reasoning, experience, or industry standards; statistical and optimization methods must be applied correctly; and standard errors must be calculated and assessed. Just the details, scope, and end results vary by scale view. The second function is a simulation in the Simulation Module that determines and assesses development paths of processes. These lead to the predictions and their predictive intervals that also differ based on the scale view. Simulations are necessary because the prediction standard errors are usually more complex than what the models in prior chapters indicate. More importantly, a predictive model is often just one input into a larger modeling/analysis framework. This is an analysis system that is fundamentally no different than the complex systems that are a focal point of this book. Of course, I run the risk of overusing and abusing the word “system,” but the essence of the concept, that there are many interactive and interconnected parts that contribute to a whole and that produce more than the sum of the parts (i.e., emergence), applies here as well. 
In short, the predictive analytics process is the same for a complex system such as a business, government agency, the economy, human interactions, and so forth. There is a feedback loop between the Predictive Module and the Simulation Module. What I described above is unidirectional from the Predictive Module to the Simulation Module. But the flow could certainly be from the latter to the former as a way to update the predictions. Part of the updating is via the hyperparameters. The simulations can feed the predictive models both through updated model components (e.g., random noise in a random walk model) and through the hyperparameters. The experimental design, an input into the Simulation Module, guides the update order so that the simulationist can judge which update causes a change. The design enables an attribution of differences. An attribution of differences is a procedure for dissecting a change in a Key Performance Measure (KPM) into constituent parts to determine which part contributes the most to the KPM. For example, suppose a predictive model for sales (the KPM) for a 5-year planning period shows they will increase if price is steadily reduced during those 5 years, competitors also reduce their prices but not as aggressively, household income increases due to stronger economic growth, and technology improvements in logistics systems occur. Which of these contributes the most to the improved sales forecast? This would be very Rich Information that is insightful (i.e., which
factors are the main contributors), actionable (i.e., where can action directives be issued and to what extent), and useful (i.e., how much to change a factor, if possible). There is one last module in Fig. 11.1, the Analytics Module. This is where the results of the other two modules are drawn together and analyzed to extract Rich Information for decision-makers. This module is as much a part of the analysis system as the other two. The tools and methods for analysis, however, are not dependent upon the scale view. They are generic and can be used for any scale view, which is why the Analytics Module is outside the scale views block. However, the depth of, and use of, the tools depends on the scale view. These tools include but are certainly not limited to:

• Data visualization
• Regression analysis
• Classification analytics

These are the main tools of statistical and econometric analysis in the context of Business Analytics. See Paczkowski (2022b) for a detailed treatment of Business Analytics as well as Paczkowski (2020a) for new product development applications. It is also possible that the results could be posted to a dashboard for management review and possible data visualization interaction and querying. This is Business Intelligence. Data visualization is a mainstay of any data analytics. We are visual creatures, even though we are handicapped by being unable to see in more than three dimensions. Nonetheless, the creative use of visual displays at higher dimensions can help us see patterns in the data that are not readily evident; we can get a lot of Rich Information from data visualization. However, being able to see the Rich Information is only partly dependent on the visual displays. There is also a skill set needed to fully extract Rich Information visually. There is a set of Gestalt Principles for examining and studying visual displays of data. See Paczkowski (2022b, Chapter 4) for discussions of these principles. I focus on four:

• Proximity Principle
• Similarity Principle
• Connectedness Principle
• Common Fate Principle
All four Gestalt Principles help you to not “look” at a data display but to “see” patterns, trends, and associations that otherwise would not be visible in graph objects such as points in a scatter plot, lines in a line plot (e.g., time series plot), and bars in a bar chart or histogram. Seeing is a cognitive function and looking is a physical function. The Proximity Principle says to see clusters of data objects that are close to each other while the Similarity Principle says to identify objects that appear to be the same, perhaps in size, orientation, and location in the display area. Connectedness, fundamentally, refers to correlation, while Common Fate refers to trend (e.g., the objects decay at the same rate).
Regression analysis is often misunderstood because of the way it is introduced in a basic statistics course. It consists of a series of approaches that constitute a family of methods. The family has one objective: to explain variations in a dependent or target variable using a set of independent or factor variables. OLS is the first, and in many instances the last, member of the family that is learned, which is what causes the misunderstanding: everything is looked at through the lens of OLS. Other family members are more advanced and require more training. Nonetheless, the regression family should be in your data analytics toolkit. I outlined some of these members in Part II. Also see Paczkowski (2022b) for advanced discussions of many of them. Classification Analytics enables you to classify new objects into one of several groups. Some methods rely on at least one member of the regression family: logistic regression. Others are distinct enough, but yet share a commonality, which is why I prefer to lump them into one category. Support Vector Machines and clustering methods are main examples. The Analytics Module is where Rich Information is extracted. But the type of Rich Information, its potential use, the requirements of the users, and the scope of the information must be identified and known before any analytics of any type can be done. In most textbooks on, say, data mining, there are upfront comments and recommendations on how you should plan an analysis of your data. These include but are certainly not limited to:

1. Define the project purpose.
2. Compile the needed data.
3. Preprocess the data including visualization.
4. Identify the analytic approach and software.
5. Perform the modeling and simulations.
6. Analyze and report the results.
See Shmueli et al. (2020, Chapter 2), for example, for a similar list for data mining projects. Consequently, there is an Analysis Criteria Block that is an input into the Analytics Module. This block specifies what the decision-makers need for their decisions and thus sets the parameters for the type of analysis that must be done. For example, a pricing manager may need to know the demand elasticity and the effect on sales, revenue, and contribution margin for the next 5 years as a result of a price change. Competitive moves have to be assessed as well. These are the criteria: elasticity, sales, revenue, margin, and competitive response. They define the type of analysis to be done and the analytical tools that should be used. I mentioned in Chap. 8 that simulation experimental design is a new area, so a lot of developmental work is needed. Nonetheless, if there are many factors that have to be tested through a simulator, then an experimental design is advisable to ensure that all the factors are tested and sufficient data are collected on each, plus their interactions, so that appropriate analysis can be done in the Analytics Module. The policy factors block is the collection of policy levers (e.g., price, advertising budget, number of servers or manufacturing stations, and number of manufacturing robots) that are relevant and that can be experimentally changed. These should
correspond to the Analysis Criteria Block requirements that lead to the Rich Information, so there is a connection between the two blocks. The connection may not be strong or always relevant, however, hence the dashed line in Fig. 11.1. Finally, there is the Analytical Data Block that provides the data for the Predictive Module. This is the result of the schematic in Fig. 6.4. Whatever type of predictive model is developed, it will require data. The two types of data that I described in this book are time series and cross-sectional data, although I more extensively emphasized the former. This is due to the very important nature of the decisions that motivate most of the analytics done in a business context: the decisions are time oriented. Whether a price is changed to combat competitive actions, a new technology is introduced that more efficiently services customers, or a contract with a new, more reliable supply-chain vendor is negotiated, the actions will have consequences into the future; these are time-based decisions. Cross-sectional data, and the corresponding cross-sectional models, are definitely used, but to a lesser extent. I discuss their use in Chap. 12.
11.3 Three Scale Views

I argued in Chap. 2 that there is an inverse relationship between the complexity of a business as a system and the scale of viewing it. For instance, the smaller the scale view, the more detail is revealed, and the more complex the interactions and interconnections in the system that have to be managed. I illustrated this in Fig. 2.9. There is a continuum of scale views since the system is typically viewed at once from multiple angles and perspectives. These are due to the multiple layers of management in any business. Some businesses have a very deep multilayer management system while others have a shallow one. A company with multiple layers of management has a vertical or hierarchical management structure while one with few layers (minimum layer is, of course, one) has a flat structure. It is more informative to present only three scale views than an actual continuum of views because the continuum would be too unwieldy to interpret and use. As I mentioned so often, the three scale views are operational, tactical, and strategic. I summarize the three scale views and provide some example functions for each in Fig. 11.2. I provide some commentary on them in the following sections.
11.3.1 Operational Scale View

The operational scale view is narrowly focused on day-to-day business management. Details count, but they can be overwhelming at times. This is where the complexity of a system becomes evident. There could be many different interlocking departments and issues (i.e., crises as for Southwest Airlines in 2022) that have to be handled daily. For my purposes, flows of some sort are constant.
Fig. 11.2 These are just a few examples of main focus points for the three scale views I use in this book. (Figure content: Operational: daily processing, personnel, inventory, down time; Tactical: marketing Ps, competitive response, advertising, customer relations; Strategic: capacity planning, new products, mergers/acquisitions, organizational structure.)
For example, a problem I have often referred to in this text is an ordering system. Customers arrive at random times intending to place an order, then they place their order, and either join a waiting list (called a queue) to receive their product or receive it immediately. “Immediately” should not be literally interpreted. After all, the product has to be produced, stocked, and delivered. For a drive-through fast-food restaurant, “immediate” could be within 5 min after ordering. For an order placed on Amazon, “immediate” could be by 10 p.m. that evening (if the order is placed by noon that day or some other time restriction). The same could hold for home delivery of, say, groceries. For a home-shopping channel, “immediate” could be within 5–10 business days. If there is a queue, the customer is notified of a backlog in the supply chain and will be further notified when the product is available for shipping; the actual delivery could still take 5–10 business days, which adds to the total time. Customers could also randomly arrive to place an order, join a queue at that point before they can place their order, then either receive their order immediately, or join yet another queue to await fulfillment. Referring to a drive-through fast-food restaurant again, a queue could form leading to the order window or kiosk, and then another one could form for the payment window, and then a final one could form for the pickup window, so three queues could be involved. I illustrate a queueing system of orders in Fig. 11.3 in which an order is placed immediately but the customer may then have to wait to receive the product. An online ordering system (e.g., Amazon) or a home-shopping channel where orders are placed by telephone are examples. Notice that this view incorporates predictive modeling of arrival and service rates based on historical data.
For an online ordering system, the arrivals are to a website; for telephone ordering, they are calls into a call center (which could be automated or not). In either case, a predictive model is needed for resource planning as well as inventory and logistics management. These models could predict in real time (which means that they will constantly update the operational managers regarding flow) or they could run just once daily at the opening of business. Any of the models I discussed in Part II could be used with
11 Melding Predictive and Simulation Analytics
Fig. 11.3 This is an example of an operational scale view problem. A simple ordering system is shown. The predictive models are for the random customer arrival rates and the service rates
varying degrees of sophistication. The model I assume for simplicity for Fig. 11.3 is the constant mean model of Chap. 3. I will discuss this in more detail in Chap. 12. Simulations are used in this view to gauge variations in arrival and service rates. Figure 11.3 is a typical rendering of a queueing model, although I added components for the source for mean arrival and service rates. I discuss queueing models more extensively in Chap. 12. Most such example models just note that there are means, but do not mention, except possibly in a footnote, where those means come from. This is due to their focus: the operational model. A good example is provided by Ingalls (2013) who describes a drive-through fast-food restaurant queueing system. A customer randomly arrives to buy food, checks the number of cars waiting in line (i.e., the queue) to place an order at the order window or
Fig. 11.4 This depicts an expansion of the operational scale view of Fig. 11.3. (Figure content: a data warehouse, managed by IT/data managers, feeds arrival and service time series to the data science team, which estimates constant mean arrival and service rates; customer arrivals flow through immediate order processing or an order queue to order fulfillment; interconnections and interactions link inventory management, manufacturing/procurement, logistics/distribution, and service status; ordering KPMs feed marketing, sales, pricing, customer satisfaction, and a dashboard.)
kiosk, and either leaves the line disgruntled because the line is too long or waits and eventually places an order. My Fig. 11.3 is this process. It is really very standard. As another example for a call center, say for a home-shopping channel, see White and Ingalls (2018); the process is the same. Although what I depict in Fig. 11.3 is standard, it is at the same time slightly naive, misleading, and incomplete. It is an isolated process, unconnected from the other parts of the business. At least one other part I could mention is the supply chain of raw ingredients or materials. Ingalls (2013) does use scenarios to examine the effects of different situations on this one process for a drive-through fast-food restaurant. For example, a new technology scenario is examined to gauge the technology’s effect on reducing waiting time to place an order. But the interconnections and interactions with other parts of the business are not illustrated. I expand Fig. 11.3 in Fig. 11.4 to include interdepartmental interconnections and interactions. Interconnections are connections of one business unit to another. Think of this as a one-way flow from one unit to another. In Chap. 2, I described a flow of information (e.g., a quarterly forecast of sales) from a forecasting department to a sales department. This is a one-way interconnection because there is no, or perhaps little, reason for any information to flow from sales back to forecasting. Any sales
data the forecasting organization would need would come from the data warehouse maintained by IT. An interaction is a two-way flow: the units affect one another and transmit information to each other. As an example, there could be a two-way interaction between the inventory management system and the manufacturing system. Suppose a simple (s, S) inventory model is used in which the optimal level of inventories is S and the minimal amount to have before placing a restocking order is s. If inventory is below s, then manufacturing is requested to increase production to restock the inventory. Of course, vendors in the supply chain could also be contacted to supply more from outside sources as a precaution. If, however, manufacturing has production problems, it could send an alert to the inventory system to begin taking precautions to manage stocks, perhaps by buying product from an outside source such as a supply house. There could, of course, be a three-way interaction involving the ordering system. For the (s, S) inventory system, see Zheng and Federgruen (1991). There are definitely more sophisticated inventory management models and systems. Two are the Economic Order Quantity model (EOQ), which “minimizes the total holding costs and ordering costs in inventory management”,2 and Just In Time inventory management (JIT), which manages inventory in real time.3 Regardless of the inventory management system used, a predictive model is needed. I do not show this in Fig. 11.4 for simplicity, but I do show an interaction between the service status and the inventory system. I show the interconnections between a customer satisfaction survey measuring process and a marketing, sales, and dashboard process in Fig. 11.4. The customer satisfaction survey is actually administered and handled by the market research or business research department, which could be a division of the marketing organization.
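The EOQ formula and the (s, S) restocking rule mentioned here are easy to state in code. A minimal sketch with hypothetical demand and cost figures (none of these numbers come from the text, and the function names are mine):

```python
import math

def eoq(annual_demand, order_cost, holding_cost):
    """Economic Order Quantity: the order size minimizing the sum of
    annual ordering and holding costs: sqrt(2 * D * S / H)."""
    return math.sqrt(2 * annual_demand * order_cost / holding_cost)

def restock_decision(inventory, s, S):
    """Simple (s, S) policy: if stock falls to or below the reorder
    point s, order enough to return to the target level S."""
    return S - inventory if inventory <= s else 0

# Hypothetical: 10,000 units/year demand, $50 per order, $2/unit/year holding
q = eoq(10_000, 50, 2)                        # about 707 units per order
order = restock_decision(120, s=150, S=600)   # stock below s, so order 480
```

The same restocking check is what would trigger the alert to manufacturing described above.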
There is an interaction between a service order center, inventory management, manufacturing, and logistics/distribution. The IT department cannot be ignored in any system. This is key to data management. I depict flows of data in Fig. 11.4 from the data warehouse, managed by IT, to the data scientists department for the development of predictive models for average customer arrival rates and average service rates. The IT function also captures the ordering KPMs for input into the data warehouse and dashboards for executive management monitoring. What I just described is a highly simplified and abstract version of a multitude of interconnections and interactions in a real-world business complex system. The larger the enterprise, the more convoluted the interconnections and interactions. Understanding and documenting all the flows are what make designing a simulation study so difficult and challenging. And this is just for operations – basically, getting
2 See the Wikipedia article “Economic order quantity” at https://en.wikipedia.org/wiki/Economic_order_quantity. Last accessed January 20, 2023.
3 See “What Is Just In Time Inventory (JIT)?” at https://www.forbes.com/advisor/business/just-in-time-inventory/ for a simple discussion. Last accessed January 20, 2023.
the product out the door at the end of the day to satisfy a customer order. And there are many operational functions besides the ordering system. Logistics, inventory, and manufacturing are just three others that have complex flows of their own that, incidentally, intermingle with the ordering system that I just discussed.
11.3.2 Tactical Scale View

The tactical scale view is less concerned with operational issues and more with how to compete for customers, thwart or counter competitive actions, and position a product or product line in the market. It has a longer-term focus. This is where, for example, the 4-Ps of Marketing become important:

• Price
• Product
• Promotion
• Position
The details are different from those of the operational scale view. The operational view, recall, has more detail because of the vast number of interactions necessary to get a product out the door to a customer. I illustrate this tactical scale view in Fig. 11.5 using the same queueing framework of Fig. 11.3 but with price policy incorporated through a balking function. This function estimates the likelihood that a customer will leave without placing an order because of dissatisfaction with the product price and/or the wait time for product fulfillment. The balking function is a predictive model based on a logistic regression from Chap. 5. Simulations are used to test price points and waiting times. I show a more extensive version of a tactical scale view in Fig. 11.6 that incorporates more complexity through a product predictive choice model with product attributes. Simulations include price points but also different settings for the product attributes.
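A balking function of this kind is just a logistic model in price and waiting time. A minimal sketch with hypothetical coefficients (the intercept and slopes are placeholders for illustration, not estimates from Chap. 5 or any data in this book):

```python
import math

def balk_probability(price, wait_minutes, b0=-4.0, b1=0.02, b2=0.15):
    """Logistic model of the probability a customer leaves without
    ordering, as a function of price and waiting time. The
    coefficients b0, b1, b2 are hypothetical placeholders."""
    z = b0 + b1 * price + b2 * wait_minutes
    return 1.0 / (1.0 + math.exp(-z))

# Higher prices and longer waits should raise the balking probability
p_low = balk_probability(price=50, wait_minutes=5)
p_high = balk_probability(price=120, wait_minutes=20)
```

In a simulation, this probability would be evaluated for each arriving customer at the tested price point and current waiting time, with a random draw deciding whether the customer balks.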
11.3.3 Strategic Scale View

The strategic scale view is focused on the enterprise. The enterprise is the entire complex system. This scale view is the proverbial “big picture” in the sense that detailed operations and even tactical considerations are unimportant, but the long-run structure, future direction, and perhaps even survival of the enterprise are most important. The business planning cycle of Chap. 6 is a component of this view. This is not to say that detailed operations and tactics are or should be ignored. These become important when crises arise. The Southwest Airlines 2022 debacle is a good example. Executive management was concerned with the overall well-being of the company, until the winter storm and “an outdated computer system for crew scheduling” and a heavy “reliance on shorter, point-to-point flights, rather than a
Fig. 11.5 This is an example of a tactical scale view marketing problem. This builds on the ordering system in Fig. 11.3. The predictive models are for the random customer arrival rates and the service rates but with the addition of a balking function. (Figure content: estimated constant mean arrival and service rates; a joining function driven by waiting time, price point, and product ratings; net arrivals into immediate order processing or an order queue; order fulfillment; and KPM flows to marketing, sales, pricing, customer satisfaction, and a dashboard.)
‘hub and spoke’ model” used by its competitors4 put such a strain on reservations and flights that nearly the entire schedule of flights during the 2022 Holiday Season had to be canceled, stranding millions of paying passengers. The C-Level Team had to turn its attention to the everyday operations of flight schedules, customer anger management, personnel frustrations and overwork (not to overlook abuse by angry customers), and so forth to get the company through the crisis. This may be an extreme case, but the point is still that executive management does not always focus solely on the big picture.5
4 See https://www.npr.org/2022/12/27/1145616523/southwest-airlines-flight-cancellations-2022. Last accessed December 30, 2022.
5 See, for example, https://www.cnn.com/travel/article/southwest-flight-cancellations-wednesday/index.html for a description of operational issues that had to be handled. Last accessed December 30, 2022.
Fig. 11.6 This is an example of a more complex tactical scale view marketing problem. This builds on the ordering system in Fig. 11.3 and the marketing problem in Fig. 11.5. The predictive models are for the random customer arrival rates, the service rates, balking, and product take. (Figure content: estimated constant mean arrival and service rates; a joining function driven by waiting time, price point, and product ratings; net arrivals assessed against a customer profile and a purchase intent model with product attributes to give a predicted take probability; if Pr(Take) < threshold, a promotional offer is made before the customer leaves; if Pr(Take) > threshold, the order goes to immediate fulfillment or the order queue; ordering performance feeds financial contribution calculations and customer satisfaction.)
The C-Level Team usually focuses on strategies to create and maintain shareholder value, but they also work to ensure that operations and tactics align with the goals and strategies they establish. Sometimes, however, the oversight of operations falls short of what is needed, as evidenced by the Southwest Airlines issue. I illustrate a strategic scale view problem in Fig. 11.7. The predictive models of Part II are more aggregate models than disaggregate ones. Aggregate models are based on data representing large chunks of the enterprise: for example, total head count, total sales by categories, and so on. The simulations might be more system dynamics or non-stochastic simulations to gauge the impact on high-level, aggregate KPMs. Stochastic variations are of lower importance, but Most Likely views (the MLs) are important to bound a true ML view.
Fig. 11.7 This is an example of a strategic scale view problem
Chapter 12
Applications: Operational Scale View
This is the first of three chapters that contain examples of the melding of predictive and simulation analytics. The perspective for these examples varies by the scale view of a decision-maker. Since the scale views can be many and varied, I chose to use the three I mentioned many times in earlier chapters: operational, tactical, and strategic. This chapter focuses on the operational. The examples are only meant to illustrate my ideas. For each one, I will begin with a discussion of a simulation and then introduce predictions into the simulation framework.
12.1 Application I: A Queueing Problem

A common operational application of a simulation is a queueing problem. Just a simple perusal of the tutorials for the Winter Simulation Conference supports this.1 A queue is a waiting line.2 This could be customers waiting to be served (e.g., bank, fast-food restaurant), jobs in a job shop, or stages in order fulfillment, to mention a few examples. Customers who see a long line (i.e., the queue) may balk and go elsewhere. The existence and length of a queue have definite implications for a business. For example, a price elasticity study suggests a price decrease to stimulate orders. But fulfillment becomes an issue as customers have to wait for their order (they are in a queue), resulting in dissatisfied customers. More capacity and personnel will be needed to handle the increased orders. These are the I&R of a business decision.
1 An archive of tutorials and presentations for the conference is available at https://informs-sim.org/.
2 Trivia: “queue” is one of the few words with four consecutive vowels. “Queueing” (correctly spelled) is one of the few words with five consecutive vowels.
How a queue is handled, and what issues must be addressed, depends on a manager’s scale view. At the operational level, assigning resources to meet queue members’ needs and avoid or minimize customer complaints is paramount. Customers already in a long line (i.e., the queue) may become frustrated, renege, and leave. This is a day-to-day issue involving, perhaps, a single organization or area. At a tactical level, the interaction between a queue in, say, the ordering system and a downstream operation, say production, becomes paramount. These issues may not be day-to-day but instead span longer periods and involve interorganizational coordination. This is now a system problem. I will discuss this in Chap. 13.
12.1.1 Anatomy of a Queue

A queue has a structure consisting of:

Arrivals Customers who expect to be served, buy something on the spot (in a spot or cash market), or place an order for later delivery; or jobs to be completed in a job shop. Whatever the situation, I will refer to them as agents or objects. An arriving agent is a random discrete event following a probability distribution usually assumed to be a Poisson distribution with mean arrival rate λ. The time between arrivals is exponentially distributed with mean time 1/λ.3

Servers Agents who wait on or serve an arrived customer. These could be physical servers such as waiters in a restaurant or tellers in a bank; machines such as kiosks for placing a lunch order in a fast-food establishment or an ATM at a bank or other nonbank location; or a shopping cart in an online ordering system. From the business point of view, these are resources they provide. At any moment, they are either busy serving a customer or are free to serve the next customer. These two possibilities are states for the server. Any service location may have one or more servers. Most beginning development and application of queueing theory assumes a single server. There is only one server at a fast-food location with a drive-up order window. An ATM in a grocery store is one server. There could also be several servers. A bank would have several tellers (although only one could be open so there is still just one server); a supermarket could have several checkout stations; an online ordering platform would have (what appears to be) an infinite number of servers (i.e., the shopping cart).
3 The exponent of the exponential distribution is negative, so the distribution is sometimes referred to as the negative exponential distribution. As noted by Cooper (1981, p. 42), the adjective “negative” is often omitted.
Queue The waiting line itself that a customer may or may not join after arriving. This queue has two states: it is either empty or occupied. If a customer arrives for service and finds that the server is busy, then he/she enters the queue; the queue, as a line, lengthens. However, if the server is not busy, then the customer is immediately served and the queue length is zero. So the queue length is zero if the server is not busy; greater than zero, otherwise. A queue has two characteristics: its average length and the average time someone has to wait in the queue. These are related.

Queue Discipline A rule to move a customer from the queue to service. For a single-server system, the most logical, and probably equitable, rule is first-in-first-out (FIFO) in which customers are served in the order of their arrival. Other rules are a random selection, priority, and last-in-first-out (LIFO), which is rare (and which could lead to confrontations). Srivastava (2003) notes that a priority rule may cover two classes: a priority class (with gradations or levels of priority) and a nonpriority class. There could also be a preemptive priority discipline in which “a lower-priority unit is taken out of service whenever a higher-priority unit arrives, the service on the preempted unit resuming only when there are no higher-priority units in the system. Contrary to this is the non-preemptive, or the head-of-the-line, priority rule, in which priorities are taken into account only at the commencement of service, and once started, the service is continued until completion.” See Srivastava (2003). I will assume a FIFO discipline below.

Departures The agents who have been served and leave the system. Departure is a random event. As a customer departs after being served, the next customer is either in the queue, in which case the queue decreases, or a newly arrived customer is served.
How long a customer takes to be served is the service time, a random variable assumed to follow a probability distribution, usually an exponential distribution with mean service time 1/μ, where μ is the mean service rate. The service time is a random variable because each customer has his or her issues that require varying amounts of time for the server to handle. Service is not mechanical, but customer-centric. I illustrate a general queueing model process in Fig. 12.1. The major assumptions are that the arrival rate of customers and the service time are random variables. This immediately implies that the queue length and the time spent in the queue, the waiting time, are also random variables since they are functions of the arrivals and service. The arrivals are assumed to be randomly drawn from a Poisson distribution and are said to follow a Poisson process. A stochastic process is Poisson if the events occur individually and independently at random moments, but collectively occur at an average rate per unit time. The average rate λ is the number of events per unit time, or λ = 1/(mean time between events). For example, assume that an online store has 86,400 orders per month on average. A month has 43,200 min (= 60 × 24 × 30), which implies that there are, on average, 2 orders per minute (= 86,400/43,200), so λ = 2 per minute. This is equivalent
Fig. 12.1 This is a schematic diagram of a basic queueing process. I assume a FIFO queue discipline. (Figure content: a customer arrives and checks the server; if the server is empty, the customer enters service; if busy, the customer enters the queue, advances in the queue as the server empties, and enters service when next in line; after service, the customer departs the system.)
Fig. 12.2 This is the Python script to generate the graphs of the Poisson and exponential distributions in Fig. 12.3. The parameter λ is set to λ = 2 and λ = 9
to 0.5 min (or 30 s) between two orders (= 1/2): 30 s between each order. So the interarrival time, the time between orders, is 1/λ = 0.5 min. The assumption for servicing each customer once they enter service is that the rate is also Poisson distributed. This means that the time between customers being serviced is exponentially distributed. This is the service time. If μ is the mean rate at which customers enter service, then 1/μ is the mean service time. For example, if two customers are served every minute on average (2 per minute), then the average service time per customer is 30 s (= 1/2 min). The Poisson distribution and the exponential distribution are related; the two descriptions are equivalent: basically, two sides of the same coin. I plot both functions in Fig. 12.3 with the scripts for both in Fig. 12.2. See this chapter’s Appendix for how they are related. The Poisson probability mass function (pmf) is:
Fig. 12.3 These are graphs of the Poisson and exponential distributions from the Python script in Fig. 12.2. Notice how the distributions change. (Figure content: four panels over time t, the Poisson probability mass function, pmf = e^(−λ) λ^t / t!, and the exponential probability density function, pdf = λ e^(−λt), each plotted for λ = 2 and λ = 9.)
p(x) = λ^x e^(−λ) / x!   (12.1)

and the exponential probability density function (pdf) is:

f(x) = λe^(−λx).   (12.2)
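Equations (12.1) and (12.2) can be checked numerically against SciPy's implementations. A minimal sketch, assuming SciPy is installed, using the λ = 2 orders-per-minute example:

```python
import math
from scipy.stats import poisson, expon

lam = 2.0  # mean arrival rate: 2 orders per minute

# Poisson pmf, Eq. (12.1): probability of x arrivals in one minute
for x in range(4):
    direct = lam**x * math.exp(-lam) / math.factorial(x)
    assert abs(poisson.pmf(x, lam) - direct) < 1e-12

# Exponential pdf, Eq. (12.2): density of the interarrival time t;
# note that SciPy parameterizes by scale = 1/lambda
t = 0.5  # the mean interarrival time, 30 seconds
direct = lam * math.exp(-lam * t)
assert abs(expon.pdf(t, scale=1 / lam) - direct) < 1e-12
```

The `scale = 1/λ` convention is the source of the parameterization confusion discussed below.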
These are for arrivals. The service rate is also assumed to be exponential with a mean service rate μ so that the mean service time is 1/μ. Exponentially distributed random interarrival times can be generated in Python using the Numpy package and the exponential function in that package with the scaling factor 1/λ. I will show an example below. The exponential distribution is based on two parameters as are most distributions. The parameters are the location and the scale. The location is the minimum value for the distribution that is necessary to specify since the distribution has to decline exponentially from somewhere; the location is that point. This location could be
positive or negative, but for times, a positive number is the most logical. A positive location shifts the distribution to the right; negative to the left. The default is zero in SciPy. For a service time interpretation, this says zero minutes are spent servicing a customer. You may want to specify a positive value if you believe (or require as business policy) that a minimum amount of time is required for an activity (e.g., serving a customer); otherwise, use the default. The scale gives the speed of decline of the distribution; how quickly it asymptotically approaches the X-axis. Notice in the lower right panel of Fig. 12.3 for λ = 9 that the curve almost immediately goes to zero. The distribution declines and the shape changes. It shifts to the left and stretches out to the right, having a more elongated right tail; that is, it becomes more right-skewed so that large values have more chance of being observed. The inverse of the rate parameter is the mean (i.e., expected value) of the distribution and the inverse of the squared rate is the variance. If λ is the rate parameter and X is the exponentially distributed random variable, then E(X) = 1/λ and V(X) = 1/λ². See the Appendix for this chapter for derivations. In SciPy, the scale is 1 by default. The actual value used is β = 1/λ > 0. This is sometimes confusing, but SciPy expects β.
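A quick check of this parameterization, assuming SciPy (the variable names are mine):

```python
from scipy.stats import expon

lam = 2.0                        # rate: 2 events per minute
rv = expon(loc=0, scale=1 / lam) # SciPy expects beta = 1/lambda

# E(X) = 1/lambda and V(X) = 1/lambda**2
assert abs(rv.mean() - 1 / lam) < 1e-12       # 0.5 minutes
assert abs(rv.var() - 1 / lam**2) < 1e-12     # 0.25 minutes squared
```

Passing the rate itself as `scale` is a common mistake; the two checks above fail immediately if `scale=lam` is used instead.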
12.1.2 A Note on the Exponential Distribution in Python

There is only one exponential distribution, but there are two implementations in Python: Numpy’s exponential and random’s expovariate. The underlying formula is the same in each package: it is (12.2). However, random.expovariate “uses the ‘rate’ or ‘inverse scale’ as the parameter, . . . while” numpy.random.exponential “uses the scale as the parameter. The parameters of the two versions are just inverses of each other.”4 The advantage of using numpy.random.exponential is the specification of the number of elements to return: it has a size argument.
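A sketch demonstrating that the two parameterizations are inverses of each other (sample means will only approximate 1/λ):

```python
import random
import numpy as np

lam = 2.0      # the rate parameter
n = 200_000    # sample size

# numpy takes the SCALE (1/lambda) and has a size argument
np_draws = np.random.default_rng(42).exponential(scale=1 / lam, size=n)

# random.expovariate takes the RATE (lambda), one draw at a time
random.seed(42)
rnd_draws = [random.expovariate(lam) for _ in range(n)]

# Both samples should have a mean near 1/lambda = 0.5
assert abs(np_draws.mean() - 1 / lam) < 0.01
assert abs(sum(rnd_draws) / n - 1 / lam) < 0.01
```

Either draw can serve as a random interarrival time in the simulations that follow; the `size` argument simply makes it convenient to generate all the interarrival times at once.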
12.1.3 Queue Performance Measures

I noted before that businesses have KPMs for all levels of the business. What are the interesting KPMs of a queueing system? Some possibilities are:

• Average time in the system (W)
• Average time in the queue (i.e., average waiting time: Wq)
• Average number in the system (L)
4 Source: https://stackoverflow.com/questions/48066370/diff-between-numpy-random-exponential-and-random-expovariate. Last accessed November 26, 2022.
• Average number in the queue (i.e., average queue length: Lq)
• Proportion of time a server is empty
12.1.4 Restriction on Arrival and Service Rates

There is a restriction on the arrival and service rates. This is sometimes called the utilization factor represented by ρ. If λ is the average arrival rate and μ is the average service rate, then

ρ = λ/μ < 1.   (12.3)
This condition must hold; otherwise, the system could grow infinitely large since more customers would enter than leave. The analogy is a sink: if water flows in from the faucet at a faster rate than the water leaves through the drain, then the sink will overflow. The following KPMs can be derived for a simple queueing system given the utilization factor:

• Mean queue length: LQ = λWQ = ρ²/(1 − ρ)
• Mean number in system: L = λW (known as Little’s Law)
• Mean waiting time in system: W = WQ + 1/μ
• Mean waiting time in queue: WQ = LQ/λ
• Proportion of time server is empty: 1 − ρ
See Gross and Harris (1974), Kleinrock (1975), Cooper (1981), and Ross (2014). Also see Little (1961) for many of the original queueing model result derivations.
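The KPM formulas above can be collected into a small helper. A sketch (the function name and example rates are mine, not from the text):

```python
def mm1_metrics(lam, mu):
    """KPMs for an M/M/1 queue with arrival rate lam and service rate mu.
    Requires the stability condition rho = lam/mu < 1 of Eq. (12.3)."""
    rho = lam / mu
    if rho >= 1:
        raise ValueError("Unstable system: lam/mu must be < 1")
    Lq = rho**2 / (1 - rho)   # mean queue length
    Wq = Lq / lam             # mean waiting time in queue
    W = Wq + 1 / mu           # mean waiting time in system
    L = lam * W               # mean number in system (Little's Law)
    return {"rho": rho, "Lq": Lq, "Wq": Wq, "W": W, "L": L,
            "P(empty)": 1 - rho}

# Example: 2 arrivals per minute, served at 3 per minute
m = mm1_metrics(2, 3)   # rho = 2/3, L = 2, W = 1 minute
```

Note how sensitive the measures are as ρ approaches 1: with λ = 2.9 and μ = 3, the mean queue length jumps to about 28 customers.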
12.1.5 Queueing Theory Notation

Queueing theory has a very definite, well-established, and widely accepted notation, due to D. G. Kendall, to indicate the components of a queue.5 A generic notation structure is Arrival/Service/Channels. For a single-server queue, this is M/M/1. I describe only the basic notation here.

M → Markovian or memoryless Poisson (or random) arrival process (i.e., exponential interarrival times)
M → Markovian or memoryless exponential service time
c → The number of service channels or servers

My example below is for an M/M/1 queueing system.
5 See https://en.wikipedia.org/wiki/Kendall%27s_notation for a comprehensive notation list.
12.1.6 Illustrative Queueing Process

Figure 12.4 illustrates a general M/M/1 queueing system for five objects that could be customers, orders in a fulfillment system, records to be processed, and so on. Each object passes through four states:

1. Arrival into the system
2. Wait in a queue
3. Service
4. Departure from the system
A clock keeps track of when each object enters or passes through each state. It is a bookkeeping system recording when an object enters and leaves a state. For example, the clock will record when an object enters the system, enters a queue, enters service, and departs the system. It is necessary to record when each event occurs so times, such as time waiting in a queue, in service, and in the system (i.e., time in the queue plus time in service), can be calculated. I show an example of times in Table 12.1. The clock only records arrivals into the system when the event of an arrival occurs. It could, however, record each minute (or other appropriate time, say second) whether or not an event occurred. In short, the clock could be event-driven or not. If it is not event-driven but records each time point whether or not an event occurs, then the data set could potentially become extremely large with many time
Fig. 12.4 This illustrates an M/M/1 queueing process. The down arrows next to the numbered blocks indicate an arrival. The other arrows indicate the direction of movement: queue to service to departure
12 Applications: Operational Scale View
Table 12.1 These are illustrative event times for the queueing system in Fig. 12.4. The times, in minutes, are approximate. The arrival, “when,” and departure times are clock times; all others are duration times in minutes

Object | Arrival time | When enters queue | When enters service | Wait time in queue | Service time | Departure time | In-system time
0      | 0            | 0                 | 0                   | 0                  | 5            | 5              | 5
1      | 1            | 1                 | 5                   | 4                  | 5            | 10             | 9
2      | 8            | 8                 | 10                  | 2                  | 5            | 15             | 7
3      | 15           | 15                | 15                  | 0                  | 8            | 23             | 8
4      | 17           | 17                | 23                  | 6                  | 2            | 25             | 8
5      | 23           | 23                | 25                  | 2                  | 5            | 30             | 7
points without any arrival data; that is, it will have a lot of missing data. The size of the data set will certainly affect data processing and calculations just because of the sheer volume of data. An event-driven clock, of course, is more efficient to manage, and the resulting data set will be smaller. Most applications use an event-driven clock. See Gross and Harris (1974, p. 392), who refer to these as time-oriented bookkeeping and event-oriented bookkeeping. I will use the latter.
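The event-oriented bookkeeping behind Table 12.1 can be sketched in a few lines of Python. This is a sketch of the bookkeeping logic only, not the book's script; the arrival and service times are taken directly from the table.

```python
# Event-oriented bookkeeping for the queue in Table 12.1 (all times in minutes).
arrivals = [0, 1, 8, 15, 17, 23]
services = [5, 5, 5, 8, 2, 5]

records = []
prev_departure = 0
for obj, (arr, svc) in enumerate(zip(arrivals, services)):
    enters_service = max(arr, prev_departure)  # wait if the server is still busy
    wait = enters_service - arr                # time in the queue
    departure = enters_service + svc
    in_system = departure - arr                # queue time plus service time
    records.append((obj, arr, enters_service, wait, svc, departure, in_system))
    prev_departure = departure

for rec in records:
    print(rec)
```

Running this reproduces the wait, departure, and in-system times shown in Table 12.1.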
12.1.7 Determining the Interarrival and Interservice Rates

In most textbook examples of queueing theory, the values for λ and μ are simply stated without further discussion. In addition, those stated values are fixed for the example. In practice, however, they must be either calculated or assumed. Calculated values are preferred, and those calculations must be based on sample data. In addition, their constancy is a problem since values calculated from sample data have a range of indeterminacy; that is, they have confidence or prediction intervals that should be accounted for or included in any queueing analysis. How are they calculated, and how are the prediction intervals included? I will focus on λ; the calculations and prediction intervals are similar for μ. The most obvious way to calculate λ is to first collect data on arrivals at each period for the system being modeled, say each minute, then aggregate the data to a higher level if necessary, and finally calculate the average of the aggregated data. As an example, the Python script in Fig. 12.5 generates a data set of observations by minute for 5 days. I chose to collect (i.e., artificially generate) data by minute just for illustration. For each minute, I randomly set a flag if a customer arrives or not: 1 if a customer arrives, 0 otherwise. The average number of arrivals per minute is 9 with a standard deviation of 2.79. The average time between arrivals is 1/9 = 0.111 min till the next customer.
Fig. 12.5 This is a Python script to generate artificial data for arrivals. The average number of arrivals is 9 per minute. I use this number in other examples
The fact that the arrival rate, λ, is estimated by a sample average may be surprising. It turns out that for a Poisson process, the maximum likelihood estimator of λ is X̄. See the Appendix for a proof.
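The script in Fig. 12.5 is not reproduced here, but the idea can be sketched as follows: generate arrival counts per minute for five days and estimate λ with the sample mean, the maximum likelihood estimator. The rate of 9 per minute follows the text; the generator details and seed are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(12345)
minutes = 5 * 24 * 60                        # five days of per-minute observations
arrivals = rng.poisson(lam=9, size=minutes)  # arrival counts per minute

lam_hat = arrivals.mean()                    # MLE of lambda for a Poisson process
interarrival = 1 / lam_hat                   # average time between arrivals (minutes)
print(round(lam_hat, 3), round(interarrival, 3))
```

The estimate is close to 9 arrivals per minute, and the implied interarrival time is close to 1/9 min.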
12.1.8 Queueing Example

I show a Python script in Fig. 12.6 for the calculations of a queueing system. I used the average number of arrivals per minute from the data I generated using the script in Fig. 12.5. The output in Fig. 12.7 is divided into six parts:

Customer ID Number: This is comparable to the ID number in Fig. 12.4. It indicates the customer's action: arrives, joins the queue, gets served, and departs.

Clock Time: When an action begins. The clock starts at 0.0 for customer 0. For customer 1, the clock starts from the point customer 0 arrives.

Arrival Time and Interval:
Fig. 12.6 This is a Python script to illustrate queueing calculations. This script uses the average number of arrivals per minute (λ = 9) from Fig. 12.5
Fig. 12.7 This is the output from the queueing script in Fig. 12.6. The random seed was set at 45. Notice that the second customer joins the queue because the first customer’s service was not completed. This second customer has NaN values for its service. The clock time of each person is the arrival time of the previous one since the interarrival time will be added to the clock
Customer 0's arrival time interval is 2.85 min after the clock starts at 0.0. This customer's arrival time is, of course, 2.85 min. Customer 1's arrival time interval is 0.76 min after customer 0 arrives, so that arrival time is 3.62 (= 2.85 + 0.76).

Queue Measures: Customer 0, as the first customer, immediately enters service, so that customer does not join the queue and the queue length is, therefore, zero. Customer 1 enters the queue at 3.62 when he arrives. The queue length is now 1.

Service Times: Customer 0 enters service immediately upon arrival at 2.85 min. The service time for that customer is 2.67 min. Customer 1 entered the queue at 3.62 min since customer 0 is being served.

Departures and Total System Time: Customer 0 departs at 5.53 min (= 2.85 + 2.67) with a total in-system time of 2.67 min (= service time). Customer 1 enters service at 5.53 when customer 0 departs. The queue time is 1.90 min (= 5.53 − 3.62) and the queue length drops to zero. That customer is in service for 1.65 min and departs at 7.18 (= 5.53 + 1.65) min with an in-system time of 3.56 (= 7.18 − 3.62) min. All the other customers follow a similar pattern.
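The script in Fig. 12.6 is not reproduced here, but the core M/M/1 bookkeeping it performs can be sketched as below. The arrival rate λ = 9 follows the text; the service rate μ = 12 per minute and the seed are my assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(45)
lam, mu, n = 9, 12, 10           # arrival rate, service rate (assumed), customers

clock = 0.0                      # clock advances by exponential interarrival times
prev_departure = 0.0
rows = []
for cust in range(n):
    clock += rng.exponential(1 / lam)          # interarrival time added to the clock
    arrival = clock
    enters_service = max(arrival, prev_departure)  # wait if the server is busy
    service = rng.exponential(1 / mu)
    departure = enters_service + service
    rows.append((cust, arrival, enters_service - arrival, service, departure))
    prev_departure = departure

for row in rows:
    print(row)   # (id, arrival, wait in queue, service time, departure)
```

Each customer's departure time becomes the earliest possible service start for the next customer, exactly the pattern walked through above.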
12.1.9 A Critical Assumption

You should quickly notice that the average number of arrivals for this example is assumed to be constant at 9 per minute. If you used this queueing example to predict the key queue statistics, you would do so under this assumption. This is just the constant mean prediction model of Chap. 3. The mean number of arrivals per minute tomorrow, Y_T(1), is just 9 per minute. The same holds for Y_T(2). This means that the advantages and disadvantages of that model apply to this queueing problem. For a short-term operational view of the queue, in which the average number of arrivals per period tomorrow is most likely the same as the average number today, this assumption may suffice. From a longer-term perspective, however, a term which certainly depends on the problem, this assumption is not only naive but also potentially dangerous. It is dangerous because the results may under- or overestimate actual results, leading to incorrect resource allocations. The fact that the standard deviation of the arrivals per minute is 2.79 (see Fig. 12.5) strongly suggests that this may be the case. One way to handle this problem is to model the arrival rate using, perhaps, the linear trend model from Chap. 3. You could then predict one or two steps ahead (i.e., Y_T(1) and Y_T(2)) and compare results to determine if there are any significant changes in the key queueing statistics such as average wait times. Another way to handle the problem is to run a stochastic simulation of the basic queueing script in Fig. 12.6 in which the average number of arrivals per minute is randomly selected from the prediction interval for the constant mean model. That interval, from Chap. 3, is 9 ± 1.96 × 2.79 (or 3.5 to 14.5), where I assumed that T = 0.
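The stochastic simulation just suggested, with λ drawn from the prediction interval (3.5 to 14.5) on each run, can be sketched as follows. The uniform draw over the interval, the service rate μ = 16 (kept above the largest possible λ for stability), and the run counts are my assumptions, not the book's script.

```python
import numpy as np

rng = np.random.default_rng(45)

def run_queue(lam, mu, n=200, rng=rng):
    """One run of the M/M/1 bookkeeping; returns the mean time in the queue."""
    clock, prev_departure, waits = 0.0, 0.0, []
    for _ in range(n):
        clock += rng.exponential(1 / lam)              # arrival clock time
        enters_service = max(clock, prev_departure)
        waits.append(enters_service - clock)           # time waiting in the queue
        prev_departure = enters_service + rng.exponential(1 / mu)
    return float(np.mean(waits))

kpms = []
for run in range(100):
    lam = rng.uniform(3.5, 14.5)   # arrival rate drawn from the prediction interval
    kpms.append((run, lam, run_queue(lam, mu=16)))

mean_wait = np.mean([k[2] for k in kpms])
print(round(mean_wait, 3))
```

Each run's mean wait is a KPM; summarizing the runs shows how sensitive the queue is to the indeterminacy in λ.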
12.1.10 Queueing Simulation

A basic simulation involves running the queueing function in Fig. 12.6 numerous times, each time being a run or iteration, terms I use interchangeably. For each run, the queue output is collected and descriptive statistics are calculated. The means of the queue KPMs are extracted from the descriptive statistics and stored in a DataFrame along with the run number. I show a Python script in Fig. 12.8 to do this. Part of the script also plots three main KPMs, each versus the run number, as well as a scatter plot of the time in the queue versus the time in service. I display the script output and graphs in Fig. 12.9. Notice that the scatter plot seems to have an exponential pattern: the longer the service time, the exponentially longer the wait time due to more customers arriving. I estimated a regression model of the time in the queue as a function of the time being served. The Null Hypothesis is that there is no relationship; the Alternative Hypothesis is that there is a positive one. This Alternative Hypothesis is suggested by
Fig. 12.8 This is a Python script to implement a queueing simulation
the graph in the lower right of Fig. 12.9, but it should also be intuitively obvious based on your everyday experience at banks, grocery stores, and gas stations. Figure 12.9 also suggests an exponential relationship, which makes sense because of the underlying exponential distributions. Consequently, I used a natural log transformation of the two variables to estimate a linear relationship. As a side benefit of using the natural log transformation, the estimated coefficient for the time-in-service variable is an elasticity. See Paczkowski (2018) for an explanation. The elasticity provides an appealing interpretation in terms of percentage changes in both variables. I show the regression setup and results in Fig. 12.10. Notice that the coefficient for the log of time in service is positive and significant. But also notice that the estimated coefficient is 1.8314. This implies that if the time in service increases 1%, then the time customers have to spend in the queue increases, on average, by 1.8%. This is a substantial amount of time, which could cause customer dissatisfaction.
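As a self-contained illustration of the log-log elasticity idea (not the book's Fig. 12.10 regression, which uses the simulated KPMs), synthetic data with a built-in elasticity of 1.8 recover the coefficient from an OLS fit; the data-generating numbers here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
log_service = rng.normal(0.0, 0.5, size=n)                    # log time in service
log_queue = 0.3 + 1.8 * log_service + rng.normal(0, 0.05, n)  # log time in queue

# The OLS slope of log(queue time) on log(service time) is the elasticity:
slope, intercept = np.polyfit(log_service, log_queue, 1)
print(round(slope, 2))  # a 1% rise in service time raises queue time about 1.8%
```

Because both variables are in natural logs, the slope is directly interpretable as a percentage-change effect.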
Fig. 12.9 This is the output for the Python script in Fig. 12.8
12.2 Application II: Linear Programming Problem A classic approach to an operational problem of resource allocation is linear programming (LP), and, of course, its many variants such as nonlinear, integer, and mixed programming. Linear programming is a method to locate the best outcome (e.g., maximum profit) using a mathematical model of an objective function (representing the desired outcome) and linear constraints on the inputs to that function. See Hadley (1962) for an older but excellent and thorough treatment of the principles of linear programming. Also, see Baumol (1965) for an excellent
Fig. 12.10 This shows the regression setup and results for the time in queue regressed on the time in service. The data are from the simulation using Fig. 12.8
Table 12.2 These are the prices, costs, and net contributions per unit for the car floor mats and trunk liners. The net contribution is Market Price − Cost per Unit. This is a per-unit number

Product     | Market price | Cost per unit | Net contribution
Car mat     | $65          | $26           | $39
Trunk liner | $95          | $75           | $20
overview from an economic and business perspective, as well as Liebhafsky (1968) for a purely economic treatment. As an example, consider the firm in Chap. 4 that produces car mats. It can produce car mats and trunk liners using one robotic machine merely by switching a template, which is done by the operator via a computer terminal. The output of each is QM and QL for the mats and liners, respectively. The market price for the mats is $65 and $95 for the liners because the liners are larger and more intricate. The costs per unit are due to the amounts of rubber, a nonstick coating for easy cleaning and maintenance, and a color dye (black, grey, tan). These vary by product and so are variable costs.6 I show these prices and the associated variable costs and contribution margins in Table 12.2. A simple constant mean daily demand model projects that the daily orders for car mats are 700 units and the orders for liners are 500 units. But these are upper bounds. Experience has shown that orders for each product are usually less. The robotic machine can produce at most 800 units per day, regardless of the type of product. This is the production function. Because of the use of this production robot, the combined number of mats and liners must be no more than 800. The operational problem is to assign the robotic machinery to produce the car mats and trunk liners so that the daily net contribution is a maximum. So the objective is to maximize the total contribution subject to the constraint that the mats and liners do not exceed 700 and 500, respectively, while having the robotic machine produce at capacity. The constraints are expressed as a system of inequalities:

QM + QL ≤ 800    (12.4)
0 ≤ QM ≤ 700    (12.5)
0 ≤ QL ≤ 500    (12.6)
See Hadley (1962, p. 3) for a brief discussion of why the inequality is used. The objective function to maximize is total net contribution: NetC = $39 × QM + $20 × QL. I show a graph of this optimization problem in Fig. 12.11. In the QL-QM space, there is an upper limit of 500 for QL and an upper limit of 700 for QM. These limits immediately carve out a rectangular region that is 500 × 700. This reduced space is further subdivided by the production function that has a maximum
6 See http://www.truck-and-car-floor-mats.com/Rubber-Car-Floor-Mats.html for some information about car floor mats. Last accessed December 8, 2022.
Fig. 12.11 This illustrates the constrained solution to the LP problem
QL of 800 when QM = 0, and similarly for QM. The shaded region is the Feasible Region where a solution can occur; all three constraints are met in this region. The objective function can be moved parallel to itself, but it reaches a maximum at the point shown. This gives the solution for QL and QM: QL = 100 and QM = 700. The total contribution is $29,300. The condition in (12.4) can be expressed in matrix notation as

[1 1] × [QM QL]′ ≤ 800
or A_ub x ≤ b_ub, where the subscript ub means "upper bound." You could have several equality constraints, in which case you would have A_eq x = b_eq, where the subscript eq means "equality." The objective function is c′x.7 You must specify the bounds for the outputs. In my example, the outputs are all positive numbers; negative amounts would not make intuitive (or physical) sense. In practical problems, you might have negative bounds. To specify these, you need to
7 I adopted the SciPy notation to make it easier to read the SciPy documentation.
indicate the acceptable minimum and maximum values, but there are various ways to do this. The SciPy documentation notes that you can use:

• None to indicate no bound
• (0, None) as a default so that all decision variables are nonnegative
• A single tuple (min, max) so that min and max are bounds for all decision variables

For this problem, the bounds are [0, 700] and [0, 500] for mats and liners, respectively. Finally, you need to specify a method for actually doing the calculations. The oldest method for solving a linear programming problem is the simplex method. See Hadley (1962) for a description of this method. Many methods have been developed to determine the optimal solution to an LP problem. Some are:

• Interior-point (Python's SciPy package's default)
• Revised simplex
• Simplex (Python's SciPy package's legacy method)

The revised simplex is popular in economic research because it is intuitively appealing to economists based on how they rationalize a maximization problem.8 SciPy has powerful functionality in this area, but it has two implementation conditions that you must be aware of:

1. The problem must be a minimization problem. If you have a maximization problem, simply multiply the objective coefficients by −1. Do NOT multiply the constraints by −1. The optimal objective value will then have a negative sign, so simply multiply it by −1.
2. The variable bounds must not be unbounded. You can specify them as:
   (a) X ≥ 0 as [ 0, None ]
   (b) −∞ ≤ X ≤ ∞ as [ None, None ]
   (c) General form: [ Lower Bound, Upper Bound ]

I show the setup for the LP example in Fig. 12.12. The solution is in Fig. 12.13. Notice that this agrees with the one I show in Fig. 12.11. The car mats are produced at the projected daily demand, while the liners are slightly below their projected demand because the mats make a larger contribution. Simulation and optimization can be combined in a simulation optimization model when there is uncertainty and risk associated with the inputs.
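The setup in Fig. 12.12 is not reproduced here; a minimal sketch of the same LP with SciPy's linprog follows. Note the sign flip on the objective coefficients because linprog minimizes.

```python
from scipy.optimize import linprog

# Maximize 39*QM + 20*QL subject to QM + QL <= 800, 0 <= QM <= 700, 0 <= QL <= 500.
# linprog minimizes, so negate the objective coefficients.
res = linprog(
    c=[-39, -20],              # negated net contributions per unit
    A_ub=[[1, 1]],             # robotic machine capacity constraint
    b_ub=[800],
    bounds=[(0, 700), (0, 500)],
)
qm, ql = res.x
net_contribution = -res.fun    # flip the sign back to get the maximum
print(qm, ql, net_contribution)
```

The solution matches Figs. 12.11 and 12.13: QM = 700, QL = 100, and a net contribution of $29,300.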
Some input variables will be uncertain, so we now have uncertain variables whose values are exogenously determined by random processes. These values are drawn from a probability distribution, so a Monte Carlo simulation can be used to study the effects of random inputs. As examples, the price points for inputs may fluctuate daily based on market conditions; the amount of the raw material for a production process may fluctuate daily

8 See https://en.wikipedia.org/wiki/Revised_simplex_method for a description. Last accessed September 22, 2021.
Fig. 12.12 The linear programming example for the car mats and liners, displayed in Fig. 12.11, is solved using this setup. The solution is in Fig. 12.13
Fig. 12.13 This is the solution to the linear programming example in Figs. 12.12 and 12.11. Notice the maximum net contribution
based on supply chain random shocks; production itself may vary randomly due to random mechanical shocks and downtimes or routine robotic health maintenance; and finally, customer demand may vary due to market conditions and reported news about economic, social, and weather-related conditions. The inputs for the LP optimization are positive numbers, but they often have skewed distributions. Prices are an example. Coad (2009) argues that product prices are right-skewed because there is an association between income, which is known to be right-skewed, and prices (as well as quality). The generally skewed distributions suggest using a log-normal distribution for the random draws for the simulation. Coad (2009) argues that the right-skewed price distributions he studied have fatter tails than what is suggested by a log-normal distribution. The log-normal, however, is typically used for skewed distributions because the (natural) log transformation reduces the effect of large or extreme values. If a random variable X ∼ N(μ, σ²), then Y = e^X is log-normally distributed; equivalently, ln(Y) ∼ N(μ, σ²). If Y ∼ N(μ, σ²), then the pdf for Y is

f(y) = (1/√(2πσ²)) × e^{−(y−μ)²/(2σ²)}    (12.7)

If Y is log-normally distributed, then

f(y) = (1/(y√(2πσ²))) × e^{−(ln(y)−μ)²/(2σ²)}    (12.8)
Some properties of the log-normal distribution from Hill et al. (2008) are:

• The pdf parameters μ and σ² are the mean and variance, respectively, of ln(Y).
• The median of Y is m = e^μ.
• The mean is μ = ln(m).
• E(Y) = e^{μ+σ²/2}.
• V(Y) = e^{2×μ+σ²} × [e^{σ²} − 1].
• The mode is m/e^{σ²}.
• Mean > Median > Mode.
An additional property of the log-normal distribution is that the random variables are always positive.9 This follows because, by definition, if Y is log-normal, then X = ln(Y) is normally distributed. But then Y = e^X, the inverse. The exponential function, however, only returns positive values. So this implies that the log-normal only has positive values.10
9 See https://en.wikipedia.org/wiki/Log-normal_distribution, last accessed December 15, 2022.
10 See https://probabilityandstats.wordpress.com/tag/lognormal-distribution/. For some nice simple proofs, see https://math.stackexchange.com/questions/1220729/proving-positivity-of-the-exponential-function. Both were last accessed December 15, 2022.
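The listed moments can be checked by simulation; this sketch uses assumed parameters μ = 0 and σ = 0.5 for the underlying normal distribution.

```python
import math
import numpy as np

rng = np.random.default_rng(99)
mu, sigma = 0.0, 0.5   # parameters of the underlying normal (assumed for illustration)
y = rng.lognormal(mean=mu, sigma=sigma, size=1_000_000)

# Closed-form log-normal properties to compare against the simulated draws:
mean_formula = math.exp(mu + sigma**2 / 2)                            # E(Y)
var_formula = math.exp(2 * mu + sigma**2) * (math.exp(sigma**2) - 1)  # V(Y)
median_formula = math.exp(mu)                                         # median of Y

print(round(y.mean(), 3), round(mean_formula, 3))
print(round(y.var(), 3), round(var_formula, 3))
print(round(float(np.median(y)), 3), round(median_formula, 3))
```

The simulated mean, variance, and median agree with the formulas, and every draw is positive.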
You can use Numpy's log-normal function to get log-normally distributed random numbers, but you may want to return the numbers to their original scale. You could use Python's math package to do this. As an example, you can use math.log( np.random.lognormal( mean = 0.0, sigma = 1.0, size = None ) ), where the np.random.lognormal function returns a random number in the form e^X and the natural log function from the math package puts this random number's units back on the original X scale. Incidentally, note that according to the Numpy documentation, "[the] mean and standard deviation [for the lognormal function] are not the values for the distribution itself, but of the underlying normal distribution it is derived from."11 I show several comparative plots of the log-normal distribution in Fig. 12.14. You can now simulate the allocations for random variations of orders. Since orders cannot be negative, you can add a positive random number to the "predicted" or expected orders that were 700 and 500 for mats and liners, respectively, in my example. I show a setup for this simulation in Fig. 12.15 and some output in Fig. 12.16.
12.3 Appendix

12.3.1 Poisson and Exponential Distribution Relationship

The pdf for a Poisson random variable is

f(x) = ((λ × t)^x × e^{−λ×t}) / x!    (12.9)
where λ is a mean count of an event per unit time, λ × t is the mean count, and x is a particular value of the random variable X for the event. An event could be an arrival of a customer to a website to look for and potentially place an order. If T is the time for an event, then the cumulative distribution is

F(t) = Pr(T ≤ t)    (12.10)
     = 1 − Pr(T > t)    (12.11)
where Pr(T ≤ t) is interpreted as Pr(Time of an Event ≤ t). Notice that Pr(T > t) implies that no events occur before time t since T is the time of an event. This means that x = 0. So
11 Source: Numpy Documentation. Emphasis added.
Fig. 12.14 This shows a comparison of the log-normal distribution for different values of the shape parameter. The location parameter is 0 for each graph
Pr(T > t) = Pr(x = 0)    (12.12)
          = ((λ × t)^0 × e^{−λ×t}) / 0!    (12.13)
          = e^{−λ×t}    (12.14)
Therefore, Pr(T ≤ t) = 1 − e^{−λ×t}, which is the exponential distribution. This is the cumulative distribution. The pdf is obtained by taking the first derivative
Fig. 12.15 This is the setup for a linear programming simulation
with respect to t, which yields λ × e^{−λ×t}. Notice that the Poisson distribution is for discrete events (e.g., an arriving customer to place an order), while the exponential is continuous (e.g., measuring time). The two distributions can be compared by referencing how arrivals occur during a fixed period, say 60 min. Figure 12.17 shows 12 customers arriving at random times within the 60 min.
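The relationship can also be demonstrated numerically: draw exponential interarrival times with rate λ, count how many arrivals land in each unit interval, and the counts behave like a Poisson with mean λ. The value λ = 4 and the horizon are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
lam, horizon = 4.0, 100_000            # rate per unit time; total time units

# Cumulative sums of exponential interarrival times give the arrival clock times:
interarrivals = rng.exponential(scale=1 / lam, size=int(lam * horizon * 2))
arrival_times = np.cumsum(interarrivals)
arrival_times = arrival_times[arrival_times < horizon]

# Count arrivals in each unit interval; the counts should be Poisson(lam):
counts = np.bincount(arrival_times.astype(int), minlength=horizon)
print(round(counts.mean(), 3), round(counts.var(), 3))  # both near lam for a Poisson
```

The sample mean and variance of the counts are both close to λ, the signature of a Poisson distribution.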
Fig. 12.16 This is the output for the linear programming simulation setup in Fig. 12.15
12.3.2 Maximum Likelihood Estimator of λ

An assumption for the Poisson distribution is that the events are all independent. This implies that the likelihood function can be written as the product of the individual Poissons:

L(λ; x_1, x_2, . . . , x_n) = ∏_{i=1}^{n} (e^{−λ} × λ^{x_i}) / x_i!

The log-likelihood is then

l(λ; x_1, x_2, . . . , x_n) = ln(L) = −n × λ − ∑_{i=1}^{n} ln(x_i!) + ln(λ) × ∑_{i=1}^{n} x_i.
Fig. 12.17 This illustrates the Poisson and exponential distributions in the context of arriving customers. Based on Taboga (2021). Permission granted to use with modifications
The derivative of l(λ; x_1, x_2, . . . , x_n) with respect to λ is

dl/dλ = −n + (1/λ) × ∑_{i=1}^{n} x_i.

Setting this equal to zero for the maximization yields the result: the estimator is the arithmetic mean of the sample observations, λ̂ = X̄.
12.3.3 Mean and Variance of the Exponential Distribution

If X is exponentially distributed with scale parameter λ and location zero, then the expected value is

E(X) = ∫_0^∞ x × λ × e^{−λ×x} dx.    (12.15)
Evaluating this integral using integration-by-parts yields E(X) = 1/λ. See Granville et al. (1941) for a good introduction to integration-by-parts. The variance of X can be found by first finding E(X²), which is 2/λ². Then

V(X) = E(X²) − E(X)²    (12.16)
     = 2/λ² − 1/λ²    (12.17)
     = 1/λ².    (12.18)
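Both moments are easy to confirm numerically; a quick check with NumPy follows, where λ = 9 (matching the arrival-rate example) is the only assumed input.

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 9.0
x = rng.exponential(scale=1 / lam, size=2_000_000)  # NumPy's scale is 1/lambda

print(round(x.mean(), 4), round(1 / lam, 4))        # E(X) = 1/lambda
print(round(x.var(), 5), round(1 / lam**2, 5))      # V(X) = 1/lambda^2
```

The sample mean and variance land on 1/λ and 1/λ², as the derivations require.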
12.3.4 Using Simpy

Another way to analyze a queueing problem is to use the Python package Simpy. It was designed to handle discrete event simulations, and most of its applications seem to be queue-related. You can install simpy using pip install simpy. See the online documentation for its use and examples.
Chapter 13
Applications: Tactical and Strategic Scale Views
This chapter continues the examples of the melding of predictive and simulation analytics with a focus on a tactical and strategic scale view. The creation of predictions is the same as for the operational scale view, but the data and the nature of the model change as you might expect. The simulations, however, could potentially be different. Different how? Operational scale view simulations are predominantly stochastic. Both the tactical and strategic scale view simulations could be stochastic or non-stochastic. If the latter, then they are in the class of system dynamics (SD) simulations with what-if scenarios. It all depends on the problem.
13.1 Tactical Scale View Applications I will describe two examples of tactical scale view problems: one for pricing and the other for churn. Both have very complex issues, so in this section, I only present enough material to illustrate main points dealing with predictions and simulations.
13.1.1 Tactical Application I: Pricing

A tactical scale view is, by definition, broader, with a lot more to consider. I will continue with the queueing example from Chap. 12 but extend it to a marketing problem to attract customers using pricing and advertised wait times to receive an order as tactics. The focus is on higher-level managers with broader responsibilities for products, pricing, advertising, and sales. These managers will not deal with operations, but they are affected by, and subsequently affect, daily operations. This broader scale view has different components to consider. For example, a product manager at the tactical level must consider customer satisfaction, revenue
from sales, costs of the order, and the net contribution margin to the larger enterprise, while the operational manager only has to be concerned with meeting daily quotas. The first inclination of a product manager facing an order queueing issue that negatively affects customer satisfaction is to address the service rate. This can be handled by increasing the number of servers. The model I used above is the M/M/1 queueing system with one server. A more general framework uses c > 1 servers: M/M/c. In a bank, if only one teller is open to serve customers but a queue develops, more teller windows would be opened to serve those customers faster. This implies, of course, that there are trained tellers available when needed to handle the extra windows. This is certainly an operational issue: allocate the tellers to handle the customer load. A tactical situation exists if the manager hires more tellers and advertises that more windows will be open, perhaps at certain busy times of the day such as noontime and around 4 PM before the end of the day. A strategic situation exists if the tellers are supplemented by ATMs or eventually replaced by mobile (i.e., smartphone) apps for mobile banking. There is another way you can tactically handle customer satisfaction issues due to queues. You could design the queueing experience so that customers do not notice how long they are in a line waiting to be served. Norman (2009) notes that the frustrations of standing in a line can be minimized (although, perhaps, not eliminated) by using customer satisfaction-enhancing tactics such as:

• Entertainment (e.g., monitors displaying game, home building, or weather channels)
• Free samples (e.g., beverages, candy, cookies)
• Employee or electronic interactions with waiting customers (e.g., to answer customer questions and keep them abreast of service status)

In the long run, these tactics may be more effective than increasing the number of servers.
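The payoff from opening more windows can be quantified with the standard Erlang C formula for the expected wait in an M/M/c queue. This sketch is not from the book; λ = 9 and μ = 12 per minute are assumed for illustration.

```python
import math

def erlang_c(c, lam, mu):
    """Probability an arrival must wait in an M/M/c queue (Erlang C formula)."""
    a = lam / mu                      # offered load
    rho = a / c                       # server utilization; must be < 1 for stability
    top = (a**c / math.factorial(c)) * (1 / (1 - rho))
    bottom = sum(a**k / math.factorial(k) for k in range(c)) + top
    return top / bottom

def mean_wait(c, lam, mu):
    """Expected time in the queue, Wq, for an M/M/c system."""
    return erlang_c(c, lam, mu) / (c * mu - lam)

lam, mu = 9, 12
for c in (1, 2, 3):
    print(c, round(mean_wait(c, lam, mu), 4))  # adding servers shrinks the wait
```

With c = 1 the formula reduces to the familiar M/M/1 result Wq = ρ/(μ − λ) = 0.25 min, and each additional server cuts the expected wait sharply, which is the operational case for opening more windows.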
But they may be strategically insufficient given advances in technology, competitive pressures, and demographic and cultural shifts (e.g., a younger demographic may be drawn to, and even demand, more sophisticated technological interfaces). I will comment on strategic issues in Sect. 13.2. The operational scale view in the previous section implicitly assumed that customers merely flowed to the online website without regard to the product itself or its features. It also implicitly assumed they knew all about the product and that the online provider had it to sell. Finally, it assumed that customers place an order merely by the act of arriving at the website or any business location; an arrival is tantamount to an order. These implicit assumptions may be valid for some, but not all, business processes. Typical textbook queueing problems cover banks, fast-food restaurants, gas stations, and the like, where customers just "drive up" and wait to be served. Some processes are not that simple. An online website is an example. Customers do not know the features, including price, until they arrive at the site. That is when they can see the list of features, price points, availability, and expected delivery time. They may know about the site through a promotion or word-of-mouth before visiting it, but little else. My personal experience buying a new instant-hot faucet is an example.
I was advised where to shop online, but I did not know about the faucet's features, including the price point, until I got to the site. Customers could randomly (from the perspective of the business) visit the website to shop, so there is an arrival, but then leave without placing an order. They could balk. Of course, customers could also place an order and cancel it before fulfillment; they could renege on the order. In both cases, customers are impatient. See Gross and Harris (1974, pp. 134-140). I will only consider balking. The basic queueing model with a constant arrival rate λ assumes, as I noted, that arriving customers automatically place an order. For online Web stores, this is clicking on a button to add a product to a shopping cart. In this case, a customer is converted into an actual buyer from being just a visitor to the site. But the same holds for fast food: someone who drives up to the ordering window places an order. In these cases, the arrival rate, λ, is a gross number in the sense that all the customers place an order. Allowing for balking changes this. The gross arrival rate may be constant, but balking reduces it; there is a net arrival rate. A balking function reduces or dampens the gross arrivals. In queueing theory, someone is said to balk at joining a queue if they find the length of the queue to be too long. This is tantamount to having to wait too long for fulfillment since the wait time is proportional to queue length. I will use the concept in this sense, but I will extend it to include cases in which someone decides not to join a queue for other reasons. In a business context, these could include a price point that is too high or product features that do not match the customers' requirements. These are especially important for new products. As noted by Paczkowski (2020a), many new products fail because of pricing issues, poor design, and messaging failures.
For the latter, this could be how the product is presented on the website or the whole website experience. The website pages could be too confusing, aesthetically poorly designed, or difficult to navigate, to mention a few reasons. This is why the market research technique called A/B testing is used to test variations of Web page content, although it can be used to test any differences. See Paczkowski (2020a) for a description of A/B testing for new product development. An arriving customer does not know the price or availability until he/she arrives at the ordering system, at which point both pieces of information are revealed. If the price is too high or the waiting time for delivery is too long, then the customer balks and leaves without placing an order. This is a more general view of balking that implies a balking function with arguments price and waiting time. Both are announced only when the customer arrives. Let B = f(P, W) be the balking function with arguments P (the price) and W (the expected waiting time). I will assume that 0 ≤ B ≤ 1 so that B can be interpreted as a probability of balking. If B = 1, then the customer will definitely balk and leave; if B = 0, then the customer will stay and see the ordering process to its conclusion. I will assume that ∂B/∂P > 0 and ∂B/∂W > 0: if the price increases, the probability of balking increases; similarly for the wait time. See Gross and Harris (1974, p. X) for a definition of a balking function. The number of lost customers, L, due to balking is the arrival rate times the balking probability, or
L = λ × B(P, W)    (13.1)

with

∂L/∂P = λ × ∂B/∂P > 0    (13.2)

and

∂L/∂W = λ × ∂B/∂W > 0.    (13.3)

We now have

λ′ = λ − L    (13.4)
   = λ × [1 − B(P, W)]    (13.5)
   = λ × J(P, W)    (13.6)
where λ′ is the net arrival rate and J(P, W) is the probability of joining the system. People either balk or join to place an order. These are mutually exclusive and completely exhaustive actions, so B(P, W) + J(P, W) = 1, with the two functions interpreted as probabilities. So J(P, W) = 1 − B(P, W) and ∂J/∂P = −∂B/∂P
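A concrete balking function with these properties is easy to construct; here is a sketch using a logistic form, where the functional form and its coefficients are my assumptions, not the book's specification.

```python
import math

def balk_prob(price, wait, b0=-4.0, b_p=0.05, b_w=0.5):
    """Logistic balking probability B(P, W); increasing in price and wait time."""
    return 1 / (1 + math.exp(-(b0 + b_p * price + b_w * wait)))

def net_arrival_rate(lam, price, wait):
    """lambda' = lambda * J(P, W) = lambda * (1 - B(P, W))."""
    return lam * (1 - balk_prob(price, wait))

lam = 9          # gross arrival rate per minute
for price, wait in [(20, 1), (40, 1), (40, 5)]:
    print(price, wait, round(net_arrival_rate(lam, price, wait), 3))
```

The logistic form keeps B between 0 and 1, and raising either the price or the advertised wait dampens the net arrival rate, exactly the behavior required by (13.1)-(13.6).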