Walter R. Paczkowski
Modern Survey Analysis: Using Python for Deeper Insights
Data Analytics Corp., Plainsboro, NJ, USA
ISBN 978-3-030-76266-7    ISBN 978-3-030-76267-4 (eBook)
https://doi.org/10.1007/978-3-030-76267-4
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface
The historical root of my professional career as a data scientist, including my own consulting company focused on data science, has been survey analysis, primarily consumer surveys in the marketing domain. My experience has run the gamut from simple consumer attitudes, opinions, and interest (AIO) surveys to complex discrete choice, market segmentation, messaging and claims, pricing, and product positioning surveys. The purposes have ranged from simple informative market scanning to in-depth marketing mix and new product development work, for companies in a wide variety of industries such as jewelry, pharmaceuticals, household products, education, medical devices, and automotive, to mention a few. Along the way I learned a lot about survey data: how to collect them, organize them for analysis, and, of course, analyze them for actionable insights and recommendations for my clients. This book is focused on analyzing survey data based on what I learned.

I have two overarching objectives for this book:

1. Show how to extract actionable, insightful, and useful information from survey data.
2. Show how to use Python to analyze survey data.
Why Surveys?

Why focus on surveys, other than the fact that this is my career heritage? The answer is simple. Surveys are a main source of data for key decision makers (KDMs), whether in the private or public sector. They need these data for the critical decisions they must make every day, decisions that have short-term and long-term implications and effects. Surveys are not the only source, and definitely not the least important one. There are four sources that are relied on to some extent, the extent varying by the type of KDM and problem. The sources, in no particular order, are:
1. Observational
2. Sensors
3. Experimental
4. Surveys
Observational and sensor measurements are historical data: data on what happened. These could be transactional (such as when customers shopped), production, employment, voter registrations and turnout, and the list goes on. Some are endogenous to the business or public agency, meaning they are the result of actions or decisions made by KDMs in the daily running of the business or public life. The KDMs ultimately have control over how such data are generated (aside from random events, which no one can control). Other data are exogenous, meaning they are determined or generated by forces outside the control of the KDMs, over and beyond random events. The movement of the economy through a business cycle is a good example. Regardless of the form (endogenous or exogenous), these data represent what did happen or is currently happening.

Sensor-generated data are in the observational category. The difference is more one of degree than of kind. Sensor data are generated in real time and transmitted to a central data collection point, usually over wireless sensor networks (WSNs). The result is a data flood, a deluge that must be stored and processed almost instantaneously. These data could represent measures in a production process, health measures in a medical facility, automobile performance measures, traffic patterns on major thoroughfares, and so forth. But all these sensor-generated data also represent what did happen or is currently happening. See Paczkowski (2020) for some discussion of sensor data and WSNs in the context of new product development.

Experimental data are derived from designed experiments that have very rigid protocols to ensure that every aspect of a problem (i.e., its factors or attributes) has equal representation in a study, that is, the experiment. The data are not historical, as for observational and sensor data, but "what-if" in nature: what-if about future events under controlled conditions. Examples are:

• What if temperature is set at a high vs. a low level? This is an industrial experiment.
• What if price is $X rather than $Y? This is a marketing experiment.
• What if one color is used rather than another? This is a product development experiment.
• How would your vote change if candidate XX drops out of the presidential race? This is a political issue.

Observational and sensor measurements are truly data, that is, they are facts. Some experimental studies, such as those listed above, will tell you about opinions, while others (e.g., the industrial experiments) will not. Generally, none of these will tell you about people's opinions, plans, attitudes, reasons, understanding, awareness, familiarity, or concerns, all of which are subjective and personal. This list is more emotional, intellectual, and knowledge based. Items on the list are concerned with what people feel, believe, and know rather than with what they did or could do under different conditions. This is where surveys enter the picture.
Marketing and public opinion what-if experiments are embedded in surveys, so they are a hybrid of the two forms. Surveys can be combined with the other three forms. They allow you, for instance, to study artificial, controlled situations as in an industrial experiment. For example, in a pricing study, a survey could reveal preferences for pricing programs, strategies, and willingness to pay without actually changing prices. Conjoint, MaxDiff, and discrete choice studies are examples of experiments conducted within a survey framework. For what follows, I will differentiate between industrial and non-industrial experiments, the latter including marketing and opinion poll experiments embedded in surveys.

Surveys get at an aspect of people's psyche. Behavior can certainly be captured by asking survey respondents what they recently did (e.g., how much did they spend on jewelry this past holiday season?) or might do under different conditions (e.g., will they still purchase if the price rises by X%?). These responses are not as accurate as direct observation, sensor measurement, or industrial experiments because they rely on what people have to say, and people are not always accurate or truthful in this regard. Even marketing experiments are not as accurate as actual purchase data because people tend to overstate how much they will buy, so such data have to be calibrated to make them more reasonable. Nonetheless, compared to the other three forms of data collection, surveys are the only way to get at what people are thinking.

Why should it matter what people think? This is important because people (as customers, clients, and constituents) make personal decisions, based on what they know or are told, regarding purchases, what to request, what to register for, or who to vote for. These decisions are reflected in actual market behavior (i.e., purchases) or votes cast. Knowing how people think helps explain the observed behavior. Without an explanation, all you have is observed behavior devoid of understanding. In short, surveys add another dimension to the data collected from the other three data collection methods, especially observed transactional data.

Surveys have limitations, not the least of which are:

1. People's responses are very subjective and open to interpretation.
2. People's memories are dubious, foggy, and unclear.
3. People's predictions of their own behavior (e.g., purchase intent or vote to cast) may not be fulfilled for a host of unknown and unknowable causes.
4. People tend to overstate intentions (e.g., how much they will spend on gifts for the next holiday season).

The other data collection methods also have their shortcomings, so the fact that surveys are not flawless is not a reason to avoid them. You just need to know how to use them. This includes how to structure and conduct a survey, how to write a questionnaire, and, of course, how to analyze the data. This book focuses on the last of these: analyzing survey data for actionable, insightful, and useful information.
Why Python?

The second overarching goal for this book is to describe how Python can be used for survey data analysis. Python has several advantages in this area:

• It is free.
• It has a rich array of packages for analyzing data in general.
• It is programmable (every analyst should know some programming), and it is easy to program.

You could ask, "Why not just use spreadsheets?" Unfortunately, spreadsheets have major issues, several of which are:

• Data are often spread across several worksheets in a workbook.
• They make it difficult to identify data.
• They lack table operations such as joining, splitting, or stacking.
• They lack programming capabilities except Visual Basic for Applications (VBA), which is not a statistical programming language.
• They lack sophisticated statistical operations beyond arithmetic operations and simple regression analysis (add-on packages help, but they tend to lack depth and rely on the spreadsheet engine).
• Spreadsheets are notorious for making it difficult to track formulas and catch errors. Each cell could have a separate formula, even cells in the same column for a single variable.
• The formula issue leads to reproducibility problems. The cells in a spreadsheet are linked, even across spreadsheets in the same workbook or across workbooks, often with no clear pattern. Tracing and reproducing an analysis is often difficult or impossible.
• Graphics are limited.
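To make the contrast concrete, here is a minimal sketch, not taken from the book's case studies, of the table operations (joining and stacking) and scripted reproducibility that Pandas provides in a few lines. The file names (answers.csv, demographics.csv) and column names (RespID, Q1, and so on) are hypothetical placeholders for whatever your survey vendor delivers.

    import pandas as pd

    # Hypothetical survey files: one with responses, one with respondent demographics.
    answers = pd.read_csv("answers.csv")      # columns: RespID, Q1, Q2, Q3
    demos = pd.read_csv("demographics.csv")   # columns: RespID, Age, Gender, Region

    # Join the two tables on the respondent ID, a table operation spreadsheets lack.
    df = answers.merge(demos, on="RespID", how="left")

    # Stack the question columns from wide to long form for easier summarization.
    long_df = df.melt(id_vars=["RespID", "Age", "Gender", "Region"],
                      var_name="Question", value_name="Response")

    # Every step is explicit code, so the analysis can be rerun and reproduced exactly.
    print(long_df.groupby(["Question", "Gender"])["Response"].count())

Because the whole workflow is a script, rerunning it on a refreshed survey file is a matter of executing the cells again rather than retracing chains of cell formulas.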
Preliminaries for Getting Started

To successfully read this book, you will need Python and Pandas (and other Python packages) installed on your computer so you can follow the examples. This book is meant to be interactive, not static. A static book is one that you just read, trying to absorb its messages. An interactive book is one that you read and then reproduce the examples.

The examples are generated in a Jupyter notebook. A Jupyter notebook is the programming tool of choice for data scientists for organizing, conducting, and documenting their statistical and analytical work. It provides a convenient way to enter programming commands, get the output from those commands, and document what was done or what is concluded from the output. The output from executing a command immediately follows the command, so input and output "stay together." I do everything in Jupyter notebooks.
I provide screenshots of how to run commands and develop analyses along with the resulting output. This way, the Python code and the resulting output are presented as a unit. In addition, the code is well documented with comments so you can easily follow the steps I used to do a task. But of course, you can always go back to the Jupyter notebooks to see the actual code and run it yourself. I strongly recommend that you have Jupyter installed since Jupyter notebooks will be illustrated throughout this book. A Jupyter notebook of this book's contents is available.

If you do not have Jupyter, Python, and Pandas available, then I recommend that you download and install Anaconda [1], a freeware package that gives you access to everything you will need. Just select the download appropriate for your operating system. After you install Anaconda, you can use the Anaconda Navigator to launch Jupyter [2]. A basic, introductory course in statistics is beneficial, primarily for the later chapters.
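As a quick sanity check that the installation worked, you might run a cell like the following in a new Jupyter notebook. This snippet is mine, not one of the book's notebooks; it assumes only the standard Pandas and NumPy packages that ship with Anaconda, and the versions printed on your machine will differ.

    # Run this in a Jupyter notebook cell to confirm the environment is ready.
    import sys

    import numpy as np
    import pandas as pd

    print("Python:", sys.version.split()[0])
    print("NumPy:", np.__version__)
    print("Pandas:", pd.__version__)

    # A tiny DataFrame to confirm that Pandas works; the output appears below the cell.
    df = pd.DataFrame({"respondent": [1, 2, 3], "answer": ["Yes", "No", "Yes"]})
    df.head()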
The Book's Structure

This book has nine chapters. Chapter 1 sets the stage with a discussion of the importance of surveys and Python. Chapter 2 focuses on knowing the structure of your data, which is really the profile of the survey respondents. Chapter 3 is concerned with shallow data analysis: simple statistics and simple visualizations, such as bar and pie charts of the main survey questions. This is where many analyses of survey data end. Chapter 4 is about deep data analysis that goes beyond the shallow analyses. Chapter 5 extends the deep analysis begun in Chap. 4 by introducing three regression models for deep analysis: OLS, logistic regression, and Poisson regression. Chapter 6 covers some specialized survey objectives to illustrate some of the concepts developed in the previous chapters. Chapter 7 changes focus and covers complex sample surveys, including the different stages of complex samples. Chapters 8 and 9 cover advanced material: Bayesian statistics applied to survey data analysis. You may be familiar with some Bayesian concepts. If not, then Chap. 8 will help you because it covers the basic concepts leading to Bayes' Rule. I show in that chapter how to estimate Bayesian models using a Python package. I then extend the material in Chap. 8 to more advanced material in Chap. 9. These chapters will provide you with a new perspective on survey data and how to include prior information in your analyses.
Plainsboro, NJ, USA
Walter R. Paczkowski

[1] Download Anaconda from https://www.anaconda.com/download/.
[2] Please note that there is Jupyter and JupyterLab. JupyterLab is the newer development version of Jupyter, so it is not ready for "prime time." I will only use Jupyter, which is stable at this time.
Acknowledgments
In my last book, I noted the support and encouragement I received from my wonderful wife, Gail, and my two daughters, Kristin and Melissa. As before, Gail encouraged me to sit down and just write, especially when I did not want to, while my daughters provided the extra set of eyes I needed to make this book perfect. They provided the same support and encouragement for this book, so I owe them a lot, both then and now. I would also like to say something about my two grandsons who, now at 6 and 10, obviously did not contribute to this book but who, I hope, will look at this one in their adult years and say, "Yup. My grandpa wrote this book, too."
Contents

1 Introduction to Modern Survey Analytics
  1.1 Information and Survey Data
  1.2 Demystifying Surveys
    1.2.1 Survey Objectives
    1.2.2 Target Audience and Sample Size
      1.2.2.1 Key Parameters to Estimate
      1.2.2.2 Sample Design to Use
      1.2.2.3 Population Size
      1.2.2.4 Alpha
      1.2.2.5 Margin of Error
      1.2.2.6 Additional Information
    1.2.3 Screener and Questionnaire Design
    1.2.4 Fielding the Study
    1.2.5 Data Analysis
    1.2.6 Report Writing and Presentation
  1.3 Sample Representativeness
    1.3.1 Digression on Indicator Variables
    1.3.2 Calculating the Population Parameters
  1.4 Estimating Population Parameters
  1.5 Case Studies
    1.5.1 Consumer Study: Yogurt Consumption
    1.5.2 Public Sector Study: VA Benefits Survey
    1.5.3 Public Opinion Study: Toronto Casino Opinion Survey
    1.5.4 Public Opinion Study: San Francisco Airport Customer Satisfaction Survey
  1.6 Why Use Python for Survey Data Analysis?
  1.7 Why Use Jupyter for Survey Data Analysis?

2 First Step: Working with Survey Data
  2.1 Best Practices: First Steps to Analysis
    2.1.1 Installing and Importing Python Packages
    2.1.2 Organizing Routinely Used Packages, Functions, and Formats
    2.1.3 Defining Data Paths and File Names
    2.1.4 Defining Your Functions and Formatting Statements
    2.1.5 Documenting Your Data with a Dictionary
  2.2 Importing Your Data with Pandas
  2.3 Handling Missing Values
    2.3.1 Identifying Missing Values
    2.3.2 Reporting Missing Values
    2.3.3 Reasons for Missing Values
    2.3.4 Dealing with Missing Values
      2.3.4.1 Use the fillna( ) Method
      2.3.4.2 Use the Interpolation( ) Method
      2.3.4.3 An Even More Sophisticated Method
  2.4 Handling Special Types of Survey Data
    2.4.1 CATA Questions
      2.4.1.1 Multiple Responses
      2.4.1.2 Multiple Responses by ID
      2.4.1.3 Multiple Responses Delimited
      2.4.1.4 Indicator Variable
      2.4.1.5 Frequencies
    2.4.2 Categorical Questions
  2.5 Creating New Variables, Binning, and Rescaling
    2.5.1 Creating Summary Variables
    2.5.2 Rescaling
    2.5.3 Other Forms of Preprocessing
  2.6 Knowing the Structure of the Data Using Simple Statistics
    2.6.1 Descriptive Statistics and DataFrame Checks
    2.6.2 Obtaining Value Counts
    2.6.3 Styling Your DataFrame Display
  2.7 Weight Calculations
    2.7.1 Complex Weight Calculation: Raking
    2.7.2 Types of Weights
  2.8 Querying Data

3 Shallow Survey Analysis
  3.1 Frequency Summaries
    3.1.1 Ordinal-Based Summaries
    3.1.2 Nominal-Based Summaries
  3.2 Basic Descriptive Statistics
  3.3 Cross-Tabulations
  3.4 Data Visualization
    3.4.1 Visuals Best Practice
    3.4.2 Data Visualization Background
    3.4.3 Pie Charts
    3.4.4 Bar Charts
    3.4.5 Other Charts and Graphs
      3.4.5.1 Histograms and Boxplots for Distributions
      3.4.5.2 Mosaic Charts
      3.4.5.3 Heatmaps
  3.5 Weighted Summaries: Crosstabs and Descriptive Statistics

4 Beginning Deep Survey Analysis
  4.1 Hypothesis Testing
    4.1.1 Hypothesis Testing Background
    4.1.2 Examples of Hypotheses
    4.1.3 A Formal Framework for Statistical Tests
    4.1.4 A Less Formal Framework for Statistical Tests
    4.1.5 Types of Tests to Use
  4.2 Quantitative Data: Tests of Means
    4.2.1 Test of One Mean
    4.2.2 Test of Two Means for Two Populations
      4.2.2.1 Standard Errors: Independent Populations
      4.2.2.2 Standard Errors: Dependent Populations
    4.2.3 Test of More Than Two Means
  4.3 Categorical Data: Tests of Proportions
    4.3.1 Single Proportions
    4.3.2 Comparing Proportions: Two Independent Populations
    4.3.3 Comparing Proportions: Paired Populations
    4.3.4 Comparing Multiple Proportions
  4.4 Advanced Tabulations
  4.5 Advanced Visualization
    4.5.1 Extended Visualizations
    4.5.2 Geographic Maps
    4.5.3 Dynamic Graphs
  Appendix

5 Advanced Deep Survey Analysis: The Regression Family
  5.1 The Regression Family and Link Functions
  5.2 The Identity Link: Introduction to OLS Regression
    5.2.1 OLS Regression Background
    5.2.2 The Classical Assumptions
    5.2.3 Example of Application
    5.2.4 Steps for Estimating an OLS Regression
    5.2.5 Predicting with the OLS Model
  5.3 The Logit Link: Introduction to Logistic Regression
    5.3.1 Logistic Regression Background
    5.3.2 Example of Application
    5.3.3 Steps for Estimating a Logistic Regression
    5.3.4 Predicting with the Logistic Regression Model
  5.4 The Poisson Link: Introduction to Poisson Regression
    5.4.1 Poisson Regression Background
    5.4.2 Example of Application
    5.4.3 Steps for Estimating a Poisson Regression
    5.4.4 Predicting with the Poisson Regression Model
  Appendix

6 Sample of Specialized Survey Analyses
  6.1 Conjoint Analysis
    6.1.1 Case Study
    6.1.2 Analysis Steps
    6.1.3 Creating the Design Matrix
    6.1.4 Fielding the Conjoint Study
    6.1.5 Estimating a Conjoint Model
    6.1.6 Attribute Importance Analysis
  6.2 Net Promoter Score
  6.3 Correspondence Analysis
  6.4 Text Analysis

7 Complex Surveys
  7.1 Complex Sample Survey Estimation Effects
  7.2 Sample Size Calculation
  7.3 Parameter Estimation
  7.4 Tabulation
    7.4.1 Tabulation
    7.4.2 CrossTabulation
  7.5 Hypothesis Testing
    7.5.1 One-Sample Test: Hypothesized Mean
    7.5.2 Two-Sample Test: Independence Case
    7.5.3 Two-Sample Test: Paired Case

8 Bayesian Survey Analysis: Introduction
  8.1 Frequentist vs Bayesian Statistical Approaches
  8.2 Digression on Bayes' Rule
    8.2.1 Bayes' Rule Derivation
    8.2.2 Bayes' Rule Reexpressions
    8.2.3 The Prior Distribution
    8.2.4 The Likelihood Function
    8.2.5 The Marginal Probability Function
    8.2.6 The Posterior Distribution
    8.2.7 Hyperparameters of the Distributions
  8.3 Computational Method: MCMC
    8.3.1 Digression on Markov Chain Monte Carlo Simulation
    8.3.2 Sampling from a Markov Chain Monte Carlo Simulation
  8.4 Python Package pyMC3: Overview
  8.5 Case Study
    8.5.1 Basic Data Analysis
  8.6 Benchmark OLS Regression Estimation
  8.7 Using pyMC3
    8.7.1 pyMC3 Bayesian Regression Setup
    8.7.2 Bayesian Estimation Results
      8.7.2.1 The MAP Estimate
      8.7.2.2 The Visualization Output
  8.8 Extensions to Other Analyses
    8.8.1 Sample Mean Analysis
    8.8.2 Sample Proportion Analysis
    8.8.3 Contingency Table Analysis
    8.8.4 Logit Model for Contingency Table
    8.8.5 Poisson Model for Count Data
  8.9 Appendix
    8.9.1 Beta Distribution
    8.9.2 Half-Normal Distribution
    8.9.3 Bernoulli Distribution

9 Bayesian Survey Analysis: Multilevel Extension
  9.1 Multilevel Modeling: An Introduction
    9.1.1 Omitted Variable Bias
    9.1.2 Simple Handling of Data Structure
    9.1.3 Nested Market Structures
  9.2 Multilevel Modeling: Some Observations
    9.2.1 Aggregation and Disaggregation Issues
    9.2.2 Two Fallacies
    9.2.3 Terminology
    9.2.4 Ubiquity of Hierarchical Structures
  9.3 Data Visualization of Multilevel Data
    9.3.1 Basic Data Visualization and Regression Analysis
  9.4 Case Study Modeling
    9.4.1 Pooled Regression Model
    9.4.2 Unpooled (Dummy Variable) Regression Model
    9.4.3 Multilevel Regression Model
  9.5 Multilevel Modeling Using pyMC3: Introduction
    9.5.1 Multilevel Model Notation
    9.5.2 Multilevel Model Formulation
    9.5.3 Example Multilevel Estimation Set-up
    9.5.4 Example Multilevel Estimation Analyses
  9.6 Multilevel Modeling with Level Explanatory Variables
  9.7 Extensions of Multilevel Models
    9.7.1 Logistic Regression Model
    9.7.2 Poisson Model
    9.7.3 Panel Data
  Appendix

References
Index
List of Figures
Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig.
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
Fig. Fig. Fig. Fig.
1.10 1.11 1.12 2.1
Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig.
2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 2.16 2.17
The Survey Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . General Questionnaire Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Creating an Indicator Function in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yogurt Sample Size Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yogurt Consumption Questionnaire Structure. . . . . . . . . . . . . . . . . . . . . . . . VA Study Population Control Totals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vets Questionnaire Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Toronto Casino Questionnaire Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . San Francisco International Airport Customer Satisfaction Questionnaire Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anaconda Navigator Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anaconda Environment Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jupyter Dashboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . This illustrates the connection between functions and methods for enhanced functionality in Python . . . . . . . . . . . . . . . . . . . . . . . Python Package Import . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Use of the %run Magic to Import Packages . . . . . . . . . . . . . . . . . . . . . . . . . . Illustrative Data and Notebook Path Hierarchy . . . . . . . . . . . . . . . . . . . . . . Example of Importing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Importing a CSV File Into Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Importing an Excel Worksheet Into Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . Importing an SPSS Worksheet Into Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . Importing an SPSS Using pyReadStat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Life Question for pyreadstat Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Retrieving Column Label (Question) for Column Name . . . . . . . . . . . . Retrieving Value Labels (Question Options) for Column Name . . . . Categorical Coding of a Likert Scale Variable . . . . . . . . . . . . . . . . . . . . . . . Value Counts for VA Data without Categorical Declaration . . . . . . . . . Value Counts for VA Data with Categorical Declaration. . . . . . . . . . . . . Application of CategoricalDtype. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Recoding of Yogurt Satisfaction Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5 13 21 26 27 28 29 30 31 33 33 34 39 40 40 41 42 45 46 47 47 48 48 49 55 57 57 58 59 xix
xx
List of Figures
Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig.
2.18 2.19 2.20 2.21 2.22 2.23 2.24 2.25 2.26 2.27 2.28 2.29
Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig.
3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 3.11 3.12 3.13 3.14 3.15 3.16 3.17 3.18 3.19 3.20 3.21 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29 4.1 4.2 4.3
Age Calculation from Vet YOB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Military Branch Calculation for the Vet data . . . . . . . . . . . . . . . . . . . . . . . . . Simple Weight Calculation in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Merging Weights into a DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Raking Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Raking with ipfn Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weights Based om Raking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stacked Weights for Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Analysis of Stacked Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Query of Female Voters Only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Query of Female Voters Who Are 100% Likely to Vote . . . . . . . . . . . . . Query of Female Voters Who Are 100% Likely to Vote or Extremely Likely to Vote . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Frequency Summary Table: Ordinal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . Frequency Summary Table: Nominal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . Yogurt Data Subset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yogurt Data Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Example of Mean Calculation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Basic Crosstab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Enhanced Cross-tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Enhanced Cross-tab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Basic Cross-tab Using the pivot_table Method . . . . . . . . . . . . . . . . . . . . . . . One-way Table Using the pivot_table Method . . . . . . . . . . . . . . . . . . . . . . . Matplotlib Figure and Axis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pie Chart for Likelihood to Vote . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yogurt Age-Gender Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pie Charts for Yogurt Age-Gender Distribution . . . . . . . . . . . . . . . . . . . . . . Alternative Pie Charts for Yogurt Age-Gender Distribution . . . . . . . . . Yogurt Consumers’ Gender Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yogurt Consumers’ Gender Bar Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stacking Data for SBS Bar Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . SBS Bar Chart for the Yogurt Age-Gender Distribution . . . . . . . . . . . . . Histogram Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Boxplot Anatomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Histogram Example. . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mosaic Chart Using Implicit Cross-tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mosaic Chart Using Explicit Cross-tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mosaic Chart Using Three Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Heatmap of the Age-Gender Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . Check Sum of Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Calculation of Weighted Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . Weighted Cross-tabs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hypothesis Testing Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Statistical Test Flowchart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Comparison of Normal and Student’s t-distribution . . . . . . . . . . . . . . . . .
60 61 71 72 76 77 78 78 79 80 81 81 86 87 88 88 89 90 91 92 94 94 97 99 100 101 102 103 103 104 104 106 107 107 108 108 109 110 111 112 112 120 121 123
List of Figures
Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig.
4.4 4.5 4.6 4.7 4.8 4.9 4.10 4.11 4.12 4.13 4.14 4.15 4.16 4.17 4.18 4.19 4.20 4.21 4.22 4.23 4.24 4.25
Fig. Fig. Fig. Fig. Fig. Fig.
4.26 4.27 4.28 4.29 4.30 4.31
Fig. 4.32 Fig. 4.33 Fig. 4.34 Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig.
4.35 4.36 4.37 4.38 4.39 4.40 4.41 4.42 4.43
Unweighted t-Test of Yogurt Price . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weighted t-Test of Yogurt Price . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Unweighted z-Test of Yogurt Price . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weighted z-Test of Yogurt Price . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Unweighted Pooled t-Test Comparing Means . . . . . . . . . . . . . . . . . . . . . . . . Weighted Pooled t-Test Comparing Means . . . . . . . . . . . . . . . . . . . . . . . . . . . Unweighted Pooled z-Test Comparing Means . . . . . . . . . . . . . . . . . . . . . . . Weighted z-Test of Yogurt Price . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Paired T-test Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Missing Value Analysis for Vet Age . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Age Distribution of Vets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mean Age of Vets by Service Branches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ANOVA Table of Age of Vets by Service Branches . . . . . . . . . . . . . . . . . Probability of Incorrect Decision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary of Tukey’s HSD Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary of Plot of Tukey’s HSD Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Heatmap of p-Values of Tukey’s HSD Test. . . . . . . . . . . . . . . . . . . . . . . . . . . Missing Value Report for Question A7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Statistical Test Results for Question A7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Missing Value Report for Question C1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Code for Missing Value Report for Question C1 . . . . . . . . . . . . . . . . . . . . . Summary Table and Pie Chart for Missing Value Report for Question C1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Code to create CATA Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CATA Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Proportion Summary for the VA CATA Question QC1a . . . . . . . . . . . . . Cochrane’s Q Test for the VA CATA Question QC1a . . . . . . . . . . . . . . . . Marascuillo Procedure for the VA CATA Question QC1a . . . . . . . . . . . Results Summary of the Marascuillo Procedure for the VA CATA Question QC1a . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Abbreviated Results Summary of the Marascuillo Procedure for the VA CATA Question QC1a . . . . . . . . . . . . . . . . . . . . . . . . . Response Distribution for the VA Enrollment Question QE1: Pre-Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Response Distribution for the VA Enrollment Question QE1: Post-Cleaning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pivot Table for the VA Enrollment Question QE1: Post-Cleaning . . . Grouped Boxplot of Vets’ Age Distribution . . . . . . . . . . . . . 
. . . . . . . . . . . . . 3-D Bar Chart of VA Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Faceted Bar Chart of VA Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Geographic Map Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Geographic Map Code Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Geographic Map of State of Origin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary of Static and Dynamic Visualization Functionality . . . . . . . Standardized Normal pdf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
xxi
124 124 125 125 128 128 129 129 130 131 132 133 133 140 141 142 143 145 145 148 149 150 151 152 153 154 155 156 157 157 158 159 160 161 163 164 164 165 166 171
xxii
Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig.
List of Figures
4.44 4.45 4.46 4.47 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 5.12 5.13 5.14 5.15 5.16 5.17 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 6.10 6.11 6.12 6.13 6.14 6.15 6.16 6.17 6.18 6.19 6.20 6.21 6.22 6.23 6.24
Chi Square pdf
Student's t pdf
F Distribution pdf
Python Code for 3D Bar Chart
Regression Model of Yogurt Purchases: Set-up
Regression Model of Yogurt Purchases: Results
Regression Display Parts
OLS Prediction Method
OLS Prediction Plot
General Logistic Curve
SFO Missing Value Report
T2B Satisfaction Recoding
Gender Distribution Before Recoding
Gender Distribution After Recoding
SFO Logit Model
SFO Crosstab of Gender and Satisfaction
Odds Ratio Calculation
Odds Ratio Bar Chart
Distribution of Yogurt Consumption per Week
Poisson Regression Set-up
Graphical Depiction of the ANOVA Decomposition
Design Generation Set-up
Design Matrix in a DataFrame
Recoded Design Matrix in a DataFrame
Example Conjoint Card
Conjoint Estimation
Retrieving Estimated Part-Worths
Retrieving Estimated Part-Worths
SFO Likelihood-to-Recommend Data
Recoding of SFO Likelihood-to-Recommend Data
NPS Decision Tree Setup
NPS Decision Tree
Satisfaction and Likelihood-to-Recommend Data Import
Satisfaction and Likelihood-to-Recommend Data Recoding
Satisfaction and Promoter McNemar Test
Venn Diagram of Satisfied and Promoters
Cross-tab of Brand by Segment for the Yogurt Survey
CA Map of Brand by Segment for the Yogurt Survey
CA Summary Table for the Yogurt Survey
First Five Records of Toronto Casino Data
Toronto Casino Data Missing Value Report
Toronto Casino Data Removing White Spaces
Toronto Casino Data Removing Punctuation Marks
Toronto Casino Data Length Calculation
Toronto Casino Data Length Histogram
Fig. 6.25 Toronto Casino Data Length Boxplots
Fig. 6.26 Toronto Casino Data Verbatim Wordcloud
Fig. 7.1 SRS Sample Size Calculation
Fig. 7.2 Stratified Sample Size Calculation
Fig. 7.3 VA Data Recoding
Fig. 7.4 VA Mean Age Calculation
Fig. 7.5 VA Mean Age Calculation with Strata
Fig. 7.6 Simple Tabulation of a Categorical Variable for Counts
Fig. 7.7 Simple Tabulation of a Categorical Variable for Proportions
Fig. 7.8 Simple Cross Tabulation of Two Categorical Variables
Fig. 7.9 One-Sample Test: Hypothesized Mean
Fig. 7.10 Two-Sample Test: Independent Populations
Fig. 8.1 Classical Confidence Interval Example
Fig. 8.2 Coin toss experiment
Fig. 8.3 Informative and Uninformative Priors
Fig. 8.4 Example Markov Chain
Fig. 8.5 Python Code to Generate a Random Walk
Fig. 8.6 Graph of a Random Walk
Fig. 8.7 Quantity Histogram
Fig. 8.8 Quantity Skewness Test
Fig. 8.9 Log Quantity Histogram
Fig. 8.10 Set-up to Estimate OLS Model
Fig. 8.11 Results for the Estimated OLS Model
Fig. 8.12 Regression using pyMC3
Fig. 8.13 Example of Skewed Distribution
Fig. 8.14 Example MAP Estimation
Fig. 8.15 Pooled Regression Summary from pyMC3 Pooled Model
Fig. 8.16 Posterior Distribution Summary Charts
Fig. 8.17 Examples of Trace Plots
Fig. 8.18 Posterior Plots for the Regression Model
Fig. 8.19 Posterior Plot for logIncome for the Regression Model
Fig. 8.20 Posterior Plot Reference Line at 0
Fig. 8.21 Posterior Plot Reference Line at the Median
Fig. 8.22 Null Hypothesis and the HDI
Fig. 8.23 Set-up for Testing the Mean
Fig. 8.24 Trace Diagrams for Testing the Mean
Fig. 8.25 Posterior Distribution for Testing the Mean
Fig. 8.26 Set-up for Testing the Proportion
Fig. 8.27 Trace Diagrams for Testing the Proportion
Fig. 8.28 Posterior Distribution for Testing the Proportion
Fig. 8.29 Z-Test for the Voting Study
Fig. 8.30 Set-up for the MCMC Estimation for the Voting Problem
Fig. 8.31 MCMC Estimation Results for the Voting Problem
Fig. 8.32 Posterior Distributions for the Voting Problem
Fig. 8.33 Posterior Distribution for the Differences Between Parties for the Voting Problem
Fig. 8.34 Logit Model for Voting Intentions: Frequentist Approach
Fig. 8.35 Set-up for Bayesian Logit Estimation
Fig. 8.36 Trace Diagrams for the Bayesian Logit Estimation
Fig. 8.37 Posterior Distribution for the Odds Ratio of the Bayesian Logit
Fig. 8.38 Political Party Odds Ratio Distribution for the Bayesian Logit
Fig. 8.39 Beta Distribution
Fig. 8.40 Normal and Half-Normal Distributions
Fig. 9.1 Changing Data Levels
Fig. 9.2 Multilevel Data Structure: Two Levels
Fig. 9.3 Connection of Levels to the Two Main Fallacies
Fig. 9.4 Pooled Regression with Generated Data
Fig. 9.5 Graph of Pooled Generated Data
Fig. 9.6 Pooled Regression with Dummy Variables
Fig. 9.7 Pooled Regression with Dummy Variables and Interactions
Fig. 9.8 Pooled Regression Summary
Fig. 9.9 Pooled Regression ANOVA Summary
Fig. 9.10 Pooled Regression with Dummy Variables Summary
Fig. 9.11 Set-up for Multilevel Model
Fig. 9.12 Estimation Results for Multilevel Model
Fig. 9.13 Distribution of Check-out Waiting Time by Store Location
Fig. 9.14 Relationship Between Check-out Waiting Time and Price
Fig. 9.15 Level 2 Regression Set-up
List of Tables
Table 1.1 Example of an Analysis Plan
Table 1.2 Examples of Quantities of Interest
Table 1.3 Illustration of Gender Dummy Variables
Table 1.4 Example of Population Parameter Calculations
Table 1.5 Python Package Categories
Table 2.1 Python Packages
Table 2.2 Data Dictionary for the VA Data
Table 2.3 pyreadstat's Returned Attributes
Table 2.4 Example Types of Categorical Survey Variables
Table 2.5 Pandas Summary Measures
Table 2.6 Standardization Methods for Response Bias
Table 2.7 Example 2 × 2 Table
Table 2.8 Pandas Data Types
Table 2.9 DataFrame Styling Options
Table 2.10 Population Distributions for Raking Example
Table 2.11 Sample Contingency Table for Raking Example
Table 3.1 Pandas Statistical Functions
Table 3.2 Crosstab Parameters
Table 3.3 Pivot_Table Parameters
Table 3.4 Matplotlib Annotation Commands
Table 3.5 Illustrative Questions for Pie Charts
Table 3.6 Pandas Plot Kinds
Table 3.7 Weighted Statistics Options
Table 4.1 1-Way Table Layout for Service Branches
Table 4.2 Examples of Effects Coding
Table 4.3 General ANOVA Table Structure
Table 4.4 Stylized Cross-Tab for McNemar Test
Table 4.5 CATA Attribution of Response Differences
Table 5.1 Link Functions
Table 5.2 The General Structure of an ANOVA Table
Table 6.1 Parameters Needed for the Watch Case Study
Table 8.1 Example Voting Intention Table
Table 8.2 Example Prior Distributions
Table 8.3 Side Effects by Store Size
Table 9.1 Omitted Variable Possibilities
Table 9.2 The Problem Cell
Chapter 1
Introduction to Modern Survey Analytics
Contents
1.1 Information and Survey Data
1.2 Demystifying Surveys
1.2.1 Survey Objectives
1.2.2 Target Audience and Sample Size
1.2.3 Screener and Questionnaire Design
1.2.4 Fielding the Study
1.2.5 Data Analysis
1.2.6 Report Writing and Presentation
1.3 Sample Representativeness
1.3.1 Digression on Indicator Variables
1.3.2 Calculating the Population Parameters
1.4 Estimating Population Parameters
1.5 Case Studies
1.5.1 Consumer Study: Yogurt Consumption
1.5.2 Public Sector Study: VA Benefits Survey
1.5.3 Public Opinion Study: Toronto Casino Opinion Survey
1.5.4 Public Opinion Study: San Francisco Airport Customer Satisfaction Survey
1.6 Why Use Python for Survey Data Analysis?
1.7 Why Use Jupyter for Survey Data Analysis?
There are two things, it is often said, that you cannot escape: death and taxes. This is too narrow because there is a third: surveys. You are inundated daily by surveys of all kinds that cover both the private and public spheres of your life. In the private sphere, there are product surveys designed to learn what you buy, use, have, would like to have, and to uncover what you believe is right and wrong about existing products. They are also used to determine the optimal marketing mix that consists of the right product, placement, promotion, and pricing combination to effectively sell products. They are further used to segment the market recognizing that one marketing mix does not equally apply to all customers. There are surveys used to gauge how well the producers of these products perform in all aspects of making, selling, and supporting their products. And there are surveys internal to
those producers to help business managers determine if their employees are happy with their jobs and if they have any ideas for making processes more efficient or have suggestions and advice regarding new reorganization efforts and management changes. In the public sphere, there are political surveys—the “polls”—reported daily in the press that tell us how the public views a “hot” issue for an upcoming election, an initiative with public implications, and a policy change that should be undertaken. There are surveys to inform agencies about who is using the public services they offer, why those services are used, how often they are used, and even if the services are known. Some of these surveys are onetime events meant to provide information and insight for an immediate purpose. They would not be repeated because once conducted, they would have completed their purpose. A private survey to segment the market is done once (or maybe once every, say, 5 years) since the entire business organization is structured around the marketing segments. This includes business unit structure, lines of control and communication, and business or corporate identity. In addition, marketing campaigns are tailored for these segments. All this is founded on surveys. Other surveys are routinely conducted to keep track of market developments, views, and opinions that require unexpected organizational changes. These are tracking studies meant to show trends in key measures. Any one survey in a tracking study is insightful, but it is the collection over time that is more insightful and the actual reason tracking is done in the first place. Counting how many surveys are conducted annually, whether onetime or tracking, is next to impossible because many are proprietary. Businesses normally do not reveal their intelligence gathering efforts because doing so then reveals what their management is thinking or concerned about; this is valuable competitive intelligence for its competition. A further distinction has to be made between a survey per se and the number of people who answer the survey. The latter is the number of completions. The online survey provider SurveyMonkey claims that they alone handle 3 million completions a day. That translates to over a billion completes a year!1 And that is just for one provider. The US government conducts a large number of regular surveys that are used to measure the health of the economy and build valuable data sets for policy makers and business leaders. The US Census Bureau, for example, notes that it “conducts more than 130 surveys each year, including our nation’s largest business survey,” the Annual Retail Trade (ARTS).2
1 See https://www.quora.com/How-many-surveys-are-conducted-each-year-in-the-US#XettS. Last accessed March 29, 2020.
2 See the US Census Bureau's website at https://www.census.gov/programs-surveys/surveyhelp/list-of-surveys/business-surveys.html. Last accessed March 29, 2020.
1.1 Information and Survey Data
The reason surveys are an integral part of modern life is simple: They provide information for decision-making, and decisions in modern, high-tech, and interconnected societies are more complex than in previous periods. There is so much more happening in our society than even 20 years ago at the turn of the millennium. We now have sensors in almost all major appliances, in our cars, and at our street corners; we have social media that has created an entangled network where everyone is connected; we have full-fledged computers in our pockets and purses with more power than the best mainframes of 20 years ago;3 and we have the Internet with all its power, drawbacks, and potential benefits as well as dangers. Technology is changing very rapidly following Moore's Law: "the observation that the number of transistors in a dense integrated circuit (IC) doubles about every 2 years."4 As a result of this rapid and dynamic change in technology, there has been an equally rapid and dynamic change in our social structure, including what we believe, how we work and are organized, how we relate to each other, how we shop, and what we buy. Decision-makers in the private and public spheres of our society and economy must make decisions regarding what to offer, in terms of products and programs, recognizing that whatever they decide to do may, and probably will, change significantly soon after they make that decision.
To keep pace with this rapidly changing world, they need information, penetrating insight, into what people want, what they believe, how they behave, and how that behavior has changed and will change. They can certainly get this from databases, so-called Big Data, but this type of data is, by its collection nature, historical: it reflects what did happen, not what will happen or where the world is headed. The only way to gain this information is by asking people about their beliefs, behaviors, intentions, and so on. This is where surveys are important. They are the vehicle, the source of information, for providing decision-makers with information about what drives or motivates people in a rapidly changing world. Since the possibilities are now so much greater, the speed of deployment and coverage of surveys have also become equally greater issues. More surveys have to be conducted more frequently and in more depth to provide information to key decision-makers. These surveys result in an overload, not of information but of data, because each one produces a lot of data in a very complex form. The data have to be processed, that is, analyzed, to extract the needed information. Data and information are not the same. Information is hidden inside data; it is latent, needing to be extracted. This extraction is not easy; in fact, it is quite onerous. This is certainly not a problem unique to survey data. The Big Data I referred to above has this same problem, a problem most likely just as large and onerous to solve as for surveys.
3 A McKinsey report in 2012 stated "More and more smartphones are as capable as the computers of yesteryear." See Bauer et al. (2012).
4 See https://en.wikipedia.org/wiki/Moore%27s_law. Last accessed January 8, 2021.
Regardless of their source, data must be analyzed to extract the latent information buried inside them. Extraction methodologies could be shallow or deep. Shallow Data Analysis just skims the surface of the data and extracts minimal useful information. The methodologies tend to be simplistic, such as 2 × 2 tables, volumes of crosstabs (i.e., the “tabs”), and pie and bar charts, which are just one-dimensional views of one, maybe two, variables. More insightful, penetrating information is left latent, untapped. Deep Data Analysis digs deeper into the data, searching out relationships and associations across multiple variables and subsets of the data. The methodologies include, but are certainly not limited to, perceptual maps, regression analysis,5 multivariate statistical tests, and scientific data visualization beyond simple pie and bar charts, to mention a few. This book’s focus is Deep Data Analysis for extracting actionable, insightful, and useful information latent in survey data. See Paczkowski (2022) about the connection between data and information and the importance of, and discussions about, information extraction methodologies.
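To make the shallow versus deep distinction concrete, here is a minimal Python sketch on a made-up satisfaction data set; the variable names (gender, income, satisfied) are illustrative assumptions, not data from this book's case studies. The crosstab is the shallow, one-table view; the logit model is one example of digging deeper into relationships across variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up survey responses, for illustration only
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], size=n),
    "income": rng.normal(50, 15, size=n).round(1),  # household income, $000s
})
# Satisfaction depends weakly on income in this toy data
prob = 1 / (1 + np.exp(-(-2 + 0.04 * df["income"])))
df["satisfied"] = rng.binomial(1, prob)

# Shallow analysis: a single two-way view of the data
print(pd.crosstab(df["gender"], df["satisfied"], normalize="index"))

# Deeper analysis: model satisfaction as a function of several variables at once
model = smf.logit("satisfied ~ income + C(gender)", data=df).fit()
print(model.summary())
```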
1.2 Demystifying Surveys
It helps to clarify exactly what a survey is. This may seem odd considering that so many are conducted each year and that you are either responsible for one in your organization, receive reports or summaries of surveys, or have been asked to take part in one. So the chances are you have had some contact with a survey that might lead you to believe you know what surveys are. Many people confuse a survey with something associated with it: the questionnaire. They often treat these terms—survey and questionnaire—as synonymous and interchangeable. Even professionals responsible for all aspects of a survey do this. But they are different. A survey is a process that consists of six hierarchically linked parts:
1. Objective statement
2. Target audience identification
3. Questionnaire development and testing
4. Fielding of the survey
5. Data analysis
6. Results reporting
You can think of these collectively as a survey design. This sequence, of course, is only theoretical since many outside forces intervene in an actual application. I show an example survey design in Fig. 1.1 that highlights these six components. A careful study of them suggests that they could be further grouped into three overarching categories:
1. Planning
2. Execution
3. Analysis
5 Regression analysis includes a family of methods. See Paczkowski (2022) for a discussion.
[Fig. 1.1 is a flow diagram of the six steps: Objective (specify what is to be learned; why do a survey?), Target (identify the target audience; calculate the sample size and margin of error), Questions (design, write, and test the screener and questionnaire), Field (field the study; collect, clean, and organize the data), Analyze (analyze the data; draw conclusions and recommendations), and Report (summarize and report results), grouped under the Planning, Execution, and Analysis categories.]
Fig. 1.1 This illustrates a typical survey process. Although the process is shown as a linear one, it certainly could be nonlinear (e.g., flowing back to a previous step to redo something) as well as iterative
I also show these three categories in Fig. 1.1. The report stage is part of the analysis category because writing a report is often itself an analytical process. Writing, in general, is a creative process during which questions are asked that were not previously thought about but that become obvious and, therefore, need to be addressed (and hopefully the data are available to answer them). I will expound on these six components in the next six subsections.
1.2.1 Survey Objectives
The possible objectives for a survey are enormous to say the least. They can, however, be bucketed into four categories:
1. Fact finding
2. Trend analysis
3. Pattern identification
4. Intentions
Fact finding runs the gamut from current viewpoints to behavioral habits to awareness to familiarity. Current viewpoints include beliefs and opinions such as
satisfaction, political affiliation, and socioeconomic judgments. Current behavioral habits include, as examples: where someone currently shops, the amount purchased on the last shopping visit, the frequency of employment changes, the services either currently or previously used, organizational or professional affiliations, the number of patients seen in a typical week, voted in the last election, proportion of patients in a medical practice who receive a particular medication, and so on. Demographic questions are included in this category because they are facts that aid profiling respondents and identifying more facts about behaviors and items from the other categories. For example, the gender of respondents is a fact that can be used to subdivide professional affiliation. Also in this behavioral habits category are questions about attitudes, interests, and opinions (AIOs).6 These questions are usually simple binary (or trinary) questions or Likert Scale questions. Binary questions require a Yes/No answer, while trinary ones require a Yes/No/Maybe or Yes/No/Don’t Know response. Likert Scale questions typically (but not always) have five points spanning a negative to positive sentiment. Common examples are Disagree-Agree and Dislike-Like. See Vyncke (2002) for some discussion of these types of questions in communications research. Awareness and familiarity are sometimes confused, but they differ. Awareness is just knowing that something exists, but the level of knowledge or experience with that item is flimsy at best. For example, someone could be aware of home medical services offered by Medicare but has never talked to a Medicare representative about care services, read any literature about them, and never spoke to a healthcare provider about what could be useful for his/her situation. Familiarity, on the other hand, is deeper knowledge or experience. The depth is not an issue in most instances because this is probably difficult to assess. Nonetheless, the knowledge level is more extensive so someone could reasonably comment about the item or issue. For the Medicare example, someone might be familiar with home healthcare provisioning for an elderly parent after talking to a Medicare representative or elder-care provider such as an attorney, medical practitioner, and assisted living coordinator. Trend analysis shows how facts are changing over time. This could take two forms: within a survey or between surveys. The first is based on a series of questions about, for example, amount purchased in each of the last few weeks or how much was spent on, say, jewelry the previous year’s holiday season and the recent holiday season. This type of tracking is useful for determining how survey respondents have changed over time and if that change has larger implications for the organization sponsoring the survey (i.e., the client). The between-surveys analysis involves asking the same fact-type questions each time a survey is conducted and then analyzing how the responses have changed over time. These are used to identify new trends useful for the organization or help spot problems that need attention. For example, hospitals track patient satisfaction for various parts of the hospital (e.g., emergency room, front desk assistance, and patient care) on, say, a monthly basis and post the monthly mean satisfaction scores for staff and patients to see.
6 AIO is also used for Activities/Interests/Opinions. See Vyncke (2002) for this use.
Pattern identification is complex in part because it involves some of the facts from the fact-finding objective. In a business context, this could involve market segmentation using behavior and attitudinal responses, or it could be relationships among key measures. An example of relationships may be the key drivers for customer satisfaction or the relationship between the number of units purchased and average prices paid (i.e., a price elasticity), perhaps by demographic groups. Intentions are also complex because they are forward-looking, which means they may be difficult and challenging to measure. Asking someone to specify whether or not they will buy a product, say a diamond ring, this next holiday season may result in a lot of weak data: Most people do not know what they will buy until they buy it. Physicians’ prescription writing intentions for a new medication is another example. Common questions are, “How likely are you to use the new medication XX indicated for YY issue in your practice?” and “Over the next 3 months, do you anticipate that your prescribing of XX will increase, decrease, or remain the same?” In some instances, an experimental approach could be used to learn about intentions, but intentions under different conditions. In marketing (but not limited to marketing), conjoint, discrete choice, and maximum differential (MaxDiff) studies are common. These involve experimental designs and “made-up” or prototype products or objects that survey respondents are asked to study and then select one. The selected item is their choice: what they might buy (or do) under different circumstances. These studies are complex to say the least. See Paczkowski (2018) for a thorough exposition of choice studies for pricing problems and Paczkowski (2020) for their use in new product development.
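As a rough illustration of the experimental set-up behind such choice studies (the attribute names, levels, and prices below are invented for this sketch and are not from the book's case studies), a full-factorial set of candidate profiles can be generated and held in a pandas DataFrame:

```python
from itertools import product
import pandas as pd

# Hypothetical attributes and levels for a conjoint-style exercise
attributes = {
    "brand": ["A", "B", "C"],
    "price": [99, 129, 159],
    "warranty": ["1 year", "2 years"],
}

# Full-factorial design: every combination of levels is a candidate profile (card)
design = pd.DataFrame(list(product(*attributes.values())), columns=list(attributes.keys()))
print(design.shape)   # (18, 3): 3 x 3 x 2 profiles
print(design.head())
```

In practice, a fractional subset of these profiles is usually fielded so that respondents evaluate only a manageable number of cards.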
1.2.2 Target Audience and Sample Size
Identifying the target audience for a survey requires a lot of thought and planning. Survey data are worthless if the wrong people answer the questions. A simple but instructive example is designing a survey of prescription intentions for a new arthritis medication but then recruiting cardiologists to answer the questions. "What sample size should I use?" is almost always one of the first questions a survey researcher asks. Their main concern is often just getting "sample" so that they can write a report.7 This is an important question because there is a cost associated with data collection. The cost aspect is often overlooked. In medical studies, for example, specialty physicians for a new medication study may be difficult to find and equally difficult to persuade to be part of the study, so the cost of recruiting them could be exorbitant. An obvious cost is the direct outlays for data collection: recruiters, honorariums, and processing. If too much data is collected, then costs will obviously be too high. There is also an implicit cost: collecting too little data. If too little is collected, then the risk of poor estimates and, therefore, poor decisions is high. There are thus two costs: C(n), the monetary (i.e., dollar) cost of collecting the sample, and a monetized value of the risk associated with an estimation error due to the wrong sample size. This latter loss can be expressed in expected value terms because, unlike the direct monetary outlays for the sample that are measurable, risk can only be based on a probability of loss. Let this expected value of a (monetized) loss based on the sample size be L(n). An optimal sample size must be determined not only to provide the best estimates of the target variable but also to minimize this total cost associated with the sample. That is, choose n to minimize C(n) + L(n). See Cochrane (1963, p. 82) for a detailed discussion of this problem and its implications for sample size determination. Unfortunately, these costs are infrequently considered when calculating n. The book by Cochrane (1963) is a classic reference for sample size calculations with a special focus on the costs of data collection. Also see Bonett (2016).
7 This is based on my experience working with market research companies of all sizes. In fact, I am often asked this question before I am even told the study objectives. The focus is immediately on the sample.
The question about what sample size to use should not be the first one asked; it should be the last. Many decisions about the sample design and the key statistics to estimate have to be made before the sample size is determined. The sample size depends on these decisions. The key inputs to sample size calculations are:
1. The budget planned for data collection and processing
2. The type of sample to pull (sampling design):
   • Simple random sample (which is hardly ever used)
   • Stratified random sample
   • Cluster random sample
3. The key statistic to estimate and analyze:
   • Total
   • Mean
   • Proportion
   • Ratio
4. The level of confidence in the estimated results (i.e., the risk level to be tolerated)
5. The margin of error
6. The desired power of the estimators
Aside from these, prior information is also needed about:
• Population size
• Expected population means and variances or population proportions
• Strata sizes in the population along with their means or standard deviations (or proportions) if a stratified random sample is used
It should be readily apparent that calculating a sample size is not trivial. And it is because of this nontriviality that many survey-based researchers ignore calculating the optimal sample size, preferring instead to just collect as many sample respondents
as their budget allows or by setting artificial quotas based on "experience."8 Calculating sample size based on a budget is simple. If the budget is $B and the average cost of getting a respondent is $r, then the target sample size is just n = B/r. For example, if the budget for a physician survey is $10,000 and you know from experience that it costs $100 on average to locate and recruit a physician (which means to get them to actually participate), then only $10,000/$100 = 100 physicians can be used. Many survey designers then use this estimate to calculate and report a margin of error.
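The budget arithmetic is easy to script; a minimal sketch using the numbers from the physician example:

```python
import math

budget = 10_000            # total data collection budget, $B
cost_per_respondent = 100  # average cost to locate and recruit one respondent, $r

n = math.floor(budget / cost_per_respondent)  # n = B / r
print(n)  # 100 physicians
```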
1.2.2.1 Key Parameters to Estimate
The estimate is your best guess of the true population parameter you are interested in, either a total, mean, proportion, or ratio. The sample size for a total is the same as for a mean since the total is just the mean times the sample size. Quite often in market research and public opinion polls, a proportion is the quantity of interest because it answers questions such as "What proportion of the target consumers will buy the product?" "What proportion of consumers are in the top box of satisfaction?" "What proportion of likely voters will support the referendum?" There is no rule, however, that states how these quantities should be rank ordered; it all depends on the research objective. Ratios of totals or means are also often of interest. These are rates. Rates are measures per time or per units of another base. The price per unit sold, patients seen per day, or number of times voted per year are examples. More importantly, ratios differ fundamentally from proportions even though, algebraically, they are the same: Both have a numerator and a denominator. The nature and makeup of the two components differ, which distinguishes a ratio from a proportion. The numerator of both is a random variable because the values depend on the data, so a ratio and a proportion are both random variables. A proportion has a fixed denominator that is the sample size, so this denominator is not a random variable. A ratio, however, has a denominator that depends on the sample drawn; it is not fixed as for a proportion. The ratio's denominator is a random variable. It is this "random variable over random variable" nature of a ratio that makes it differ from a proportion because the statistical issues are far more complex with a ratio. See Cochrane (1963) for a discussion of the complexities of ratios.
1.2.2.2 Sample Design to Use
Either a simple random sample (SRS), a stratified random sample, or a cluster random sample can be used. An SRS is actually rarely used because it has been shown to be inefficient relative to other methods. Nonetheless, sometimes you do need to use it when knowledge of strata is not available. If a stratified random sample is used, then information about the strata is necessary. I discuss this information below. See Cochrane (1963), Hansen et al. (1953a,b), Kish (1965), Thompson (1992), and Levy and Lemeshow (2008) for detailed discussions about different sample designs, methods, and issues. Complex sample surveys combine multiple ways of collecting data, but this adds a level of complication that requires specialized software to handle. I will discuss complex sample surveys and their analysis in Chap. 7.
8 I have experienced this myself when working with small market research companies. Large, more sophisticated market research companies, however, calculate sample sizes.
1.2.2.3 Population Size
The population size is critical because it is used to calculate a correction factor called the finite population correction factor (fpc) since sampling is normally done without replacement. This means that the population from which the next sampling unit is drawn is smaller than the previous one. This shrinkage must be accounted for. In most instances, however, the population is large enough that the adjustment has no effect. Nonetheless, the population size is still needed. See Knaub (2008) for a brief discussion about the fpc.
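As a sketch, one common form of the fpc is (N - n)/(N - 1); the exact form varies by textbook, so treat the following as illustrative rather than the book's own calculation:

```python
def fpc(N: int, n: int) -> float:
    """Finite population correction factor in one common form: (N - n) / (N - 1)."""
    return (N - n) / (N - 1)

# The correction only matters when the sample is a sizable share of the population
print(round(fpc(N=1_000_000, n=500), 4))  # ~0.9995: essentially no adjustment
print(round(fpc(N=2_000, n=500), 4))      # ~0.7504: a noticeable shrinkage
```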
1.2.2.4 Alpha
All sample size formulas are based on confidence intervals. The confidence level is typically 95%, meaning that if a large number of samples is drawn and a confidence interval is calculated for each, then 95% of these intervals cover or contain the true population parameter. Then 5% of the intervals will not cover or contain the true parameter. This is the degree to which you expect to be wrong. The 5% is called “alpha” and must be specified in order to calculate a sample size. The most common values for alpha are 5% and 1%. My recommendation is to use 5% unless you have a good reason to change it.
1.2.2.5 Margin of Error
The margin of error is the precision of an estimate. This is often quoted as “± something.” The most common values are ±3%, ±5%, and ±10%. The alpha and margin of error are used together in one statement such as “We want to be within ±3% of the true population proportion 95% of the time” where 95% = 100% − 5% for α = 5%.
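Putting alpha, the margin of error, and a finite population adjustment together, here is a sketch of the standard textbook sample size formula for a proportion. This is a generic formula, not code from the book, and it assumes the conservative worst case p = 0.5:

```python
import math
from statistics import NormalDist

def sample_size_proportion(moe, alpha=0.05, p=0.5, N=None):
    """Textbook sample size for estimating a proportion to within +/- moe."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # e.g., about 1.96 for alpha = 0.05
    n0 = (z ** 2) * p * (1 - p) / moe ** 2    # infinite-population sample size
    if N is not None:                         # finite population adjustment
        n0 = n0 / (1 + (n0 - 1) / N)
    return math.ceil(n0)

# "Within +/-3% of the true population proportion 95% of the time"
print(sample_size_proportion(moe=0.03))           # 1068
print(sample_size_proportion(moe=0.03, N=5_000))  # smaller when the population is small
```

The first call reproduces the familiar requirement of roughly 1,068 respondents for a ±3% margin of error at 95% confidence; supplying a population size shrinks the requirement.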
1.2.2.6 Additional Information
The Coefficient of Variation (CV) is sometimes used to calculate the sample size for estimating the population mean. It is an alternative measure of variation calculated
as the standard deviation divided by the mean: σ/μ, where σ is the population standard deviation and μ is the population mean. The reasoning behind this measure is that the standard deviation is more important, or more indicative of variation, when it is associated with a small mean than with a large one. The mean places or locates the distribution, while the standard deviation is the spread around that location. The CV basically gives meaning and significance to the standard deviation. Suppose the standard deviation is 5 units and CV = 5. If the mean is 1 unit, then the spread is 5 units around 1 unit. But if the mean is 100 units, CV = 0.05 and the spread is 5 units around 100. This is much different. Several quantities are needed to calculate the CV (a short calculation sketch follows this list):
• Estimated population standard deviation – If data from prior studies are available, then they clearly should be used. Otherwise, the standard deviation can be estimated from the scale itself. For instance, if a 7-point Likert scale is used, then an estimate for the standard deviation is 7/6 = 1.167; if a 5-point scale is used, the estimate is 5/6 = 0.8333. The divisor "6" comes from the property of the normal distribution that 99.7% of the data is contained within ±3 standard deviations of the mean, so six standard deviations should cover the scale; that is, the standard deviation is the scale divided by six. In general, the estimate is Standard Deviation = Scale Points/6. This is obviously a crude estimate that would suffice if data on the standard deviation are lacking at the time a sample size is determined. Technically, the mean and standard deviation are inappropriate for Likert Scale items since they are ordinal; they are applicable for interval and ratio data. Nonetheless, what I just described is common.
• Estimated population mean – This can also be based on prior studies if available, or estimated if a scale is used. For instance, if a 7-point scale is used, then an estimate for the mean is 7/2 = 3.5; if a 5-point scale is used, the estimate is 5/2 = 2.5. The divisor is "2" to cut the interval in half. In general, the estimate is Mean = Scale Points/2, the same as the median.
• Estimated population size – This is much more difficult. Obviously, if data are available from prior studies, then they should be used; however, this is not always the case. In fact, an objective for a market research study could be to estimate the market (i.e., population) size. For example, a new product might appeal to a segment of the market, but the size of that segment is unknown and unknowable prior to product launch because the product, by definition of being new, does not have a customer base. Nonetheless, an estimate is needed in order to calculate a sample size for a survey to assess the appeal of the product. One possibility is to simply set the population size to a large number in the sample size formulas. Another possibility is to look at the market size for existing products (i.e., analog products) and use that as a proxy.
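Here is the small sketch of the crude CV implied by these rules of thumb (illustrative only, not the book's code):

```python
def crude_cv_from_scale(points):
    """Crude CV from the rules of thumb above: (points / 6) / (points / 2)."""
    std_dev = points / 6  # six standard deviations cover the scale
    mean = points / 2     # midpoint of the scale
    return std_dev / mean

print(round(crude_cv_from_scale(5), 3))  # 0.333
print(round(crude_cv_from_scale(7), 3))  # 0.333
```

Notice that with these two rules of thumb the crude CV is always 1/3 regardless of the number of scale points, which is one more reason to prefer estimates from prior studies when they exist.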
The sample design may call for a stratification of the population, and so stratification parameters have to be determined. This is the most complicated piece of information because stratification requires even more information. In particular, it requires information about:
• The number of strata
• The strata sizes in the population
• The standard deviation within each stratum in the population or the proportion within each stratum in the population
• An estimate of the overall mean
The sample size has to be allocated to the strata. There are two possible methods to do this (a short sketch follows this list):
• Equal Allocation: The sample size is equally divided across the indicated number of strata. This is probably the most common method because of its simplicity.
• Proportional Allocation: The sample size is allocated based on the proportion of the population in each stratum. This obviously requires information about population proportions by strata.
If Proportional Allocation is used, then the sum of the strata sizes must match the population size. If the strata sizes for the population are unknown, then a good option is to assume they are equally sized. The strata standard deviations and proportions are also needed; if these are unknown for the population, a good option is to assume they are equal across strata.
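Here is the sketch referred to above; the strata names and population shares are hypothetical:

```python
def allocate(n, strata_shares, method="proportional"):
    """Split a total sample size n across strata, equally or by population share."""
    if method == "equal":
        return {s: n // len(strata_shares) for s in strata_shares}
    return {s: round(n * share) for s, share in strata_shares.items()}

# Hypothetical population shares for three strata
shares = {"urban": 0.5, "suburban": 0.3, "rural": 0.2}
print(allocate(600, shares, method="equal"))         # {'urban': 200, 'suburban': 200, 'rural': 200}
print(allocate(600, shares, method="proportional"))  # {'urban': 300, 'suburban': 180, 'rural': 120}
```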
1.2.3 Screener and Questionnaire Design
There are two data collection instruments that have to be written before a survey can be fielded: the screener and the questionnaire. The screener is a mini-questionnaire usually at the beginning of the main questionnaire. Its purpose is to qualify potential respondents for the main study since not all those contacted for the study are the right or appropriate people. You want only those who can provide the information you need for your objectives. Some of the qualifying questions may be used later in the analysis phase. For example, you may target upper-income, college-educated female jewelry purchasers, physicians whose primary medical practice location is in a hospital, or citizens who voted in the last presidential election. Anyone satisfying the criteria advances to the main questionnaire; others are thanked and their further participation terminated. If demographics are part of the screener, then that data can be used as part of the analysis. Designing the screener is, thus, as important and as complex as designing the questionnaire. The questions that form the main questionnaire have to be carefully crafted and written to avoid having leading, irrelevant, or biased questions. In addition, you have to make sure that nothing is omitted since, after all, it would be too costly to redo the study. The length of the questionnaire must also be considered. Finally, the screener and questionnaire should be tested (sometimes called pretesting) to assure that they
are clear and understandable to the respondents. See Bradburn et al. (2004) for an extensive discussion about questionnaire writing.
Questionnaires per se are not the focus of this book. This does not mean, however, that I cannot discuss their general structure and use this as a foundation for the analysis principles I cover throughout this book. In general, questionnaires are divided into two parts:
1. Surround Questions
2. Core Questions
Figure 1.2 illustrates these two parts and their connection. Surround Questions are those that provide the framework and background—the facts—needed to understand the survey respondents and to help analyze the Core Questions. Demographic questions are a good example. All surveys have a demographic section used for profiling respondents. In fact, one of the first steps in the analysis phase of the survey process is to develop respondent profiles. These often form the first section of a report, including an Executive Summary.
Fig. 1.2 This chart illustrates the general structure of a questionnaire. The Core Questions are the heart of the study. The Surround Questions provide the background on the respondents (e.g., the demographics) including their behaviors, knowledge, AIOs, and so forth. They are used to support the analysis of the Core Questions
AIOs are also in this category since they are also used to frame the analysis of the Core Questions. The Core Questions are the heart of the survey. They get to the set of objectives specified as the first step in designing a study. These are what the study is all about. I provide examples in Sect. 1.5 of Surround and Core Questions for the Case Studies I use in this book.
1.2.4 Fielding the Study
Fielding the survey means recruiting the survey participants, administering the survey, collecting and compiling the data, and distributing the data to the analysts. An important part of fielding is quality control. This includes checking for missing data, ensuring skip patterns were followed, and ensuring that responses were provided within acceptable ranges. With modern online survey tools, most of this is done automatically by survey software, but you should, nonetheless, run your own quality checks. Part of fielding a study is developing a list of people to contact and invite to participate. Usually, a sampling frame is developed for this purpose. A sampling frame is a subset of the population, but a subset for which you have contact information such as phone numbers or email addresses. You may not, and usually do not, have this information for the entire target population simply because it is too costly to collect this information. See Kish (1965) and Levy and Lemeshow (2008) for discussions about sampling frames.
1.2.5 Data Analysis
The analysis phase is what this book is about. Analysis is within the context of the Surround and Core Questions. There are two levels of analysis. Which one you use depends on your sophistication, the complexity of the core objective(s), and the background of the potential report readers. The two levels are the Shallow Data Analysis and Deep Data Analysis I introduced above in Sect. 1.1. Shallow Data Analysis only skims the surface of what can be done with survey data. There is a lot more that could be extracted from the data, but this is not done. Simple analytical tools are used, mostly pie and bar charts and simple tables. Infographics are more the norm than the exception. Deep Data Analysis, on the other hand, relies on sophisticated modeling (e.g., linear regression, logistic regression, decision trees, and so forth) that requires advanced analytical training and software, not to forget programming knowledge to manipulate the data. See Paczkowski (2020) for a discussion of deep analytics for new product development and Paczkowski (2022) for deep data analysis for business analytics.
An analysis plan is a vital part of this phase. Like all other activities you do, you will do better, you will succeed, if you have a plan. The plan does not have
to be in great detail with each activity and step perfectly outlined. This might, in the end, be counterproductive because you might tend to follow "the letter of the plan," meaning you would not deviate from it no matter what you might find that would be new, unexpected, and interesting. You need to be flexible. A detailed plan may not give you that flexibility. Nonetheless, one is needed so that you act on the focus of the survey by analyzing the Core Questions using the appropriate Surround Questions. Otherwise, you will stray. An analysis plan should have several parts:
1. Question to analyze including its role in the study
2. Analysis type or form
3. Key functions and/or software (optional)
4. Comments
The questions should be divided into two groups based on their role: Core or Surround Questions. I recommend analyzing the Surround Questions first since they are the background for the Core Questions. This is where the simple analyses could justifiably be done: tables, pie charts, and bar charts. The demographic questions should come first since they profile respondents. A pie (preferably bar) chart for gender and bar charts for income, education, age, and so forth, if applicable to the study, should be used. The same holds for other Surround Questions. The Core Questions should be analyzed using the methods I discuss in this book with the Surround Questions as support, filters, classifiers, and clarifiers. You should make note of the key functions and, if necessary, software you will use. There are many different functions available, and you will most likely develop your own library of functions. You should note these to streamline your work and keep you on the plan. Finally, you will most likely want to make notes (i.e., comments) of what you plan to do. These could be about extra data sources or issues to watch for that may impact your analyses. Comments are always very important and should not be overlooked. I show a useful format for an analysis plan in Table 1.1. This is only a suggestion but one I have found to be useful. This is for a fictitious transit study focused on measuring subway rider satisfaction.
Table 1.1 This is a short example of an analysis plan. You will have several Core Questions so you might want to label them. The "Core #1" reference is to the first Core Question

Question | Role | Analysis form | Functions/software | Comments
S2: Gender | Surround | Pie or bar chart | Seaborn pie | Use to test differences in proportions and means for core; use to filter for drilldown.
SW1: Subway | Core #1 | Test T2B satisfaction | Recode function; value counts | Compare transit forms for differences in satisfaction.
A target number of males and females was specified, so a question about gender was placed in the screener to ensure getting the target sample sizes. The designation "S2: Gender" in Table 1.1 refers to the second screener question, which is "What is your gender?" This is a Surround Question. Subway satisfaction, a Core Question, is designated "SW1: Subway" in the subway ("SW") section of the main questionnaire.
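Here is a sketch of what the plan's recode and value-counts entries might look like in Python; the column names (S2_gender, SW1_subway_sat) and the data are invented stand-ins for the fictitious transit study:

```python
import pandas as pd

# Hypothetical subway satisfaction responses on a 5-point scale (5 = very satisfied)
df = pd.DataFrame({
    "S2_gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "SW1_subway_sat": [5, 4, 2, 3, 5, 1],
})

# Top-two-box (T2B) recode: a 4 or 5 counts as satisfied
df["SW1_t2b"] = (df["SW1_subway_sat"] >= 4).astype(int)

print(df["SW1_t2b"].value_counts(normalize=True))  # overall T2B share
print(df.groupby("S2_gender")["SW1_t2b"].mean())   # T2B share by gender
```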
1.2.6 Report Writing and Presentation
The reporting phase, the last part of the survey process, is an art form many researchers rush through at the last minute just to complete the project. Reports are not the focus of this book. See Rea and Parker (2005) for suggestions and advice on developing reports. Also see Tufte (1983) for a general view of presenting any type of quantitative data. I have found that reports contain at least the following sections (in order):
1. Project statement and objectives
2. Methodologies summary
   • Sample collection, size, and margin of error summary
3. Executive summary
4. Respondent profiles/demographics (i.e., Surround Question recap)
5. Main analysis (i.e., Core Question analysis)
6. Recommendation
7. Overall summary and highlights
These sections are only recommendations, but they do cover, nonetheless, the basics of a report.
1.3 Sample Representativeness
A common perspective of a sample is that it must be a microcosm of the target population, reflecting the central characteristics and features of interest (i.e., the study variables) and the demographics of the population. In other words, the sample must be "representative." Unfortunately, few people provide an adequate definition of representativeness. What constitutes a representative sample? Grafstrom and Schelin (2014) and Ramsey and Hewitt (2005) address this question, with Ramsey and Hewitt (2005, p. 71) taking the position that a sample is representative only in the "context of the question the data are supposed to address. In the simplest terms, if the data can answer the question, it is representative." This is not very informative because it is not clear what criteria should be used to determine that an objective question has been answered.
Grafstrom and Schelin (2014) provide a more formal mathematical definition, but basically, a sample is representative if the proportion of one group in the sample is the same as the corresponding group's proportion in the target population. For instance, the sample is representative if the proportion of young males in the target population equals the proportion of young males in the sample. They express this as N_k/N ≈ n_k/n, where k is the group, n is the sample size, and N is the size of the target population. Proportions are used because the sample is obviously just a subset. A proportionality is used in the definition because of sampling error, which can be attributed to "luck of the draw" in selecting a sample rather than doing a census. If sampling error is zero, then N_k/N = n_k/n. Sampling error, unfortunately, is never zero, as noted by Ramsey and Hewitt (2005), so a proportionality factor is used. It is also used because of systematic errors in sampling beyond sampling error, for example, oversampling a subgroup of the population. In a jewelry study, for instance, women may be oversampled because they are the main wearers of jewelry. Results based on data that do not reflect the underlying population of men and women may, and probably will, be biased. An example cited by Voss et al. (1995) regarding public opinion polls is especially relevant because of the importance of these polls in presidential elections. They note that African Americans can be underrepresented in polls, thus strongly biasing results since African Americans tend to vote more Democratic. For a thorough study of the Democratic orientation of African Americans, see White and Laird (2020).
An issue with the proportionality definition is the use of a group. What group? Only one group? My example referred to males in a gender classification, and since you know the proportion of males in the population, you also know the proportion of females. What if the classification is ethnicity or race? There are many groups in the ethnicity and race classification. According to the US Census Bureau, there are five racial categories plus an additional category for more than two races.9 Do the sample racial proportions have to agree with all five census proportions? What happens if gender is thrown into the mix? This is a complicated expansion on the simple definition, but one that has to be addressed. The answer to the question "What proportion or set of proportions?" is addressed through weighting. A sample will thus never be representative because of these two types of errors: sampling error and nonresponse error. Appropriately developed weights, however, will bring the sample proportions into agreement with population proportions so that the sample will be representative on the basis of the categories used to develop the weights. I discuss weights and how they are calculated in Chap. 2.
9 According to the US Census Bureau, "OMB requires that race data be collected for a minimum of five groups: White, Black or African American, American Indian or Alaska Native, Asian, and Native Hawaiian or Other Pacific Islander. OMB permits the Census Bureau to also use a sixth category—Some Other Race. Respondents may report more than one race." See https://www.census.gov/quickfacts/fact/note/US/RHI625219#:~:text=OMB%20requires%20that%20race%20data,report%20more%20than%20one%20race. Last accessed January 8, 2021.
Table 1.2 These are a few examples of quantities of interest about the population. There are, of course, an infinite number of possibilities since there are an infinite number of problems addressed in each of these areas. All these are estimated using survey data

Quantity   | Private sector                                                                        | Public sector                                                     | Public opinion
Total      | # Buyers; # Units will buy X; $ will spend on X; # Patients treated for Y            | # Receiving benefits; # Unemployed; # Weeks unemployed            | # Voters; # In favor of referendum
Mean       | Average expenditure on food                                                           | Average amount of public assistance                               | Average time in a voting booth
Proportion | Proportion satisfied; Proportion of patients with disease; Proportion of male buyers | Proportion of veterans receiving benefits; Proportion unemployed  | Proportion of voters for Democrats; Proportion of voters by race/ethnicity/gender
Ratio      | Per capita spending                                                                   | Veterans medical expenditures to years of service                 | Number of registered voters per household
When you conduct a survey, regardless of whether it is in the private or public sector or is a public opinion survey, there are four population parameters of primary interest. A parameter is an unknown, constant numeric characteristic of the population. The parameters are for a quantity of interest, a Core Question of the survey. The population parameters of prime interest in most surveys are the

1. total,
2. mean,
3. proportion, and
4. ratio

of the quantity of interest, as I noted above. Of course, a single study could be concerned with all four for any subset of variables, but nonetheless, these are the four main values of interest. I list a few examples in Table 1.2. These parameters will be estimated using sample data.

The mean and proportion are related to the total of the quantity of interest. A mean is just the total scaled, or normalized, by the population size, while a proportion is just a special case of a mean: the case in which all the values are either 0 or 1, so the proportion is just the mean of these values. The mean is used when the quantity of interest is a continuous measure, or could be interpreted as a continuous measure, for example, the price of a product in the private sector, the size of unemployment compensation in the public sector, or the time spent in a voting booth in a public opinion survey. The proportion is used when the quantity of interest is categorical so that objects of interest (e.g., patients, buyers, voters) can be assigned labels (e.g., wellness or illness doctor visitor, buyer or non-buyer, voter or nonvoter) identifying discrete groups, sometimes called levels or categories. These are interchangeable terms, and I will use them as such. The categories are mutually exclusive and completely exhaustive, meaning that an object can be assigned to only one category and there is no chance that the object is not assigned to any category.
Table 1.3 These are five people with labels (i.e., character strings) designating their gender. Two dummy variables numerically encode the gender designations

Person ID | Gender | Male dummy | Female dummy
1         | Male   | 1          | 0
2         | Female | 0          | 1
3         | Female | 0          | 1
4         | Male   | 1          | 0
5         | Female | 0          | 1
Gender is a classic example. The labels for the categories are words or character strings (e.g., "male" and "female") that cannot be used in calculations. They have to be encoded numerically so that calculations are possible. The values 0 and 1, called dummy values or one-hot values, are usually used. The resulting numeric variable is called a dummy variable or a one-hot variable, another pair of interchangeable terms. If a concept has J ≥ 2 categories, then J dummy variables are created; J = 2 is the minimum. A dummy variable is formally defined as

$$D_j = \begin{cases} 1 & \text{if object in category } j \\ 0 & \text{otherwise} \end{cases} \qquad (1.1)$$
As an example, for the gender data in Table 1.3, the two dummy variables are

$$D_{Male} = \begin{cases} 1 & \text{if gender is male} \\ 0 & \text{otherwise} \end{cases} \qquad (1.2)$$

$$D_{Female} = \begin{cases} 1 & \text{if gender is female} \\ 0 & \text{otherwise} \end{cases} \qquad (1.3)$$
You can create one dummy variable for each category of a concept. So gender would have two dummies: one for the male label and one for the female label. If a person is male, then the corresponding element in the male dummy variable is coded as 1 and the corresponding element in the female dummy variable is coded as 0. The opposite holds if the person is female. I illustrate this in Table 1.3. A proportion is calculated as the sum of the 0/1 values for a dummy variable divided by the number of observations. So a proportion is just a mean. In Table 1.3, the mean for the female dummy variable is 3/5 = 0.60. This is the proportion who are female. Obviously, the mean for males (i.e., the proportion) is 2/5 = 0.40. Notice that the proportions sum to 1.0. Also notice that the sum of the male dummy variable and the female dummy variable is 5, the sample size. A ratio is a rate such as the number of cases a physician sees to the number seen in a practice or the average household expenditure on food to the average household income.
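To make the dummy variable and proportion calculations concrete, here is a minimal Pandas sketch using the five people in Table 1.3 (the column names are my own):

    import pandas as pd

    # The five people in Table 1.3
    df = pd.DataFrame({
        'person_id': [1, 2, 3, 4, 5],
        'gender': ['Male', 'Female', 'Female', 'Male', 'Female']
    })

    # One dummy (one-hot) variable per category
    dummies = pd.get_dummies(df['gender'])

    # The mean of a dummy variable is the proportion in that category
    print(dummies.mean())    # Female 0.6, Male 0.4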
1.3.1 Digression on Indicator Variables

A dummy variable is also called an indicator variable because it indicates an object's membership in a category. The notation in (1.1) is cumbersome, so I find it more efficient to use indicator function notation, defined as

$$I(x) = \begin{cases} 1 & \text{if } x \text{ is true} \\ 0 & \text{otherwise} \end{cases} \qquad (1.4)$$
As a function, it returns a 1 if the argument is true and a 0 otherwise, although other values can be returned depending on the problem. For the gender example, the indicator function I(male) returns 1 if gender is male, 0 otherwise. Similarly, I(female) returns 1 if gender is female, 0 otherwise. This is more compact notation, and I will use it below, especially in Chap. 7. The indicator function can also be written using set notation:

$$I_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases} \qquad (1.5)$$
where A is a set of elements. From this perspective, you can define the indicator functions for two set operations, intersection and union:

$$I_{A \cap B}(x) = \min\left(I_A(x), I_B(x)\right) \qquad (1.6)$$
$$= I_A(x) \cdot I_B(x) \qquad (1.7)$$

and

$$I_{A \cup B}(x) = \max\left(I_A(x), I_B(x)\right) \qquad (1.8)$$
$$= I_A(x) + I_B(x) - I_A(x) \cdot I_B(x) \qquad (1.9)$$
These follow directly from the definitions of intersection and union in set theory. An indicator function can be easily defined in Python using a list comprehension. I illustrate one possibility in Fig. 1.3. A list comprehension is an effective tool for creating a list, which is a Python container for holding objects. A list comprehension structure is quite simple: [ expression for member in iterable (if conditional) ] I will mention this again in Chap. 2.
Fig. 1.3 An indicator function
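Fig. 1.3 is not reproduced here, but one possible indicator function built with a list comprehension, which may differ from the author's version, is:

    # I(x) for the event "gender is female": 1 if true, 0 otherwise
    gender = ['Male', 'Female', 'Female', 'Male', 'Female']
    i_female = [1 if g == 'Female' else 0 for g in gender]

    print(i_female)                        # [0, 1, 1, 0, 1]
    print(sum(i_female) / len(i_female))   # proportion female = 0.6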
1.3.2 Calculating the Population Parameters

Let U be the target population and N be its size. Let Y be the variable of interest, such as one of the items I listed in Table 1.2. The population total for Y is

$$Y = \sum_{i=1}^{N} Y_i \qquad (1.10)$$
where i designates an object in the target population such as a person, household, firm, and so forth. The population mean is

$$\bar{Y} = \frac{1}{N} \times \sum_{i=1}^{N} Y_i. \qquad (1.11)$$
The population proportion is a special case where there is a discrete variable representing a segment of the population, such as gender. Let the segment have K categories or levels, and let $I_k(i)$ be the indicator function for object i's membership in level k of the category, k = 1, 2, ..., K. The proportion in level k is

$$P_k = \frac{1}{N} \times \sum_{i=1}^{N} I_k(i). \qquad (1.12)$$
Table 1.4 This is an example of a population of N = 10 voters. Their time in the voting booth to the nearest whole minute was recorded: $Y = \sum_{i=1}^{10} Y_i = 40$ minutes; $\bar{Y} = \frac{1}{10} \times \sum_{i=1}^{10} Y_i = 4$ minutes; $P_{Male} = \frac{1}{10} \times \sum_{i=1}^{10} I_{Male}(i) = 0.60$; $Y_{Female} = \sum_{i=1}^{10} Y_i \times I_{Female}(i) = 12$ minutes; $\bar{Y}_{Female} = \frac{1}{4} \times \sum_{i=1}^{10} Y_i \times I_{Female}(i) = 3$ minutes. The number of female voters is $\sum_{i=1}^{10} I_{Female}(i) = 4$

Population member (i) | Gender | Time voting (minutes; Y_i) | I_Male(i) | I_Female(i)
1                     | Male   | 5                          | 1         | 0
2                     | Male   | 6                          | 1         | 0
3                     | Female | 4                          | 0         | 1
4                     | Female | 5                          | 0         | 1
5                     | Male   | 3                          | 1         | 0
6                     | Female | 2                          | 0         | 1
7                     | Male   | 4                          | 1         | 0
8                     | Female | 1                          | 0         | 1
9                     | Male   | 6                          | 1         | 0
10                    | Male   | 4                          | 1         | 0
You could, incidentally, find the total and mean for the segments. In this case, they are calculated as

$$Y_k = \sum_{i=1}^{N} Y_i \times I_k(i) \qquad (1.13)$$

and

$$\bar{Y}_k = \frac{1}{\sum_{i=1}^{N} I_k(i)} \times \sum_{i=1}^{N} Y_i \times I_k(i). \qquad (1.14)$$
I illustrate how these three population parameters are calculated in Table 1.4. The problem with these calculations is that they are based on unknown quantities because you cannot collect data on the population unless, of course, you conduct a census. The data, obviously, come from a survey. The formulas have to be modified for use with this data. I will discuss the modifications in the next section.
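The Table 1.4 calculations can be checked with a few lines of plain Python (the lists simply transcribe the table):

    # Table 1.4: time in the voting booth (minutes) and gender for N = 10 voters
    times  = [5, 6, 4, 5, 3, 2, 4, 1, 6, 4]
    gender = ['Male', 'Male', 'Female', 'Female', 'Male',
              'Female', 'Male', 'Female', 'Male', 'Male']

    N = len(times)
    i_male   = [1 if g == 'Male' else 0 for g in gender]
    i_female = [1 if g == 'Female' else 0 for g in gender]

    total = sum(times)                                            # Y = 40
    mean = total / N                                              # Y-bar = 4.0
    p_male = sum(i_male) / N                                      # P_Male = 0.6
    total_female = sum(y * f for y, f in zip(times, i_female))    # Y_Female = 12
    mean_female = total_female / sum(i_female)                    # Y-bar_Female = 3.0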
1.4 Estimating Population Parameters

The three population parameters (i.e., total, mean, and proportion) for the quantity of interest are estimated using the formulas I stated above, adjusted by an extra parameter: the inclusion probability. This
inclusion probability is necessary because any one object, object i, may be in the target population but may or may not be in the sample; there is a probability of being included. Let n be the sample size. The number of samples, all of the same size n, that can be drawn from the target population is calculated using

$$\binom{N}{n} = \frac{N!}{n! \times (N - n)!} \qquad (1.15)$$
where $N! = N \times (N-1) \times (N-2) \times \ldots \times 1$ is N factorial, and similarly for n!. Equation (1.15) gives the number of combinations of n items selected from N items overall. See Fuller (2009) for a formal development of this and other combinatorial results. For our problem, this is the count of all possible samples, each of size n.

The basic frequency definition of a probability is the size of an event space relative to the size of a sample space. The former is the count of all events of interest that could happen, and the latter is the count of all possibilities. A simple example is pulling an ace (the event) from a deck of cards. The event space is the number of aces (4), and the sample space is the number of cards (52). The probability of an ace is 4/52 = 1/13. See, for example, Scheaffer (1990) and Weiss (2005). The name "sample space" is appropriate in this case because it is the count of all possible samples. The size of the single event "select at random any one sample of size n" is 1, so the size of the event space is 1. The probability of randomly selecting one sample, s, from the sample space is then

$$Pr(s) = \frac{1}{\binom{N}{n}}. \qquad (1.16)$$
As an example, let the population be the set P = {A, B, C, D, E}, so N = 5, and let n = 2. Then the number of possible samples of size 2 that could be collected is $\binom{N}{n} = \binom{5}{2} = 10$: S = {AB, AC, AD, AE, BC, BD, BE, CD, CE, DE}. Note that each of these 10 combinations is a possible sample. The probability of selecting any one sample, s ∈ S, from this sample space is Pr(s ∈ S) = 1/10 = 0.10. Now consider one object, i, from the target population. It can be included in one (or more) of the samples or in none of them. For example, let i be the letter A ∈ P. It appears in four samples. The count of the number of samples that include i is

$$\binom{1}{1} \times \binom{N-1}{n-1} = 1 \times \frac{(N-1)!}{(n-1)! \times (N-n)!} \qquad (1.17)$$
The $\binom{1}{1}$ term represents object i itself; the remaining n − 1 members of the sample are then chosen from the other N − 1 objects. For my example above, the letter A can be paired with four other letters: the number of pairs containing A is $\binom{5-1}{2-1} = 4$. So A is included in four of the ten samples. A quick check of the list of pairs above shows that this is correct.
The $\binom{N-1}{n-1}$ term is the size of the event space for object i. The probability of i being included in a sample is then

$$\pi_i = \binom{N-1}{n-1} \times Pr(s) \qquad (1.18)$$
$$= \frac{\binom{N-1}{n-1}}{\binom{N}{n}} \qquad (1.19)$$
$$= \frac{n}{N} \qquad (1.20)$$
The quantity $\pi_i$ is the inclusion probability. It is used in the estimators for the target population parameters. See Fuller (2009) for a discussion; my derivation closely follows his. Notice that for my example, $\pi_A = 2/5 = 0.40 = 4/10$. Consider the estimator for the target population total given by

$$\hat{Y} = \sum_{i=1}^{N} \frac{1}{\pi_i} \times Y_i \times I(i) \qquad (1.21)$$
$$= \sum_{i=1}^{N} w_i \times Y_i \times I(i) \qquad (1.22)$$
where I(i) is the indicator function for i being selected from the population to be in the sample. The weight $w_i = 1/\pi_i$ is placed on each member of the target population for their inclusion in the sample. This is the Horvitz-Thompson estimator; see Horvitz and Thompson (1952). There are comparable statements for the mean and proportion. I will discuss the calculation of weights in Sect. 2.7 and illustrate their use several times throughout this book, especially in Chap. 7.

Recall my earlier discussion about ratios, in particular that the denominator of a ratio is a random variable because it depends on the sample draw. The example set S = {AB, AC, AD, AE, BC, BD, BE, CD, CE, DE} should help make the ratio issue clear. Suppose that the elements of S are households. Any sample of size n = 2 will contain different households than another sample of size n = 2. You could collect data from each household on its food expenditure and total household income and then calculate the "proportion" of income spent on food. But this proportion will vary with the sample, not to mention with the households within a sample. Since there are 10 possible samples, you could have 10 possible "proportions," each associated with a sample that has its own selection probability. The ratios would vary based on the luck of the draw.
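The S = {A, B, ..., E} example can be worked through with itertools; the Y values attached to the letters below are made up purely to show the Horvitz-Thompson arithmetic:

    from itertools import combinations

    population = ['A', 'B', 'C', 'D', 'E']
    n = 2
    samples = list(combinations(population, n))
    print(len(samples))                              # 10 possible samples of size 2

    # Inclusion probability of A: share of samples containing it
    pi_A = sum('A' in s for s in samples) / len(samples)
    print(pi_A)                                      # 0.4 = n/N

    # Horvitz-Thompson estimate of the population total from one sample,
    # using hypothetical Y values and equal inclusion probabilities n/N
    y = {'A': 10, 'B': 20, 'C': 30, 'D': 40, 'E': 50}
    sample = samples[0]                              # ('A', 'B')
    pi = n / len(population)
    y_hat = sum(y[i] / pi for i in sample)           # weight w_i = 1/pi_i
    print(y_hat)                                     # 75.0; the true total is 150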
1.5 Case Studies

In this section, I will introduce the case studies I will use throughout the book. One is a consumer study, one is a public sector study, and two are public opinion studies. I am separating the two public opinion surveys from the public sector one because even though they are "public," they are not about a government agency or other entity. I will reserve "public sector" for an agency or operation controlled by any level of government (i.e., federal, state, county, or local). I will occasionally refer to other surveys, but these four are the main ones.
1.5.1 Consumer Study: Yogurt Consumption

This is a fictitious study of yogurt-buying consumers.10 The client organization, a major producer of yogurts, commissioned a study so they could learn about yogurt consumption patterns as well as flavor and brand preferences. The information about flavor preferences will be used in new product development. Consumers were asked questions about:

• The brand of yogurt they purchase
• Where they buy it
• How much they pay
• How much they buy
• Their favorite flavors
• Attribute importance and satisfaction
• Overall brand satisfaction
• Their demographics (e.g., gender, age, household income) to profile them
There are six brands: the client's brand, what the client considers to be their major competitor, and four other yogurt producers. The Surround Questions covered the consumers' buying behavior such as their shopping and purchasing, demographics, etc. There were four Core Questions that defined the main focus of the study:

Core Question I: Flavor preference
Core Question II: Brand-flavor relationship
Core Question III: Brand-segment relationship
Core Question IV: Price elasticity analysis
It was determined from internal studies and various syndicated industry reports examined by the client that there are 240,780,262 yogurt consumers in the United States.10

10 Based on fictional data for illustrative purposes only. Source: Paczkowski, W. R. Market Data Analysis. SAS Press (2016). Copyright 2016, SAS Institute Inc., Cary, NC, USA. All Rights Reserved. Reproduced with permission of SAS Institute Inc., Cary, NC.
Fig. 1.4 This code snippet illustrates the calculation of the sample size. The population standard deviation was based on prior studies. The sample size was calculated for the mean number of units of yogurt purchased
The sample size was n = 2000, which was actually a number the client required based on its experience. The client also knows from prior studies that the population standard deviation is 0.7 yogurts, and they specified a margin of error of ±3%. A theoretical sample size was determined for a simple random sample and then rounded down per the client's request. I show the theoretical sample size calculation in Fig. 1.4. This is based on

$$ME = Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} \qquad (1.23)$$

where ME is the margin of error and Z is the normal quantile. See Weiss (2005, p. 374). The sample size is then determined by

$$n = \left( \frac{Z_{\alpha/2} \times \sigma}{ME} \right)^2 \qquad (1.24)$$
which you can see in Fig. 1.4. Other sample size formulas are more complex. See Cochrane (1963), Levy and Lemeshow (2008), and Thompson (1992). I also show the structure for the first Core Question, flavor preference, in Fig. 1.5. Notice that this Core Question requires a number of Surround Questions to provide data elements for addressing it.
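Fig. 1.4 shows the author's calculation; the sketch below reproduces the same arithmetic under two assumptions of mine: a 95% confidence level and the ±3% margin interpreted as 0.03 units.

    from scipy.stats import norm

    sigma = 0.7                      # population standard deviation from prior studies
    me = 0.03                        # assumed numeric margin of error
    z = norm.ppf(1 - 0.05 / 2)       # ~1.96 for a 95% confidence level (assumption)

    n = (z * sigma / me) ** 2
    print(round(n))                  # about 2091, rounded down by the client to 2000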
Fig. 1.5 This chart is a simplification of the structure of the yogurt consumers’ questionnaire. Even though there are four Core Questions, I illustrate only one here: “What flavors of yogurt do consumers prefer?” The Surround Questions provide the background on the respondents (e.g., the demographics) including their behaviors, knowledge, AIOs, and so forth. They are used to support the analysis of the Core Questions
1.5.2 Public Sector Study: VA Benefits Survey

This case study is based on the National Survey of Veterans, Active Duty Service Members, Demobilized National Guard and Reserve Members, Family Members, and Surviving Spouses conducted in 2010 by the Veterans Administration.11 My focus will be the veterans. The VA designed this survey to help plan future programs and services for veterans. The sample size was 8710 veterans representing the Army, Navy, Air Force, Marine Corps, Coast Guard, and Other Services (e.g., Public Health Services, Environmental Services Administration, NOAA, and US Merchant Marine). The total number of veterans in 2010 was 22,172,806. I provide a summary of the population control totals in Fig. 1.6. I will discuss the use of these totals, especially the total number of veterans, in the next chapter.
11 Source: http://www.va.gov/VETDATA/Surveys.asp.
Fig. 1.6 VA study population control totals. Source: “National Survey of Veterans Detailed Description of Weighting Procedures.” Appendix B-1. http://www.va.gov/VETDATA/Surveys.asp. Last accessed April 10, 2016
The questionnaire is divided into 15 sections:

1. Background
2. Familiarity with Veteran Benefits
3. Disability/Vocational Rehabilitation
4. Health Status
5. Healthcare
6. Health Insurance
7. Education and Training
8. Employment
9. Life Insurance
10. Home Loans
11. Burial Benefits
12. Burial Plans
13. Internet Use
14. Income
15. Demographics
There are 8710 respondents and 614 variables plus a sampling weight variable. Variables were in alphabetical order by “new” variable names that did not agree with the questionnaire, so I changed the names to match the questionnaire. I also assigned a missing value code to Don’t Know responses and deleted “junk” variables such as loop counters. The final variable count is 514 plus 1 for a veteran ID for a total of 515. A Core Question is the vets’ enrollment in VA-provided healthcare programs. I illustrate this Core Question and its Surround Questions in Fig. 1.7.
1.5.3 Public Opinion Study: Toronto Casino Opinion Survey

In 2012, the City Council for Toronto, Canada, considered a proposal to develop a new casino within the Toronto city limits or to expand an existing racetrack, the Woodbine Racetrack, for gaming. They conducted a survey, which they called a "Casino Feedback Form," both online and in person, to obtain input from the public and key stakeholders regarding the opportunities, issues, and challenges associated with the two proposals.
Fig. 1.7 This chart is a simplification of the structure of the veterans’ questionnaire. The Core Question for this example is: “Why are veterans not enrolling in healthcare programs?” The Surround Questions provide the background on the respondents (e.g., the demographics) including their behaviors, knowledge, AIOs, and so forth. They are used to support the analysis of the Core Questions
This input allowed the City Council to make an informed decision whether or not to approve the casino or expand the racetrack. The questionnaire consisted of 11 questions, a combination of verbatim text responses and checkbox questions. The sample size was n = 17,780. The final data set online has n = 17,766, with 14 respondents deleted for unknown reasons.12 I show the structure for the casino survey in Fig. 1.8.
12 The data and questionnaire can be found at https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#16257dc8-9f8d-5ad2-4116-49a0832287ef and https://ckan0.cf.opendata.inter.prod-toronto.ca/ne/dataset/casino-survey-results.
Fig. 1.8 This chart is a simplification of the structure of the Toronto City Council's questionnaire on a casino proposal. The Core Question for this example is: "Should a new casino be approved?" The Surround Questions provide the background on the respondents (e.g., the demographics) including their behaviors, knowledge, AIOs, and so forth. They are used to support the analysis of the Core Questions
1.5.4 Public Opinion Study: San Francisco Airport Customer Satisfaction Survey

The San Francisco International Airport conducts an annual customer satisfaction survey to assess travelers' overall satisfaction with the airport and its amenities. The 2017 study had a sample size of n = 2831. I show the survey structure for the satisfaction question in Fig. 1.9.
1.6 Why Use Python for Survey Data Analysis?

I opted to use Python and several of its associated packages for this book. It is a great tool for distilling latent information from survey data, for several reasons. First, Python was designed to handle and manage large data sets and to do this efficiently! Most survey data sets are small compared to what is typically handled
Fig. 1.9 This chart is a simplification of the structure of the San Francisco International Airport customer satisfaction questionnaire. The Core Question is: "How satisfied are you with the SFO Airport as a whole?" The Surround Questions provide the background on the respondents (i.e., their demographics), information on travel behaviors, travel origin and destination, and so forth. They are used to support the analysis of the Core Questions
in Business Data Analytics projects, but nonetheless, using software that efficiently handles data regardless of size is a positive attribute for that software. Python also has many add-on packages that extend its power beyond its core, embedded capabilities. Most modern data analytic software uses a package paradigm to extend the power of the base program. R is certainly a prime example. SAS is another example because it uses procedures called Procs to provide specialized capabilities, but these are just packages. Python has 130k+ packages13 and growing. The general categories of analysis covered by these packages are shown in Table 1.5. There is a wide community of package contributors who add to these categories on a regular basis, so state-of-the-art methodologies are always being introduced.
13 https://www.quora.com/How-do-I-know-how-many-packages-are-in-Python.
Table 1.5 This table lists eight Python package categories

Data visualization        | Statistical modeling
Probabilistic programming | Data manipulation
Machine learning          | Numerical calculations
Deep learning frameworks  | Natural language processing
Three key Python packages for survey data analysis that I will use are Pandas for data manipulation and management, Seaborn for scientific data visualization, and StatsModels for statistical/econometric modeling. I will assume a basic working knowledge of Python; these three packages, however, are more specialized, and there are tutorials online to help guide you through their functionalities. McKinney (2018) is an excellent book for Pandas.

In addition to the package paradigm, Python is a full-fledged programming language. It has an intuitive syntax that makes coding easy. Several language features, primarily list and dictionary comprehensions, make it easier and more efficient to write in one line what would otherwise require a multiline function. I already showed you one application of a list comprehension, and you will see more examples throughout this book. Finally, Python has a wide community of supporters aside from package contributors. The website Stackoverflow14 is invaluable for programming help.
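As a generic illustration of the one-line idea (not taken from the book's figures):

    # A multiline function ...
    def squares_of_evens(values):
        result = []
        for v in values:
            if v % 2 == 0:
                result.append(v ** 2)
        return result

    # ... versus a one-line list comprehension doing the same thing
    squares = [v ** 2 for v in range(10) if v % 2 == 0]

    # A dictionary comprehension builds key: value pairs the same way
    labels = {1: 'male', 2: 'female'}
    recoded = {code: label.title() for code, label in labels.items()}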
1.7 Why Use Jupyter for Survey Data Analysis?

Another tool very useful for data analysis in general is Jupyter. The Jupyter paradigm is the lab notebook used in the physical sciences for laboratory experiment documentation. A Jupyter notebook has cells where text and programming code are entered and executed or run. Code cells are for code and Markdown cells are for documentation and text. You will see examples throughout this book. A Jupyter notebook for this book is available.

Jupyter will handle different languages (called kernels) including Python. Other kernels include R, SAS, Fortran, Stata, and many more. I will use the Python kernel within a Jupyter notebook environment for traditional and advanced, deep survey data analysis.

I recommend that you install both Python and Jupyter on your computer using Anaconda, which is a good data science toolkit and package manager for not only Python but a number of other programs. You can download Anaconda from www.anaconda.com. Select the installation appropriate for your computer. Jupyter is installed and launched from the Anaconda Navigator page. See Fig. 1.10. The installation will automatically install Python and a large number of typical Python packages. Other packages can be installed using the Anaconda Navigator's
14 https://stackoverflow.com.
Fig. 1.10 This is the main Anaconda Navigator page. You can install and launch Jupyter from here
Fig. 1.11 This is the Anaconda Environment page where you can manage Python packages. Packages are installed using the conda package manager, which is run in the background from this page. Select “Not Installed” from the drop-down menu option, and find the package you need. Conda will automatically install it
Environment page, which itself uses the package management tool named conda. See Fig. 1.11. Once you launch Jupyter, a dashboard will appear. Navigate to your notebook directory and select your notebook. See Fig. 1.12.
Fig. 1.12 This is the Jupyter Dashboard. Just navigate to your notebook file. The file extension is .ipynb
Chapter 2
First Step: Working with Survey Data
Contents

2.1 Best Practices: First Steps to Analysis
    2.1.1 Installing and Importing Python Packages
    2.1.2 Organizing Routinely Used Packages, Functions, and Formats
    2.1.3 Defining Data Paths and File Names
    2.1.4 Defining Your Functions and Formatting Statements
    2.1.5 Documenting Your Data with a Dictionary
2.2 Importing Your Data with Pandas
2.3 Handling Missing Values
    2.3.1 Identifying Missing Values
    2.3.2 Reporting Missing Values
    2.3.3 Reasons for Missing Values
    2.3.4 Dealing with Missing Values
2.4 Handling Special Types of Survey Data
    2.4.1 CATA Questions
    2.4.2 Categorical Questions
2.5 Creating New Variables, Binning, and Rescaling
    2.5.1 Creating Summary Variables
    2.5.2 Rescaling
    2.5.3 Other Forms of Preprocessing
2.6 Knowing the Structure of the Data Using Simple Statistics
    2.6.1 Descriptive Statistics and DataFrame Checks
    2.6.2 Obtaining Value Counts
    2.6.3 Styling Your DataFrame Display
2.7 Weight Calculations
    2.7.1 Complex Weight Calculation: Raking
    2.7.2 Types of Weights
2.8 Querying Data
You cannot do basic survey data analysis, or any type of data analysis whether it be for surveys or not, without understanding the structure of your data. For surveys, this means at least understanding the background of your respondents: their gender, age, education, and so forth. This amounts to understanding respondents'
profiles. Examples include age distribution, gender distribution, income distribution, political party affiliation distribution, and residency distribution, to mention just a few. Profiles provide a perspective on how your respondents answer the main survey questions; different groups answer differently. But your data have to be organized to allow you to do this. In this chapter, you will gain a perspective on how to organize your data to prepare to look at the basic distributions of your respondents. You will then begin to look at your data in the next chapter.
2.1 Best Practices: First Steps to Analysis

Before you do any analysis, you have to complete four tasks in Python, usually in this order. You should do these four upfront, in one location in your Jupyter notebook, as Best Practice. Once these four are completed, you can more readily find key components you may need without having to search your notebook, which could be challenging and needlessly time-consuming if you have a large notebook. Best Practices consist of:

1. Installing and importing the Python packages you will use throughout your analysis
2. Defining data paths and data file names
3. Defining or loading all the functions and formatting statements you typically use for your analyses
4. Creating a data dictionary to document your data

The Jupyter ecosystem is an ideal way to manage, control, and execute your analysis. My Best Practice recommendation, therefore, is to place package and data importing in a code cell at the beginning of your Jupyter notebook so that relevant commands and functionality are in one place. This minimizes, if not eliminates, the time-consuming task of hunting for an import statement to check if you have what you need. Then follow these with the functions and formatting statements and the data dictionary. I will discuss these steps in the following subsections.
2.1.1 Installing and Importing Python Packages

One of the strengths of Python is its package paradigm.1 There is a core set of Python packages with functionalities that allow you to write programming code; manipulate data (which include string or text data), dates, and times (basically calendrical operations); interact with the operating system; and much more. These are basic or native functionalities.
1 The packages are sometimes referred to as libraries and modules. I will use packages.
Table 2.1 This is a partial list of Python packages I will use in this book

Package     | Functionality            | Alias
Pandas      | Data management          | pd
Numpy       | Computations             | np
Seaborn     | Data visualization       | sns
Matplotlib  | Data visualization       | plt
Statsmodels | Modeling and statistics  | sm
Pyreadstat  | Import SPSS data files   | ps
Packages beyond the core set extend Python's capabilities. These are user written and maintained by a wide community of users and developers; the users can be individuals or organizations. I list several key packages in Table 2.1 that I will use, illustrate, or mention in this book. Many of these packages are tightly interconnected, relying on or using components of, or output from, others. For example, Pandas, listed in Table 2.1, allows you to produce, manipulate, and manage tables of data. A table is called a DataFrame, which I will discuss later. Seaborn and statsmodels, also listed in Table 2.1, read these DataFrames. Seaborn also uses Matplotlib as a base.

Before you can use any package, it must be installed on your local computer or network. Installing a package on your local computer and importing it into Python are two different operations. Right now, I am concerned with installing a package; I will discuss importing a package shortly. Core functionality is installed when you first install Python, so there is nothing more for you to do. As an example, the built-in str type handles string manipulations. It is part of the Python environment and is immediately available to you; you do not have to install or import anything before you use it. Noncore packages, however, must be installed on your local computer and then loaded or imported into your session. You install a package once on your local computer (although you should always install updates), but you import it each time you run a Python session.

There are two ways to install a package. You can use the Package Installer for Python (pip) interface or the conda interface. Both are package managers that are used from the command prompt window by typing an appropriate statement on the command line.2 As an example, a package I will use is called sidetable, which produces nicely formatted tables. It can be installed using pip install sidetable or conda install -c conda-forge sidetable at the command line. A complete listing of available packages can be found at https://pypi.org/.

Once a package is installed, you can import it into your analysis session. Importing is easy. The simplest way is to type import packageName as alias in a Jupyter code cell and then run, or execute, the cell. The package should then be available for your use. I will mention the aliases below. As an example, you type import pandas as pd to import Pandas.
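As a quick, generic illustration (the package and alias choices simply follow Table 2.1; the install commands are typed at the command line, not inside Python):

    # Installed once from the command line (not inside Python), for example:
    #   pip install sidetable
    #   conda install -c conda-forge sidetable

    # Imported in each session, typically with a conventional alias
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'x': np.arange(5)})   # the alias 'pd' routes the call to Pandas
    print(df)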
2 Conda can be run from the Anaconda Navigator. See https://www.anaconda.com/.
Sometimes, and more often than not, a package contains sub-modules with their own stand-alone functionalities. A very large package might comprise a number of sub-modules. When you import a package, you import all the sub-modules, but you may need only one. Importing all of them occupies computer memory, which is costly because memory is finite in size. To import just a sub-module, type from packageName import subModuleName as alias in a code cell. For example, there is a package I will use called itertools that provides iteration functionality. I will use this in Chap. 4. One of its sub-modules allows you to calculate the number of combinations. You import just this sub-module as from itertools import combinations as combos.

Each package contains a set of functions, usually with a set of parameters that define what each function can do and how. You access, or call, a function and pass it arguments, or instructions for the parameters, to perform the operations you need. For example, to create a crosstab, you use the crosstab function in Pandas with parameters for the variables to use in the calculations. You can call this function as pd.crosstab( X, Y ), where X and Y are the arguments for the parameters defining the rows and columns, respectively, of the crosstab. Hunt (2020, p. 124) makes a distinction between parameters and arguments that I will try to adhere to in this book. He notes that parameters are defined as part of the function definition and are used "to make data available within the function itself." The parameters are defined in the function header either by position or by name, in which case they have a default value. Arguments, on the other hand, are "the actual value or data passed into the function when it is called." This data will be assigned to the function's parameters and actually used inside the function. If a parameter has a default, then an argument is not needed for it. The difference between a parameter and an argument is a fine distinction, but one necessary to make nonetheless.

The Python interpreter has to know where a function resides, so in addition to using or calling the function by name, you also have to specify the package by name.3 This is best done using an alias for the package name and chaining that alias to the function name using dot notation. I show some standard aliases in Table 2.1. For example, you call the crosstab function in Pandas with pd.crosstab( arguments ), as I did above. I will describe the relevant arguments for this function below.

There are two types of "functions." One is a function per se. It takes a set of arguments and returns something, such as a report or a set of numbers, perhaps as an array or table. Functions are stand-alone, meaning they are not attached to any objects. They are either user defined, by you or someone else, and collected in packages, or predefined in the overall Python package and so native to Python. The second type of function is a method, which is like a function in that it takes arguments and returns something, but it is attached to an object when that object is created. It exists only for that object and only as long as that object exists. An object is a DataFrame, table, array, graph, and so forth. When the object is deleted or the Python session ends, the method ends because the object ends, unless the object is saved.
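A small sketch of these ideas follows; the DataFrame and its columns are made up for illustration.

    import pandas as pd
    from itertools import combinations as combos   # import only the sub-module needed

    # A hypothetical two-column survey extract
    df = pd.DataFrame({
        'gender': ['Male', 'Female', 'Female', 'Male'],
        'buyer':  ['Yes', 'No', 'Yes', 'Yes']
    })

    # crosstab is a Pandas function: the two arguments passed here fill its
    # index and columns parameters
    tab = pd.crosstab(df['gender'], df['buyer'])
    print(tab)

    # combinations takes an iterable and a size argument
    print(list(combos(['A', 'B', 'C'], 2)))   # [('A', 'B'), ('A', 'C'), ('B', 'C')]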
3 There is a way around this, but I do not recommend it: always specify the package name. The reason is simple: there may be two or more functions with the same name.
[Fig. 2.1 diagram: enhanced functionality comes from functions, which may be user defined, packaged, or predefined/native (e.g., footer( x ), sns.distplot( df.X ), print( 'Hello' )), and from methods, which are automatically connected to an object and chained to it (e.g., df.head( ))]
Fig. 2.1 This illustrates the connection between functions and methods for enhanced functionality in Python
Also, a method cannot be used with other objects. I will try to use the word "method" when referring to entities specific to an object and the word "function" for stand-alone entities. A function call usually looks like pd.functionName( ), where pd is the Pandas alias, whereas a method call usually looks like df.methodName( ), where df is the name of a Pandas object. As two examples, pd.crosstab( x, y ) is a function call and df.head( ) is a method call, where df is the name of a DataFrame object. I will not discuss how to create methods in this book, but I will show you how to create functions. I show the connection between functions and methods in Fig. 2.1.

Packages have to be imported or loaded before they can be used. I consider the importing of packages upfront, all in one place, as Best Practice. Even if you use a package later in your analysis, I recommend that you place its import statement upfront with all the others so that they are all together. I also recommend as Best Practice that the packages be organized by functionality to make it easier to identify which one does what in your notebook. I illustrate package importing and organization in Fig. 2.2.
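Fig. 2.2 is not reproduced here, but a first cell organized in that spirit might look like the following, assuming the packages in Table 2.1 are installed:

    # Data management
    import pandas as pd
    import numpy as np

    # Data visualization
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Statistical modeling
    import statsmodels.api as sm

    # Survey (SPSS) data import
    import pyreadstat as ps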
2.1.2 Organizing Routinely Used Packages, Functions, and Formats

You could place the loading statements in a script file if you use the same package loading each time you analyze a new survey. A script file is a plain text file. Text files usually have the extension .txt, but in this case, it is .py. For example,
Fig. 2.2 A recommended way to organize package import statements. This should be upfront in your Jupyter notebook with all packages loaded in one location as shown here
Fig. 2.3 This code snippet illustrates how to use the %run magic to import the packages shown in Fig. 2.2, which are in a script file named packageImport.py in the folder named scriptLib
the package import statements in Fig. 2.2 could be put into a script file named packageImport.py. You would then load and run the script file in a Jupyter code cell using %run "c:/scriptlib/packageImport.py".4 The %run command is called a magic in Jupyter terminology. Another magic is %load, which loads, but does not run, the script file. This is convenient if you want to modify part of your script file while leaving the master file untouched. Use this magic command as %load "c:/scriptlib/packageImport.py". There is a list of available magics; see https://ipython.readthedocs.io/en/stable/interactive/magics.html for information about magic commands. I illustrate its use in Fig. 2.3.
4 Notice the percent sign before the run command and the use of double quotes. From the IPython documentation: on Windows systems, the use of single quotes (') when specifying a file is not supported; use double quotes (").
2.1.3 Defining Data Paths and File Names

Where you store your survey data on your local computer is a personal decision, your personal preference. Regardless of where they are stored, there will be one or more data files identified by a directory path to a data directory, or data folder, and a file name. A directory (folder), in general, is just a logical place to store files, usually files that have something in common. A data directory is where you store your data. The directory, of course, has a name.

The directory path is the location of a directory relative to a central, home, or root directory. On a Windows computer, the root directory is designated as "c:/"; on a Mac or Linux computer, it is "/". The path is a succession of subdirectories from the root. For example, if "Data" is the data directory on a Windows computer, then the path is "c:/data/". If this data directory is a subdirectory of a project directory, then the path is "c:/project/data/". The path just shows you how to maneuver through your hierarchical file structure.

If the data are stored in a data subdirectory, called Data, in a project directory, called Project, and your Jupyter notebook for analysis is stored in another subdirectory, called Notebook, under the same project directory, then the path from the notebook to the data file is written as "../Data/", where the two leading dots say to move up one level in the directory hierarchy and then move to the Data directory. The statement "../../Data/" says to move up two levels and then down to the Data directory. I show a directory structure in Fig. 2.4 and how to use it in Fig. 2.5.
Fig. 2.4 This illustrates a (fictitious) directory structure. The “Project” directory contains all the information and folders for a project such as data, notebook, questionnaire, and report. It is assumed that the data file is a CSV file
Fig. 2.5 This is an example of how to specify a path and import a data file into a Pandas DataFrame. The DataFrame is named df, which is a conventional name
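Fig. 2.5 is not reproduced here; a minimal sketch of the same idea follows, with a hypothetical file name that you would replace with your own:

    import pandas as pd

    path = '../Data/'              # relative path from the Notebook folder to the Data folder
    file = 'yogurt_survey.csv'     # hypothetical file name
    df = pd.read_csv(path + file)  # concatenate the two names
    df.head()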
Notice that I used forward slashes in the path definitions. A backslash (“\”) may confuse the Python interpreter because a backslash indicates an escape sequence or character. For example, “\n” is an escape sequence telling the interpreter to cause a line break, whereas “\” says to continue on the next line. You will see examples of both in this book.
2.1.4 Defining Your Functions and Formatting Statements

As I mentioned above, I consider it Best Practice to place all your own user-written functions and formatting statements upfront in a notebook so that you know exactly where they are defined. This minimizes the searching you have to do to find a function or format.
2.1.5 Documenting Your Data with a Dictionary

You should always document your data in a data dictionary. This should not be confused with Python's dictionary data object, which is a way to organize data. The content of a data dictionary is called metadata, which are data about the data. A Best
Table 2.2 This is a simple data dictionary structure for the VA questionnaire

Variable                | Measure | Values                                                   | Mnemonic
Military service branch | Nominal | Army, Navy, Air Force, Marine Corps, Coast Guard, Other  | Branch
Year born               | Integer | Year as YYYY                                             | YOB
Age                     | Integer | Calculated as 2010 - YOB                                 | age
Practice is to always organize the metadata into a dictionary so you can quickly look up a variable and understand its meaning and measurement. A data dictionary can be complex or simple. If you were working with data from a data warehouse, a data dictionary is an absolute necessity because of the volume of variables and factors in the data warehouse; that data dictionary is large and complex. If the database is small, the data dictionary is small and simple. For a survey, especially a large and complex one, a data dictionary serves the same purpose as one for a data warehouse, but the structure and content would not be as extensive. A simple, perhaps minimal, structure consists of four columns:

1. Variable (column) name
2. Measurement units
3. Value labels
4. Variable label
The variable label is a mnemonic that can be used in statistical modeling and data visualization. I provide a possible layout using a VA question in Table 2.2.
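One simple way to keep such a dictionary at hand is to hold it in its own DataFrame; the rows below just mirror Table 2.2 and are illustrative:

    import pandas as pd

    data_dictionary = pd.DataFrame({
        'variable': ['Military service branch', 'Year born', 'Age'],
        'measure':  ['Nominal', 'Integer', 'Integer'],
        'values':   ['Army, Navy, Air Force, Marine Corps, Coast Guard, Other',
                     'Year as YYYY', 'Calculated as 2010 - YOB'],
        'mnemonic': ['Branch', 'YOB', 'age']
    })

    # Look up one variable's documentation by its mnemonic
    data_dictionary[data_dictionary['mnemonic'] == 'YOB']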
2.2 Importing Your Data with Pandas

I will use Pandas as the main Python data management tool in this book. It has a wide array of capabilities, not all of which are applicable or useful for survey data analysis, but it is nonetheless comprehensive enough that it will meet all your data management needs and requirements. I will introduce and illustrate its capabilities and functionalities throughout succeeding chapters. For now, I want to emphasize importing data. For an overview of Pandas, see McKinney (2018).

Pandas Read Functions

You import data using a Pandas read_X function, where "X" is a placeholder for the format of the data file you want to import. Pandas handles many data formats, but three are applicable for most survey data files, especially those produced by online survey tools. These formats are:

• Comma Separated Value (extension: csv)
• Excel (extension: xlsx or xls)
• SPSS (extension: sav)
SPSS is a statistics software product primarily targeted to the market research industry, where it has a strong presence. Consequently, many online survey vendors provide data in its format, which is simply referred to as the SPSS format. I will discuss each of these formats in the following subsections.

CSV Formatted Data

The Comma Separated Value (CSV) format is the most popular in data analytics in general because of its simplicity. Every statistical software package will read and write CSV formatted data.5 As the name suggests, the data values are separated by commas, which makes reading and writing data very easy. There is one record per observation and each observation is in one record. This means a record could be very long. Missing values may be represented by a missing value code, which I will discuss below, or by two successive commas. Text or character strings are possible values, but these are handled slightly differently. They are separated by commas from neighboring values in a data record, but they may also be enclosed in quotation marks to make sure they are interpreted as text. Double quotation marks are usually used to begin and end the text string, although problems occur when a quote appears within the string. The use of quotation marks is not universal. The implication is that CSV files are not infallible; there might be errors depending on how text is handled.

The first record of the file is usually, but not always, the header record that contains the column names. Pandas will, by default, treat the contents of the first record as column names and the second record as the start of the data. You can tell it that there is no header and that the first record is the start of the data. In most instances, the first record does contain the column names, so the default is sufficient. However, those names may not be what you would like to use, so you do have the option to change them either when you initially import the data or afterward. You can also provide names when you import the data if a header is missing. Finally, you have options regarding which records to read, which columns to read, how dates should be handled, and many more. See the Pandas online documentation for details. In this book, I will assume you will read all the records and columns.

When you import your data, you must place them somewhere. Pandas (actually, the Python interpreter) assigns them to a location in memory. In fact, all "objects" you create in Python, Pandas, and the other packages you will use are stored in a memory location, whether the objects are data values or strings. A name must be assigned to that location, not to the object, so you can access the object; otherwise, the object is worthless to you because you will not know where it is in computer memory. In Python terminology, the name you assign is bound to that memory
5 I have not done an exhaustive check of all available statistical software packages, but based on my experience with a large number of them, especially all the major ones, I believe this statement is correct.
Fig. 2.6 This illustrates how to read or import CSV data into a Pandas DataFrame. The DataFrame is named df, a standard name in Pandas programming, in this example. The statement df.head( ) is explained in the text
location. The notion of binding a name to a memory location is an important topic. See Sedgewick et al. (2016, pp. 14–21) for a good discussion.

I often give the path definition a name such as "path" and the file to import a name such as "file." I put the path in a separate code cell upfront in the notebook, perhaps immediately after the package loading, function definition, and formatting cells. I also consider this to be Best Practice because it makes the code cleaner and easier to read. I only have to define the path once, and then I can reuse it later as needed in the notebook. I also put the file definition in a cell with the import function. I then concatenate the two names as path + file in the pd.read_csv function. I view this as more readable. I illustrate in Fig. 2.6 how a simple CSV file is imported using this concatenation.

The imported DataFrame is simply named, in most instances, df, short for DataFrame. This name is conventional in Python/Pandas programming, but you can, of course, use any name. A name must begin with a letter or the underscore character, and it cannot begin with a number. You should avoid beginning a name with an underscore even though its use is legal. Names are case-sensitive: for example, "df," "DF," "dF," and "Df" designate four different variables. Hunt (2020) recommends, as Best Practice, that you put all names in lowercase, be as descriptive as possible without being too long and cumbersome (e.g., "x" is not always a good name), and separate words in a name by underscores to aid readability (e.g., "df_agg").

A DataFrame is a rectangular array of data, with each row being a single observation and each column being a single variable or feature. This is sometimes referred to as a tidy data set. The general rules are:
• A single observation in a row
• An observation is not split across rows
• A single variable in a column
• A variable is not split across columns.
Fig. 2.7 This illustrates how to read or import an Excel worksheet into a Pandas DataFrame. The DataFrame is named df as before. The data path was already defined as shown in Fig. 2.6. The worksheet to import is named “Data,” but it happens to be the first and only sheet in the notebook. Consequently, it is imported by default
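A minimal sketch of the Excel import shown in Fig. 2.7 follows; the workbook name is hypothetical, and "Data" is the worksheet to import:

    import pandas as pd

    path = '../Data/'
    file = 'yogurt_survey.xlsx'                           # hypothetical workbook name
    df = pd.read_excel(path + file, sheet_name='Data')    # 'Data' is the worksheet to import
    df.head()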
For survey data, a single row is the set of responses for a single respondent. This row is sometimes referred to as a respondent's response pattern. I typically make the first column a unique respondent identifier named, for example, RID for "respondent ID." Most online survey vendors provide an ID automatically.

Notice the last line in Fig. 2.6: df.head( ). This has three parts. The first, df, is the name of the DataFrame. The second part is a dot, which chains or links the name of the object (df in this case) to a method associated with that object. The method is the third part: head( ). This method merely prints the first five records (the "head") of the DataFrame df. The default is to print five records; if a number is put inside the parentheses, then that number of records will be printed. For example, you could use df.head( 10 ) or df.head( n = 10 ) to print 10 records from the DataFrame df. If there is a head, there must be a tail: the command df.tail( ) displays, by default, the last five records of the DataFrame df.

Excel Formatted Data

Excel files, or workbooks, are imported the same way as CSV files, except you have to specify the worksheet in the workbook. The default is the first worksheet. You specify the worksheet using a character string. Assume that your workbook has a worksheet named "Data." Then you specify it using this character string. I illustrate how a simple Excel file is read in Fig. 2.7.

SPSS Formatted Data

Many (if not most) consumer surveys are now conducted online. These systems typically output collected data in the third format: SPSS. An SPSS formatted data file is a rectangular array of data much like an Excel or CSV formatted file. The file, however, usually has metadata attached to it, which is a data dictionary. In this context, the metadata include the questionnaire question applicable to a variable, the levels for that variable, an indicator of whether the variable is nominal/ordinal/continuous, and so forth. A complete list is:

1. Variable name
2. Variable type (numeric)
3. Column width
4. Decimal places
5. Label (i.e., the question)
6. Value labels (a mapping of codes to labels)
7. Number of missing values
8. Columns
9. Alignment (left, center, right)
10. Measure (nominal, ordinal, scale)
11. Role (input, output, and a few others)
As an example, if the question is “What is your gender?,” then the data column in the SPSS file is “gender” and the metadata includes the question itself, the levels (e.g., “1 = male, 2 = female”), an indicator that this is a nominal variable, and several others. Pandas has a function to import an SPSS formatted data file. I show an example in Fig. 2.8. This Pandas function has a drawback: It does not import the metadata, assuming metadata are available. You can overcome this inability by creating your own metadata functionality using the Python dictionary data structure.6 An alternative is to use the package pyreadstat to read the SPSS formatted data file with the associated metadata. I provide an example of its use shown in Fig. 2.9. You can install pyreadstat on your local computer using pip install pyreadstat or conda install -c conda-forge pyreadstat. You import the package using import pyreadstat as ps.
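As a hedged sketch of the two import routes just described (the .sav file name is a hypothetical placeholder):

```python
import pandas as pd
import pyreadstat as ps

# Pandas route: returns only the data, not the metadata
df = pd.read_spss( path + 'vet_survey.sav' )

# pyreadstat route: returns the data and the metadata, in that order
df, meta = ps.read_sav( path + 'vet_survey.sav' )
```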
Fig. 2.8 This illustrates how to read or import an SPSS .sav file into a Pandas DataFrame. The DataFrame is named df as before. The data path is redefined since a different data set is imported
Fig. 2.9 This illustrates how to read or import an SPSS file into a Pandas DataFrame using pyreadstat, which has an alias ps. Two objects are returned: data and meta, in that order
6 The word “dictionary” will be used several times in the next discussions, so it will be an overworked word. The correct usage has to be inferred from context.
Table 2.3 This is just a partial listing of attributes returned by pyreadstat

Object | Meta attribute
Variable (column) name | column_names
Variable question | column_labels
Value labels | value_labels
Variable and value labels | variable_value_labels
Fig. 2.10 This is the question used in the pyreadstat examples, Figs. 2.11 and 2.12
Fig. 2.11 The survey question for the column label LIFE is retrieved and displayed
A call to pyreadstat returns two objects: the data and the metadata, in that order. The data are obvious. As a convenience, they are automatically formatted as a Pandas DataFrame. The metadata has all the items listed above, but the three important ones are the variable name, label, and value labels. As a convenience, the column names and their associated value labels are bundled in one attribute called variable_value_labels. Table 2.3 shows a listing of the important attributes. I show examples of their use in Figs. 2.11 and 2.12 and the question for these two examples in Fig. 2.10.
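Following the pattern in Figs. 2.11 and 2.12, the question wording and the response options for a column such as LIFE can be pulled from the meta object. This is a sketch, assuming the meta object returned by pyreadstat above:

```python
idx = meta.column_names.index( 'LIFE' )   # position of the LIFE column
meta.column_labels[ idx ]                 # the survey question for LIFE
meta.variable_value_labels[ 'LIFE' ]      # mapping of numeric codes to option labels
```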
2.3 Handling Missing Values Missing values are a headache for survey analysts and, in fact, for every data analyst whether for surveys or not. No data set is 100% complete and perfect. There are always holes in the data, that is, missing values, that either have to be filled with legitimate values, if possible and appropriate, or have the entire record with the missing value deleted. If you have too many missing values, then your study will be jeopardized since you will have insufficient data to make adequate and acceptable conclusions and recommendations.
Fig. 2.12 The survey question options for the column label LIFE are retrieved and displayed
So it is very important to

1. Identify missing values
2. Understand the reason(s) for missing values
3. Handle the missing values
2.3.1 Identifying Missing Values I noted above that Pandas' data reading functions allow you to identify and appropriately code missing values so you can identify them in displays. Pandas has a number of predefined codes, but other special codes are always possible. For example, it is not uncommon for a csv file for a survey to contain numeric symbols such as 99, 999, 9999, 99999, and so forth to represent missing values. The SFO questionnaire, for example, uses 0 for "Blank/Multiple responses." This coding, incidentally, applies to question options such as "Don't Know" and "Refuse," which can be interpreted as missing values since they are not usable responses to a question. The Pandas pd.read_csv function has an argument, na_values, that allows you to identify nonconventional codes. It takes a scalar, string, list-like array, or a dictionary definition of missing value codes in your csv data file so that the function can label them as NaN or missing. A data dictionary will describe the codes for the survey.
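A sketch of the na_values argument, with hypothetical missing-value codes:

```python
# Treat the survey's special codes as missing (NaN) at import time
df = pd.read_csv( path + file, na_values = [ 99, 999, 9999, 0 ] )
```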
2.3.2 Reporting Missing Values Identifying specially coded missing values when you import your data is one task, what to do with them once imported is another, and separate, task. It is actually
several tasks. One is to count the number of missing values for each variable. You can use the Pandas info( ) method, the isna( ) method, or some special reports. I provide examples throughout this book. Regardless, you just count the number of missing values and report the count of missing as well as, perhaps, the percent missing. The isna( ) method returns True or False for each observation of a variable (True is “Missing” and False is “Not Missing”), and you could easily count the number True and False. This is all merely assessing or reporting the degree of missingness. Once you know the degree of missingness, then what? What you do depends on why you have missing values.
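One possible report of the degree of missingness is sketched below (df is the imported DataFrame):

```python
n_missing = df.isna().sum()             # count of missing values per variable
pct_missing = 100 * df.isna().mean()    # percent missing per variable
missing_report = pd.DataFrame( { 'n_missing': n_missing, 'pct_missing': pct_missing } )
```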
2.3.3 Reasons for Missing Values The action you take to handle missing values depends on the reason for the missingness. I usually discuss two broad reasons for survey data: 1. Structural reasons 2. Respondent reasons Most surveys are now conducted online with sophisticated software that controls the physical presentation of questions (i.e., their look and feel) and the order of their presentation. By order, I mean the question sequence following a logical pattern. Some questions in a questionnaire are dependent on responses to previous questions using logical if-statements. Those later questions, therefore, may not be shown to a particular respondent. For example, in a survey of consumers’ soup preference, an early question, perhaps a Surround Question, could ask if the respondent is vegan or vegetarian. If they respond “Yes,” then later questions should not and, if programmed correctly, will not ask them if they prefer a soup with sausage, chicken, or beef. The response to the meat question will never appear, and so the data file will have missing value codes for the vegans and vegetarians. There is nothing to do with these missing values except, perhaps, report them. I refer to these missing values as structural missings. What about nonstructural missing values? These will occur for one of two reasons. If the survey was done online, then either the questionnaire was poorly designed or the programming was incorrect. Discovering missing values after the survey is complete is too late; there is nothing you could do about it. If the survey was done by the phone or paper questionnaire, then the respondent just did not answer the question(s). If you use a phone survey, you may be able to prompt or encourage a respondent to provide an answer. In the worst-case scenario, you will have missing values. If the survey was a paper survey, then there is also not much you could do. Sometimes, there is a follow-up with the respondents provided their names are known, but this is probably unlikely since almost all surveys, regardless of administration method, are anonymous. The bottom line is that non-online surveys will most likely have missing values.
There is an extensive literature on the types of missing data. See Enders (2010) for a discussion of these reasons. Also see Paczkowski (2022). I will just assume you have nonstructural missing values. The next question is what to do about them.
2.3.4 Dealing with Missing Values You have several options for handling nonstructural missing data. The most obvious is that you could drop the records with any missing values in any variable. This is drastic if you have a limited data set (i.e., small sample size), which will likely be the case with survey studies. You can use the Pandas dropna method for this. Less drastic options are available.
2.3.4.1 Use the fillna( ) Method
A less drastic option is to replace missing value codes with an appropriate value. The question, of course, is the value to use. For one approach, you can use the fillna method. This has several arguments. One is just a scalar value to replace the missing indicator. For example, you could use fillna( 0 ) to replace missing values with a zero. You could also calculate a value, say, a mean for all non-missing values of a variable and use that mean. For this, you could use fillna( df.mean( )). Any other statistic such as the median or mode will work. There is a caveat however. If you use df.fillna( 0 ), then missing values in all variables will be replaced by a zero. If you want to replace missing values for a particular variable, use df.X.fillna( 0 ) where X is a variable name. You can repeat the last value before a missing value using the method = ‘pad’ argument for fillna. For example, df.X.fillna( method = ‘pad’ ) will repeat the last non-missing value before each missing value. If there are several missing values in sequence, then you can limit the number of repeats using another argument: limit = . For example, df.X.fillna( method = ‘pad’, limit = 1 ) will repeat the last value only once; df.X.fillna( method = ‘pad’, limit = 2 ) will repeat it twice. The “pad” argument fills forward in a DataFrame. The “backfill” argument fills backward. You can also use shortcuts such as df.X.ffill( ) to fill forward and df.X.bfill( ) to fill backward. I do not recommend using the pad argument (or the ffill or bfill functions) because each record is an independent observation with no relationship with the previous record. See my comment below for the interpolation method.
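A minimal sketch of these options for a hypothetical variable X:

```python
df_complete = df.dropna()                          # drastic: drop any record with a missing value

df[ 'X' ] = df[ 'X' ].fillna( df[ 'X' ].mean() )   # fill one variable with its mean
# Forward filling is available but not recommended for cross-sectional survey data:
# df[ 'X' ] = df[ 'X' ].fillna( method = 'pad', limit = 1 )
# df[ 'X' ] = df[ 'X' ].ffill()
```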
2.3.4.2 Use the Interpolation( ) Method
A more sophisticated method to fill missing values is to interpolate. Pandas uses a linear interpolation method. You interpolate between two values using a formula such as
$$Y(X) = Y_B + (X - X_B) \times \frac{Y_A - Y_B}{X_A - X_B} \tag{2.1}$$
where $X$ is the index position for the missing value, $X_B$ is the index before the missing value, $X_A$ is the index after the missing value, $Y_B$ is the value before the missing value, and $Y_A$ is the value after the missing value. I do not recommend this procedure because it depends on the relative position of the missing value. With a survey data set, the data are cross-sectional, which means that you could randomly shuffle the rows of the data set; the order of the rows is meaningless. Consequently, there is no meaning to "before" and "after." This method is more appropriate for time series data. Nonetheless, interpolation is available using df.interpolate( ).
2.3.4.3 An Even More Sophisticated Method
You can also estimate a regression model using the methods I will discuss later in this book. This involves estimating a model such as $Y_i = \beta_0 + \beta_1 \times X_i + \epsilon_i$ where Y is the variable with the missing value and X is an explanatory variable. The explanatory variable could be one of the Surround variables such as a demographic variable. This variable should be completely populated without any missing values, especially for the record that has a missing value for Y. Once the model is estimated, you could predict the missing value(s) by using the X variable value(s) as the predictor(s). See Chap. 5 where I discuss regression analysis and the use of the model for prediction in Sect. 5.2.5.
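A hedged sketch of this idea, using the statsmodels package for illustration (Y and X are hypothetical variable names; the book's own regression examples come in Chap. 5):

```python
import statsmodels.formula.api as smf

known = df[ df[ 'Y' ].notna() ]                   # records where Y is observed
model = smf.ols( 'Y ~ X', data = known ).fit()    # simple imputation model

missing = df[ 'Y' ].isna()
df.loc[ missing, 'Y' ] = model.predict( df[ missing ] )   # predictions fill the holes
```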
2.4 Handling Special Types of Survey Data Questionnaires usually have two special questions that capture opinions and interests. One is Check all that Apply (CATA) and the other is categorical. The latter can, and most definitely does, appear in other data sources, but they are more prevalent in surveys. I will discuss why and give some examples below. The CATA variables are definitely unique to surveys simply because they have a battery of options that all deal with one general question. I will discuss both types in the following subsections.
2.4.1 CATA Questions A common question type is Choose all that Apply (CATA). For example, the VA study asked: “What are the reasons you haven’t applied for any VA disability benefits?.” This question, and others like it, have a main or stem target question
(e.g., disability benefits) followed by a series of options the respondent can select or not regarding its fit with the target. For the VA example, the follow-up options are don't have a service-connected disability, not aware of VA service-connected disability program, don't think I'm entitled or eligible, getting military disability pay, getting disability income from another source, don't think disability is severe enough, don't know how to apply, don't want any assistance, don't need assistance, applying is too much trouble or red tape, never thought about it, and other. See Safir (2008) for a brief overview of this question format. The multiple responses for CATA questions are stored using several formats that determine how you can read and process the data. These formats are:7

Multiple responses: Items selected are recorded in separate columns.
Multiple responses by ID: Items selected are recorded in separate rows.
Multiple responses delimited: Items selected are recorded in one cell with a separating delimiter.
Indicator variable: Selections are recorded in separate columns indicated by 0/1.
Frequencies: Record counts are recorded.

I discuss these formats in the following subsections.
2.4.1.1 Multiple Responses
A common variant of a CATA question is, “Select the top n items from the following list,” where n is an integer up to and including the number of items in the list of options. For example, a specific question might be, “Please rank your top 3 TV programs with 1 being the top ranked program,” where a list of 20 TV programs is provided. Each selected program is recorded in one column in a data table with only three columns used even though there are 20 TV programs. The first column holds the top ranked program. All the responses for a respondent are in one row of the table.
2.4.1.2 Multiple Responses by ID
This is like the multiple responses except each selected item is recorded in a separate row. For the TV program example, there is one column for the selected TV programs but three rows for each respondent. The first row contains the top ranked selection, and the last row has the third ranked selection. The ID is the respondent ID with maybe a separate column with an indicator for the ranking of the selection.
7 These descriptions follow Paczkowski (2016). Used with permission of SAS.
2.4.1.3 Multiple Responses Delimited
This form was used to save storage space in computer files when hard drive space was small. This is not an issue with modern computer hardware. Nonetheless, it is still in occasional (i.e., rare) use. For this format, all the selected responses are stored in a single cell for each respondent. The selections are separated in that cell by a comma, semicolon, tab, a vertical line (called a “pipe”), or some other marker.
2.4.1.4 Indicator Variable
This is probably the most common format. There is a separate column for each option in the option list. Each selected option is coded with a 1 while each nonselected option is coded with a 0. All responses for a respondent are in one row. It is simple to import—just use the appropriate import method. I mention the use of this format in Sect. 2.5.1 and discuss the analysis of CATA data in Chap. 4.
2.4.1.5 Frequencies
Sometimes, just a frequency count for each selected option is recorded in a single column. If there are five options, there are five frequency columns each with the frequency count for that option. This is basically a crosstab arrangement of the data. I discuss crosstabs at length in Sect. 3.3.
2.4.2 Categorical Questions Relational databases, such as those for Big Data, are constructed in a way that efficiently organizes the variables in the databases to avoid needless repetition of the same data elements. For example, in a database that contains orders data for several products, each order would have a variable that indicates the product purchased. This is the name and perhaps a brief description to specifically identify the product. The constant repetition of the product name and its description consumes storage and memory space. This is made more efficient by putting the product name and description in a separate product data table with a unique key identifying each product and then repeating just the key in the orders data table. There are now two tables, and the key links them so merging or joining can be done. Efficiency is gained by the onetime entry of the product key in the orders table even though the key is repeated; less storage space is consumed this way. In addition, there is a side benefit: If the product description is changed, the change only has to be made once in the product table, not in the orders table. This structuring of the data tables is called normalization.
Fig. 2.13 The original values for the Likert variable are shown as they appear in the DataFrame. After the variable is categorized, integer codes are used rather than the original strings although visually, you will still see the strings displayed. You can access the codes as shown here. You can also access categories and check the ordering using another accessor
Pandas' categorical mimics this feature to effectively normalize a categorical variable. The categories (i.e., levels), which are usually strings, are stored once and not repeated throughout the DataFrame. The key is an integer that appears in the DataFrame and also (behind the scenes) in an array along with the category descriptions.8 This array is not visible to you although you can access it using the Pandas accessor method cat. The efficiency gained is just the onetime recording of the string rather than having it repeat. If the DataFrame is large, then there is considerable saving in terms of memory usage, but more importantly, there are considerable efficiencies for you since you just have to specify the categories as well as their order once. I illustrate the coding and the use of the accessor in Fig. 2.13.
8 The integer is an 8-bit integer, or 1 byte. This means that this is the simplest representation of the categories, which has definite memory-saving implications.
Table 2.4 This is just a small sampling of survey categorical variables and their ordered characteristic. The Pandas category data type helps to manage these to improve data efficiencies. The starred items could be quantitative but are usually categorized to increase the likelihood of respondents providing some data

Unordered (nominal) | Ordered (ordinal)
Yes/no | Income*
Segments | Age*
Marketing regions | Management level
Gender | Education
Occupations | Amount spent*
Race/ethnicity | Likert
Likert Scale and Yes/No questions are also very common. A characteristic of these categorical variables is that some are ordered and others unordered. The ordered are ordinal variables, and the unordered are nominal variables. I list a few possibilities in Table 2.4. The healthcare questions from the VA survey are examples. The vets were asked to indicate their agreement with seven statements regarding their VA health benefit use. A 5-point Likert Scale was used with a “Don’t Know” included as a sixth item. The Pandas value_counts method can be used to calculate the distribution of responses for the six points for each statement. However, the results will be ordered by the frequencies if the sort argument is True (the default) or in an unintuitive order otherwise.9 By defining this variable as a categorical variable and specifying the order of the categories, then the value_counts method will return the intuitive ordering. I illustrate this in Fig. 2.14. Figure 2.15 illustrates the results when the variable is declared to be categorical. I provide the code to change the data type for this variable and all seven statements in Fig. 2.16. Once the categories are created as a new data type using CategoricalDtype, it can be applied to all similar Likert Scale variables throughout your DataFrame. If you have to make a change in the wording of a scale item, you do this just once and then reapply to all the variables; you do not have to make a change to each variable separately. In addition, this categorical feature carries over to data visualization as well, which I will illustrate later.
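A minimal sketch of the pattern in Fig. 2.16, with hypothetical scale wording and a hypothetical column name:

```python
from pandas.api.types import CategoricalDtype

likert = [ 'Strongly Disagree', 'Disagree', 'Neither Agree nor Disagree',
           'Agree', 'Strongly Agree', "Don't Know" ]          # hypothetical scale labels
likert_type = CategoricalDtype( categories = likert, ordered = True )

df[ 'q_health1' ] = df[ 'q_health1' ].astype( likert_type )   # hypothetical Likert variable
df[ 'q_health1' ].value_counts( sort = False )                # now reported in scale order
```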
2.5 Creating New Variables, Binning, and Rescaling You may have to manipulate your survey data before you do any analysis, whether simple and shallow or complex and deep. This may involve creating new summary
9 The order when sort = False is used is based on a hash table and so is in an arbitrary order. See StackOverFlow article at https://stackoverflow.com/questions/33661295/pandas-value-countssortfalse-with-large-series-doesnt-work.
Fig. 2.14 Notice that the sort order in both of these examples is not helpful or is unintuitive
Fig. 2.15 Notice that the sort order is now helpful and intuitive
Fig. 2.16 A list of the Likert Scale items is created and then used as an argument to CategoricalDtype. This is applied to all Likert Scale variables using a single for loop
variables or standardizing variables to a new scale so that they are on the same basis. This is a general topic of preprocessing survey data. I will discuss creating new summary variables and standardizing variables to a new scale in this section. For a more detailed discussion about preprocessing data, see Paczkowski (2022).
2.5.1 Creating Summary Variables Suppose you conduct a customer satisfaction survey using a 10-point Likert Scale to measure satisfaction. Assume that "10" is Very Satisfied and "1" is Very Dissatisfied. You could use all 10 points of the scale, or you could dichotomize it into two parts: one indicating satisfaction and the other dissatisfaction. One way to do this with 10 points is to create an indicator variable and (arbitrarily) classify respondents who rated their satisfaction as an 8, 9, or 10 as "Satisfied" and all others "Dissatisfied." This is a new variable added to your data set. I illustrate this in Fig. 2.17 for the yogurt data. I use a list comprehension, which I previously introduced, to recode the satisfaction measure from 10 points to 2, which are 0 and 1. A list comprehension is an advanced Python tool that simplifies, yet accomplishes the same result as, a for loop. A for loop is a way to literally loop through a series of statements on multiple lines, each statement performing some operation. It is a very
Fig. 2.17 This illustrates the recoding of the yogurt customer satisfaction survey data. A list comprehension is used to do the recoding rather than complicated, and perhaps error-prone, ifstatements and a for loop. (a) Initial data. (b) Recoded data
Fig. 2.18 This illustrates how to calculate a new variable, age, from the YOB for veterans in the vet survey. (a) Initial data. (b) Calculated data
powerful construct, and all programming languages have one as part of their basic language structure. Unfortunately, despite their usefulness and ubiquity, for loops can also be opaque, making code difficult to read, distracting, and even prone to error. A list comprehension is a simpler way to capture the essence of a for loop but in a one-line statement as opposed to a number of statements and multiple lines. List comprehensions do not completely replace for loops, but you should use one in place of a for loop whenever possible. The list comprehension in Fig. 2.17 is [ 1 if x >= 8 else 0 for x in df.satisfaction ]. It operates on the satisfaction variable in the DataFrame df by iteratively processing each value in this variable, assigning each value to a temporary variable "x." The variable in df, in fact, any variable in a DataFrame, can be accessed as df.X. For each value, this list comprehension checks if that value is greater than or equal to 8. If this is true, then a "1" is added to the list; otherwise, a "0" is added. When the list is completed, it is assigned to a new variable in df called "t3b," which stands for "top-three box." I frequently use list comprehensions in this book. Incidentally, I illustrate the use of a for loop for another application in Fig. 2.19.

Suppose you asked survey respondents to state the year they were born instead of just asking them their age. Someone's age, like income, is often a sensitive topic, as I already mentioned, that may result in a low response to the question; asking year of birth (YOB), however, is less intrusive. This was done in the veterans survey: Veterans were asked to write the four-digit year they were born. But you need their age. You can create a new age variable by subtracting the year of birth from the year of the survey. The vet survey was conducted in 2010 so the calculation is 2010 − YOB. I illustrate this in Fig. 2.18. Notice how the age is added to the DataFrame.
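A compact sketch of the two recodings just described (the yob column name is a hypothetical stand-in for the survey's year-of-birth variable):

```python
# Top-three-box indicator from the 10-point satisfaction scale
df[ 't3b' ] = [ 1 if x >= 8 else 0 for x in df.satisfaction ]

# Age from year of birth for the 2010 vet survey
df[ 'age' ] = 2010 - df[ 'yob' ]
```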
Fig. 2.19 This illustrates how to calculate a new variable, branches, from the CATA branch question in the vet survey. (a) Initial data. (b) Calculated data
As a final example, you may want a summary measure of several variables. This might be as complex as an index number or as simple as a count. For instance, for the vet survey, respondents were asked to check the military branches in which they served. There were six options: Army, Navy, Air Force, Marine Corp., Coast Guard,
and Other.10 Each respondent answered the question by checking the appropriate branch or branches. This is a CATA-type question with an indicator data recording: 1 if the vet served in that branch; 0 if not. A simple analysis would consider the total number of military branches they served in, so a new variable is created that is simply the count of branches. But you would certainly like to know the branch. One possibility is to create another new variable that indicates which of the six branches a vet was a member of if the vet served in only one branch but indicate “Multiple” if the vet was in more than one military branch. I illustrate this in Fig. 2.19. Summary statistics such as sums and means can also be added to a DataFrame as a new variable with the statistics representing operations across a set of columns. For example, after the military branch membership is determined and added to the DataFrame, you might want to include a count of the branches for each vet. You do this by adding the indicator variable for each branch (which is either 0 or 1 for the branch) for each vet. One approach is to use the Pandas sum method on the set of columns for the branches. This method just sums the column values, row by row. You have to tell the method, however, that the summation is over the columns in a row because the default is to sum the rows for each column; you want to sum the columns for each row. In Pandas, the rows and columns of a DataFrame are referred to as axes. axis = 0 refers to the rows in a column, and axis = 1 refers to the columns in a row. The order follows standard math matrix order, which is always row-column. The default for the sum method is axis = 0. To sum the columns for each row, you use axis = 1. You can see an example in Fig. 2.19. I list several Pandas summary measures and their defaults in Table 2.5.
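A sketch of the row-wise count just described, with hypothetical names for the branch indicator columns:

```python
branches = [ 'army', 'navy', 'air_force', 'marines', 'coast_guard', 'other' ]  # hypothetical 0/1 columns
df[ 'n_branches' ] = df[ branches ].sum( axis = 1 )   # sum across the columns for each row
```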
Table 2.5 A partial list of Pandas summary measures. The default axis for each is axis = 0, which operates across rows for each column. Use axis = 1 to operate across columns for each row. For a complete list, see McKinney (2018, p. 160)

Measure | Method
Sum | sum
Mean | mean
Count | count
Standard deviation | std
Minimum | min
Maximum | max
Median | median
Percent change | pct_change

2.5.2 Rescaling In addition to creating summary variables, you may want to rescale a variable to make it more interpretable or to place several of them on the same basis for comparisons or use in modeling and other multivariate methods.
10 “Other” is defined as Public Health Service, Environmental Services Administration, National Oceanic and Atmospheric Administration, and US Merchant Marine.
Table 2.6 These are three methods for response bias standardization. This is a summary of methods listed by Fischer (2004). The $X_{ij}$ is the response measure on the jth item X by respondent i. The dot notation indicates which subscript the calculation is over. So $\bar{X}_{i.}$ indicates that the mean is calculated for respondent i over his/her responses to all items, while $\bar{X}_{..}$ is the overall or grand mean. Also, $s_{i.}$ is the standard deviation of respondent i over all items

Method | Within subject | Within item | Within group | Double
Mean only | $X_{ij} - \bar{X}_{i.}$ | $X_{ij} - \bar{X}_{.j}$ | $X_{ij} - \bar{X}_{..}$ | $Y_{ij} = X_{ij} - \bar{X}_{i.}$; then $Y_{ij} - \bar{Y}_{..}$
Scale only | $X_{ij}/s_{i.}$ | $X_{ij}/s_{.j}$ | $X_{ij}/s_{..}$ | $Y_{ij} = X_{ij}/s_{i.}$; then $Y_{ij}/s_{..}$
Mean and scale | $(X_{ij} - \bar{X}_{i.})/s_{i.}$ | $(X_{ij} - \bar{X}_{.j})/s_{.j}$ | $(X_{ij} - \bar{X}_{..})/s_{..}$ | $Y_{ij} = (X_{ij} - \bar{X}_{i.})/s_{i.}$; then $(Y_{ij} - \bar{Y}_{..})/s_{..}$
For example, hierarchical and k-Means clustering are frequently used to segment customers to increase the efficiency of marketing programs by targeting differential offers, prices, and messages to each segment. For either method, a number of variables are used to create the clusters (i.e., segments), but the cluster solution might be biased if these variables are on different scales. The bias is due to them exerting undue influence in the clustering algorithm merely because of their scales. Common examples of scale-different variables are income and years of schooling. However, scale differences could occur in other situations. For example, it is not uncommon for Likert Scale questions to be used in clustering. People tend to use different parts of the scale to answer Likert Scale questions, perhaps due to cultural and racial biases and personal motivations. For instance, Bachman and O'Malley (1984) note racial differences among students in response to education and life-oriented questions. Pagolu and Chakraborty (2011) show that there are eight ways to classify responses to a 7-point Likert Scale. Also see Liu (2015, Chapter 1) for a similar analysis. This scale use results in response biases, but there are some who question if these biases are real. See Fischer (2004), for example, on some comments regarding this issue of biases. Nonetheless, survey analysts sometimes rescale Likert Scale questions to adjust for this bias. Fischer (2004) discusses various methods at length that I summarize in Table 2.6.

Following the Fischer (2004) taxonomy, there are three adjustment methods: mean only, scale only, and both mean and scale. These are conventional ways to standardize a random variable in a survey. Fischer (2004) also lists an adjustment method using covariates, but it is more complicated. See Fischer (2004) for details. These three methods can be applied in four ways to a data matrix:

1. Within-subject: All item responses for a single respondent are scaled using statistics for that respondent. The items might be the complete set of items in a Likert Scale battery of questions or a subset.
2. Within-item: All respondents are scaled for a single item using statistics for that item.
3. Within-group: All respondents and all items are scaled by the statistics for a group. The groups might be all females and all males or all Democrats and all Republicans.
4. Double: All items for each respondent are scaled individually and then scaled again by the grand mean (maybe by group).

The within-subject standardization is useful if you want to understand how individual respondents "endorse" or weight each item in the set. The mean endorsement is zero when the scores are mean adjusted. Unfortunately, as noted by Fischer (2004) and Hicks (1970), and others, you cannot compare the distribution of standardized responses of one respondent to any others because each individual bases his/her responses on his/her own background, biases, prejudices, and beliefs. The within-item standardization is consistent with typical statistical preprocessing of data and is more applicable here than the within-subject because the effects of individual perspectives are not impactful. The mean score within each item is zero when the scores are mean adjusted. The within-group is useful when groups or subsets of respondents need to be analyzed separately. This is the case when there are clear cultural, ethnic, and gender differences you must account for. Finally, the double standardization is claimed to be useful, but I have found little application of this. See Pagolu and Chakraborty (2011), however, for discussion and applications of this approach. My recommendation is to use the within-item standardization because you can compare items, which is more often than not what you want to do.
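A minimal sketch of the within-item, mean-and-scale standardization for a hypothetical set of Likert items:

```python
items = [ 'q1', 'q2', 'q3', 'q4' ]   # hypothetical Likert item columns
# Column (item) means and standard deviations; Pandas operates on axis = 0 by default
df_std = ( df[ items ] - df[ items ].mean() ) / df[ items ].std()
```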
2.5.3 Other Forms of Preprocessing There are other forms of data preprocessing that you should consider when analyzing survey data. These vary by type of measure: continuous or discrete. These include:

Continuous measures:
• Normalize to a value lying between 0 and 1 (inclusive)
• Normalize to a value lying between 0 and 1 (inclusive) but summing to 1.0
• Convert to an odds ratio
• Bin

Discrete measures:
• Encode: dummy or one-hot encoding

Normalizing to a value lying between 0 and 1 (inclusive) is based on the range of a continuous response. A continuous response is a decimal-based number even though the decimal may not be evident. For example, you could ask physicians in a pharmaceutical study how many patients they see on average each week for a particular illness. The phrase "on average" implies a decimal number such as 25.5,
but you could ask them to provide a whole number. Or you could ask the physicians what percent of their patients by age cohort receive a vaccine injection each year. This would also be a continuous number. This last example is already normalized to lie between 0 and 1 (or 0 and 100 if percents are recorded), and these values for each respondent must sum to 1 (or 100). The first example of a count, however, is not normalized. It might be better for some analyses to normalize the data. The normalization is done using

$$X_i^{New} = \frac{X_i - Min(X)}{\text{Range of } X} \tag{2.2}$$
where $X_i^{New}$ is the new, rescaled value of $X_i$, $Min(X)$ is the minimum value of the measure $X$, and $\text{Range of } X = Max(X) - Min(X)$. This is a linear transformation such that the new values lie in the closed interval [0, 1]. The linearity means the relationship among the original values of the measure is preserved. The new values could be multiplied by 100 so that they are now in the closed interval [0, 100] so the values could be interpreted as an index. Equation (2.2) can be modified to produce a new minimum or a new maximum so you are not restricted to the closed interval [0, 1]. The modification is

$$X_i^{New} = \frac{X_i - X_{Min}}{X_{Max} - X_{Min}} \times (X_{Max}^{New} - X_{Min}^{New}) + X_{Min}^{New} \tag{2.3}$$
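A sketch of the rescaling in (2.2) for a hypothetical continuous variable:

```python
x = df[ 'patients_per_week' ]                                      # hypothetical continuous measure
df[ 'patients_norm' ] = ( x - x.min() ) / ( x.max() - x.min() )    # values now lie in [0, 1]
```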
See Paczkowski (2022) for discussion and examples of both (2.2) and (2.3). There may be times when you want to rescale data to lie between 0 and 1 but to sum to 1.0. For example, you may have estimated utilities from a conjoint study in a market research survey and you want to interpret these as shares (e.g., market share, share of wallet, share of preference). You can use

$$X_i^{New} = \frac{e^{X_i}}{\sum_{j=1}^{n} e^{X_j}}, \quad i = 1, \ldots, n \tag{2.4}$$

$$\sum_{i=1}^{n} X_i^{New} = 1 \tag{2.5}$$
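A sketch of (2.4) and (2.5) for a hypothetical column of estimated utilities:

```python
import numpy as np

u = df[ 'utility' ]                               # hypothetical conjoint utilities
df[ 'share' ] = np.exp( u ) / np.exp( u ).sum()   # Eq. (2.4); the shares sum to 1.0 as in (2.5)
```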
Equation (2.4) is a nonlinear transformation. See Sect. 6.1 for a high-level explanation of conjoint analysis. Also see Paczkowski (2022) for a discussion and application of the logit model. Finally, see Paczkowski (2018, 2020) for discussions about conjoint analysis in pricing and new product development, respectively. You may want to express some survey data in terms of the odds of an event happening simply because people have a prior notion of odds from sporting events and, therefore, find it easier to understand. For example, in a public opinion poll for an upcoming election, you can ask the simple question, "Will you vote for candidate X?," where X is a candidate's name. You have responses by males and females, and you want to know how much more likely males are to vote for the candidate than females.
Table 2.7 This is an example of a 2 × 2 table

Voter gender | Vote for candidate: Yes | Vote for candidate: No
Male | 7 | 3
Female | 2 | 8
The odds ratio will tell you this. The odds of an event happening are defined as

$$O = \frac{p}{1 - p} \tag{2.6}$$

where $p = Pr(Event)$ is the probability of the event occurring. For survey data, this is estimated by the sample proportion that is the unbiased, maximum likelihood estimator of the population proportion. The odds ratio is the ratio of the odds for one group relative to another. In my example, it is the ratio of the odds for males to females:

$$OR = \frac{O_M}{O_F} \tag{2.7}$$
where OM is the odds for males and OF is the odds for females. The odds ratio shows the strength of association between a predictor and the response of interest. It can vary from 0 to infinity. If the odds ratio is one, there is no association. Staying with the voting example, if the ratio of the probability of voting for the candidate to the probability of not voting is the same for men and women, then the odds ratio is 1.0. However, if it is 1.5, then men are one and one-half times more likely than women to vote for the candidate. If the OR is less than one, then the women are more likely than men. You can simply invert the odds ratio for men to get the odds ratio for women. As an explicit example, suppose you have data for the voting example for n = 10 people, and this data are arranged in a 2 × 2 table as in Table 2.7. A 2 × 2 table is the simplest, but yet a very powerful and useful, table that has two rows and two columns. The rows and columns are the levels or categories of two categorical variables. The probability of males voting for the candidate is pM = 7/10 = 0.7 for n = 10 males, so not voting is qM = 1 − pM = 1 − 0.7 = 0.3. The corresponding probabilities for females are pF = 2/10 = 0.2 and qF = 1 − 0.2 = 0.8. The voting odds for males and females are OddsM = 0.7/0.3 = 2.3333 and OddsF = 0.2/0.8 = 0.2500, respectively. The odds ratio for voting for candidate X is OR = 2.3333/0.2500 = 9.33. Thus, the odds of voting for the candidate are 9.33 times larger than the odds for a female voting for that same candidate. The inverse is 0.1071 for females. You could also create your own grouping or binning of continuous variables using the Pandas cut or qcut functions. These are applied to a single variable (i.e., column) in a DataFrame. The cut function assigns values of the variable to bins you
define or to an equal number of bins. For example, once you calculate the age of the vets for the VA study, you could bin their age into discrete categories to create a new categorical variable. If you specify bins as a list [18, 40, 60, 100], then the bins are interpreted by Pandas as (18, 40], (40, 60] and (60, 100]. Note the interval notation. This is standard math notation for a half-open interval, in this case open on the left. This means the left value is not included but the right value is included; including the right value is the default. So the interval (18, 40] is interpreted as 18 < age ≤ 40. You can change the inclusion of the right value using the argument right = False and the left value using include_lowest = True in the cut function. The function qcut does the same thing as cut but uses quantile-based binning. The quantiles might be the number of quantiles (e.g., 10 for deciles, 4 for quartiles) or an array of quantiles (e.g., [0, 0.25, 0.50, 0.75, 1.0]). See McKinney (2018) for discussions and examples.

You will have to create new variables from categorical variables for some modeling tasks. For example, if you are building a regression model, which I review in Chap. 5, but you want to include a categorical variable such as the region where the respondents live, then that variable must be recoded into several new variables called dummy, one-hot encoded, or indicator variables. The names are used interchangeably. This involves creating a new variable for each level of the categorical variable. Using the region as an example, if there are four regions that correspond to the US Census regions (i.e., Midwest, Northeast, South, and West), then four dummy variables are created. You can easily create the dummy variables using the Pandas get_dummies function. This takes an argument that is the DataFrame variable to be dummified. The new dummy variables are returned in a new DataFrame that you can then join to your original DataFrame. Optional arguments allow you to specify a prefix for each dummy (no prefix is the default) and if the first created dummy (which represents the first level in alphanumeric order) should be dropped (the default is False so all created dummies are included). As an example, the US Census Bureau defines four census regions: Midwest, Northeast, South, and West. Suppose you have a categorical variable, "Region," with these four as levels. The Pandas function specification to dummify this variable is pd.get_dummies( df[ "Region" ], prefix = [ "dum" ], drop_first = False ), and the resulting dummy variables are named dum_Midwest, dum_Northeast, dum_South, and dum_West.
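A sketch of both functions (the age and Region columns follow the examples above):

```python
# Bin the calculated ages into three ordered intervals: (18, 40], (40, 60], (60, 100]
df[ 'age_group' ] = pd.cut( df[ 'age' ], bins = [ 18, 40, 60, 100 ] )

# One-hot encode the census region and join the dummies back to the DataFrame
dummies = pd.get_dummies( df[ 'Region' ], prefix = 'dum' )
df = pd.concat( [ df, dummies ], axis = 1 )
```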
2.6 Knowing the Structure of the Data Using Simple Statistics Part of your knowledge of the structure of your data is gleaned from summary statistics and visualizations. There is a fine line, however, between using these tools for understanding structure and as a form of analysis. I will briefly mention some descriptive tools in this chapter and then develop more content for them in the next chapter as forms of Shallow Data Analysis. These tools, used here for understanding data structure, are descriptive statistics and data visualizations.
2.6.1 Descriptive Statistics and DataFrame Checks There are three Pandas methods and two Pandas attribute calls that you can use to explore your data for structure and content. An attribute is a characteristic of an object; calling it simply returns a value but otherwise does nothing.

1. df.describe( )
2. df.info( )
3. df.head( ) and df.tail( )
4. df.shape
5. df.columns
The first three are the methods and the last two are the attributes. These assume that your data are in a Pandas DataFrame named df. The describe method produces a table of descriptive summary measures for continuous floating-point variables. The statistics are the count, the mean, the standard deviation, and the Five Number Summary measures. For discrete objects such as strings or timestamps, it produces the count, the number of unique values, the top most common value, and the most common value's frequency. Timestamps also include the first and last items. The measures are all returned in a Series or DataFrame. I will illustrate the describe( ) method in Chap. 3. The info( ) method displays information about the DataFrame itself: the number of records, the number of columns (i.e., variables), the non-null record count of each column, the data type of each column (i.e., the dtype), the DataFrame's memory usage in bytes, and the count and nature of the DataFrame's index. The dtype is the data type. I show data types in Table 2.8. Of these different types, the most common for survey data are the integers and strings. The DataFrame's index is a record identifier automatically created for a DataFrame when that DataFrame is created. At first, the index is just a series of integers from 0 to one less than the number of records. If a new DataFrame has five records, the index is 0, 1, 2, 3, and 4. This is the first column when the DataFrame is displayed. You can access and use the index like any other variable in the DataFrame. You can also assign another column to be the index using the method set_index. For example, if you have survey data with a respondent ID (which almost all survey data sets will have), then a logical candidate for an index is this ID.
Table 2.8 Pandas data types. See Paczkowski (2022) for discussion about the different data types, especially the datetime

Type | Designation | Comments
Floating-point number | float64 | A floating-point number has a decimal.
Integer number | int64 | An integer is a whole number.
Datetime value | datetime64[ns] | A datetime is an internally represented value for time.
String | object | A string is a character value.
You would use df.set_index( [ 'ID' ], inplace = True ). The inplace = True argument tells the Python interpreter to overwrite the DataFrame with a copy that has ID as the index; otherwise, a new copy is created which must be named. The index could have duplicate values, but I do not recommend this. In fact, you should always use the respondent ID for the index. Finally, the index could have several levels so that you have a MultiIndex. See Paczkowski (2022) for extensive discussions of a MultiIndex. The shape and columns are not methods; they are attributes of a DataFrame. The shape attribute call returns a tuple consisting of the number of rows and columns (in that order) in the DataFrame. A tuple is an ordered, immutable object that is a basic container for data elements of any type. The columns attribute call returns a list of column names. You use both as I listed them above. See Paczkowski (2022) for a discussion about what to look for in the returned list of column names.
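A brief sketch of these checks, assuming a respondent ID column named ID:

```python
df.set_index( [ 'ID' ], inplace = True )   # use the respondent ID as the index
df.shape      # tuple: (number of rows, number of columns)
df.columns    # the column names
df.info()     # records, dtypes, non-null counts, and memory usage
```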
2.6.2 Obtaining Value Counts A very useful method is value_counts, which returns a Pandas series. This method is chained to a variable in the DataFrame. As an example, if your DataFrame df has a variable X, then df.X.value_counts() will return the frequencies of each unique value in this variable. If you use df.X.value_counts( normalize = True ), then you will get the proportion of the sample size for each unique value. The default is “normalize = False.”
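For example, for a hypothetical gender variable:

```python
df.gender.value_counts()                     # frequency of each unique value
df.gender.value_counts( normalize = True )   # proportions rather than counts
```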
2.6.3 Styling Your DataFrame Display A very powerful and useful feature of Pandas DataFrames is the style collection of methods. You can add a title (called a caption in Pandas), add a bar chart to all or select columns, color cells to highlight specific values, and format individual cells as needed. The basic setup for adding a style is df.style.StyleFeature where StyleFeature is the feature you want to use. These features are all chained to the keyword style. For example, to add the StyleFeature caption, use df.style.set_caption( 'Caption' ). To add a caption and a bar chart aligned at zero, use df.style.set_caption( 'Caption' ).bar( align = 'zero' ). I provide a list of the styling options I commonly use and that appear in this book in Table 2.9. You will see their use throughout this book.
Table 2.9 This is a partial list of styling options. All of these are chained together and also chained to the keyword style. For set_table_styles, I define a table style called tbl_styles in my Best Practices section and use it with this style

Style | Appearance | Example
set_caption | Creates a caption (i.e., title) | set_caption( 'Example Table' )
bar | Adds a bar chart | bar( align = 'zero' )
hide_index | Hides the DataFrame index | hide_index( )
set_table_styles | Sets a style for a table such as the caption font | set_table_styles( tbl_styles )
format | Sets the format for a column | format( fmt )
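A hedged sketch of the chaining pattern (the caption text and format string are placeholders):

```python
styled = ( df.describe().T
             .style
             .set_caption( 'Summary Statistics' )   # placeholder caption
             .format( '{:.2f}' )                    # two-decimal formatting
             .bar( align = 'zero' ) )               # in-cell bar chart

styled   # displaying the object in a Jupyter notebook renders the styled table
```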
2.7 Weight Calculations I discussed the issue of representation in Chap. 1 and noted that a sample is representative if the proportion of the sample on a variable of interest is proportional to the corresponding population proportion. An equality holds if there is a proportionality factor called a weight. Let the proportionality factor be $wtg_k$. Then $N_k/N = wtg_k \times n_k/n$. A weight is the factor that brings a sample into equality with the population so that the sample proportions match target population proportions. The weight is calculated as the proportion in the population divided by the proportion in the sample. Since the numerator and denominator are both positive, $wtg_k > 0$. If $wtg_k = 1$, then the two proportions already match and the sample is representative by default.

Let $P^p$ be the population proportion and $P^s$ the sample proportion on a variable of interest, a Core Question. If a group under study is females, then $P_F^p = N_F/N$ and $P_F^s = n_F/n$. Assume that $P_F^p$ is known, perhaps from a census or prior studies that have been validated. Then the weight is the factor that makes the sample proportion equal the population proportion:

$$P_F^p = wtg_F \times P_F^s$$

As an example, in a sample of n = 10 people, assume the sample proportion of females is $P_F^s = 0.80$ (i.e., there are 8 women and 2 men) and the population proportion is 50%. The weight for females is

$$wtg_F = \frac{P_F^p}{P_F^s} = \frac{0.50}{0.80} = 0.625$$
The proportionality factor 0.625 makes the sample representative of the population. If females are disproportionately larger in the sample, then the males are disproportionately smaller, so they also have to be weighted. The weight for males, using the same calculation, is 2.500. Each person in the sample is assigned the appropriate weight so all females will have a weight of 0.625 and all males will have a weight of 2.500. Each female is interpreted as representing 0.625 women, and each male is interpreted as representing 2.500 men. Notice that 0.625 × 8 + 2.5 × 2 = 10. So

$$n = n_F + n_M \tag{2.8}$$

$$= \sum_{i=1}^{n_F} wtg_{iF} + \sum_{j=1}^{n_M} wtg_{jM}. \tag{2.9}$$

Fig. 2.20 Weights are simple to calculate given the sample to weight and the assumed or known population proportions. In this example, a sample of size n = 10 consists of male and female indicators. The assumed population proportions are 50% male and 50% female
A function can be written in Python for calculating simple weights given the sample to be weighted and the population proportions. I illustrate such a function in Fig. 2.20. The weight wtgk is assigned to each unit in the sample. If the sample size is n, then each of the n units is assigned a weight wtgk appropriate for the unit. A simple merge or concatenation can be done to match the weights to the gender in a DataFrame. If k = Males, Females with wtgM = 0.714 and wtgF = 1.66, respectively, then each male in the survey gets a weight of 0.714 and each female gets a weight of 1.66 as shown in Fig. 2.20.
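The function in Fig. 2.20 is not reproduced here, but a minimal sketch of the same calculation could look like the following (the gender column name and the population proportions are this example's assumptions):

```python
def calc_weights( sample, pop_props ):
    """Weight for each level = population proportion / sample proportion."""
    samp_props = sample.value_counts( normalize = True )
    return { level: pop_props[ level ] / samp_props[ level ] for level in pop_props }

wtgs = calc_weights( df[ 'gender' ], { 'Female': 0.50, 'Male': 0.50 } )
```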
Fig. 2.21 Once simple weights are calculated from the function in Fig. 2.20, they must be merged with the main DataFrame. A Python dictionary can be used for this purpose
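A sketch of the dictionary-based assignment that Fig. 2.21 illustrates, using the weights dictionary computed above:

```python
df[ 'wtg' ] = df[ 'gender' ].map( wtgs )   # each respondent receives his or her gender's weight
```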
This simple calculation assumes the population gender distribution is 50-50, so it is a balanced or uniform distribution. The calculation is the same for any other distribution as the function in Fig. 2.20 suggests. For example, it will still hold if the target population is 25% female. This could be the case, for example, in a jewelry-buying study in which the majority of buyers of jewelry are men. I illustrate the merging of the original DataFrame and sample weights in Fig. 2.21. The target population distribution is a known constraint the sample must satisfy for representativeness. If there are several factors, and several distributions, then the number of constraints is higher and the calculations are more complex. For example, the target population distributions for age and education may also be known from a census. So distributions are now known by gender, age, and education rather than just by gender. This additional information is used to improve the sample's representativeness of the target population. The function in Fig. 2.20 can handle different population distributions for one factor (e.g., gender), but the issue of several factors is a different problem because they must be dealt with simultaneously. For this case, a procedure for handling several known population distributions, called raking, is used. I describe this in the next section.
2.7.1 Complex Weight Calculation: Raking The example I used above with male and female proportions was based on a one-way table: one row and two columns, one column for males and one for females. There is a corresponding one-way table for the population proportions. There are situations when several variables can be used to weight sample data, with these variables arranged in a contingency table. The minimum size for the table is two rows and two columns. Table 2.11 is an example of a contingency table, a two-way table in this case, of the age categories of voters and their party affiliations. The table is described as being 4 × 3: four rows and three columns. A cell is the intersection of a row and column. If i is the ith row and j is the jth column, then the cell at their intersection is cell ij. The cell values are the count of people with that combination of row and column. The count for cell ij is called the frequency and is designated as $f_{ij}$. If the cells of the sample contingency table with r rows and c columns are summed, then you have the total sample size, n: $n = \sum_{i=1}^{r}\sum_{j=1}^{c} f_{ij}$. If you divide each cell's frequency by n, you get the proportion of the sample in that cell. If $p_{ij}$ is the proportion of the sample in cell ij, then $\sum_{i=1}^{r}\sum_{j=1}^{c} p_{ij} = 1$. Each table has a marginal distribution, one for the rows and one for the columns. The column marginal distribution is the sum of the proportions in each column, which is 1.0. That is,

$$\sum_{j=1}^{c} p_{.j} = 1 \tag{2.10}$$
where the subscript "." indicates that summation is over the rows. The row marginal distribution is comparably defined:

$$\sum_{i=1}^{r} p_{i.} = 1. \tag{2.11}$$
The row labeled “Total” is the column marginal, and the column labeled “Total” is the row marginal. There is an extensive literature on the analysis of contingency tables. See, for example, Agresti (2002). There should be, and occasionally there is, a corresponding table for the population. Usually, such tables do not exist, but the marginal distributions do exist. They are available from many sources, but the US Census for a wealth of demographic data is probably the most comprehensive source of marginal distributions in the United States. When population data are available on the marginal quantities of a contingency table, a procedure called iterative proportional fitting, balancing, or raking can be used to adjust the cell values of the sample-based contingency table so that the
Table 2.10 These two population distributions are the population marginals for a 4 × 3 contingency table. Notice that the two distributions each sum to 100. The age labels in Panel (a) are from https://www.people-press.org/2018/03/20/1-trends-in-party-affiliation-among-demographic-groups/, last accessed May 11, 2020

(a) Voting age distribution
Millennial | X-gen | Boomers | Silent
33 | 24 | 30 | 13

(b) Political affiliation
Democrat | Republican | Independent
29 | 35 | 36

Table 2.11 This is a sample contingency table for raking. The row and column marginals do not sum to population totals, so the cell values have to be adjusted

 | Democrat | Republican | Independent | Total
Millennial | 3 | 5 | 4 | 12
X-gen | 6 | 9 | 9 | 24
Boomers | 4 | 7 | 11 | 22
Silent | 5 | 8 | 5 | 18
Total | 18 | 29 | 29 | 76
marginals of the adjusted table equal the known population marginals. This is an old procedure that has been shown to be very effective and reliable. See Deming and Stephan (1940) and Deming (1943, Chapter VII) for an early treatment of raking. Two important requirements for the procedure to work are:

1. The population marginal distribution for the rows of a table and those for the columns must both sum to the same value. That is, if there are r row marginal values and c column marginal values, then $\sum_{i=1}^{r} M_i = \sum_{j=1}^{c} M_j$ where $M_i$ is a row marginal value and $M_j$ is a column marginal value. The set of $M_i$ values is the row marginal distribution, and the set of $M_j$ values is the column marginal distribution.
2. There can be no zero values or missing values in either the row or column marginal distributions. If this is not met, a small value (usually one) can be added to the relevant marginal cell.

As an example of the first requirement, suppose a 4 × 3 sample contingency table has a voting age distribution for the rows (e.g., millennial (18–35), X-gen (36–51), boomers (52–70), and silent (71+)) and political party affiliation for the columns (e.g., Democrat, Republican, Independent). The sum of the population frequency counts for the four age designations must match the sum of the population frequency counts for the three political party affiliations. You can check this in the example in Table 2.10.
row. The remaining rows are adjusted the same way using the appropriate population marginal numbers. When all four rows have been adjusted, the columns are successively adjusted the same way. When all three columns have been adjusted, this marks the end of an iteration. A second iteration is started following the same steps but using the adjusted table. Iterations continue until the changes in the cell values are below a specified tolerance level, and then the iterations stop. I provide a Python script for executing these steps in Fig. 2.22. There is a Python package called ipfn that will handle raking for multiple dimensions. It is installed using pip install ipfn. An example of its use for the two-dimensional table in Table 2.11 is shown in Fig. 2.23. The raked results in Figs. 2.22 and 2.23 are fitted totals. What about the weights themselves that are used to weight all the data? These are found by element-by-element division of the fitted population totals by the sample totals. The calculation is shown in Fig. 2.24. The weights DataFrame in Fig. 2.24 has to be stacked so that it can be merged with the original sample DataFrame. I show stacking in Fig. 2.25. The merge procedures I previously discussed can be used for the merge. The weights should be examined because sometimes some weights are too large and could bias estimated parameters. Some analysts trim the weights because of this. See, for example, Potter and Zheng (2015). I show some simple analyses of the weights from Fig. 2.25 in Fig. 2.26.
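As a minimal sketch of these steps (my own code, not the scripts in the figures), the raking iteration, the weight calculation, and the stacking might look like this, using the values from Tables 2.10 and 2.11:

```python
import pandas as pd

# Sample contingency table from Table 2.11: age (rows) by party (columns)
tab = pd.DataFrame(
    [[3, 5, 4], [6, 9, 9], [4, 7, 11], [5, 8, 5]],
    index=['Millennial', 'X-gen', 'Boomers', 'Silent'],
    columns=['Democrat', 'Republican', 'Independent'],
    dtype=float,
)

# Population marginals from Table 2.10; each set sums to 100
pop_rows = pd.Series([33, 24, 30, 13], index=tab.index)
pop_cols = pd.Series([29, 35, 36], index=tab.columns)

# Rake: adjust all rows, then all columns; repeat for five iterations
fitted = tab.copy()
for _ in range(5):
    fitted = fitted.mul(pop_rows / fitted.sum(axis=1), axis=0)
    fitted = fitted.mul(pop_cols / fitted.sum(axis=0), axis=1)

# Weights: element-by-element division of the fitted totals by the sample totals
weights = fitted / tab

# Stack the weights so they can be merged with the sample DataFrame
stacked = weights.stack().reset_index()
stacked.columns = ['Age', 'Party', 'Weight']
```

After merging on the age and party variables, each respondent in a cell receives the corresponding weight.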
2.7.2 Types of Weights

There is more to weights than most suppose. There are at least two weights that should be considered, although most analysts consider only one, if they use weights at all. As noted by Lavallee and Beaumont (2015), weighting survey data is often ignored in practice. Also see Gelman and Carlin (2002), who cite Voss et al. (1995) on the lack of weighting. Reasons for this could include a lack of understanding about how to calculate weights or a lack of information regarding the population. But it could also be the complexity of weights. There are several weights that have to be considered beyond the simple description I provided above. These are:
1. Design weights
2. Nonresponse weights
3. Post-stratification weights
The last, the post-stratification weight, is sometimes called a calibration weight. Nonresponse weights adjust the sample data for those members of the target population who do not respond to a survey because they do not want to participate or are unavailable when the survey is conducted. See Groves et al. (2002) for a collection of articles dealing with just this one topic. This should give you an idea of its complexity. It is also an important one because, as I noted above, a goal of a survey is to collect data to estimate population amounts. These estimates should be unbiased,
Fig. 2.22 This is a script to do raking of the table in Table 2.11 using the population marginal distributions in Table 2.10. Five iterations were done. (a) Script. (b) Output
Fig. 2.23 This is a script to do raking of the table in Table 2.11 using the ipfn function. (a) ipfn Script. (b) Output
Fig. 2.24 The sampling weights are easily calculated by element-by-element division of the fitted sample totals from Figs. 2.22 or 2.23 by the sample totals
Fig. 2.25 The weights in Fig. 2.24 are stacked for merging with the main sample DataFrame
but as noted by Bethlehem (2002, p. 276), the “stronger the relationship between target variable and response behavior, the larger the bias. The size of the bias also depends on the amount of nonresponse. The more people are inclined to cooperate in a survey, the higher the average response probability will be, resulting in a smaller bias.”
[Histogram for Fig. 2.26: the weights on the x-axis, running roughly from 1.0 to 3.5, and their frequency on the y-axis]
Fig. 2.26 The weights in Fig. 2.25 are examined using a simple descriptive summary and histogram
Post-stratification weighting is used to adjust for over-/undersampling groups in the population. This might be due to difficulty in finding members of a group so that they become underrepresented, or some members of a group may be more prone to not answering survey questions. Men, for example, may be less inclined than women to participate in a survey about jewelry purchased for themselves. This latter example involves nonresponse, which indicates that nonresponse is one reason for over-/undersampling. Post-stratification weighting could reduce the bias due to nonresponse. See Bethlehem (2002). Design weights are a technical issue that most practitioners are likely unaware of. As noted by Gelman and Carlin (2002, p. 290), these weights are concerned with the sample itself before the sample is collected. They are known "at the time the survey is designed." The post-stratification weights, however, can only be known or calculated after the survey data are collected. Their "intuitive" role is the same, to weight the data to make them representative of the population, but their statistical function is different. Both are needed. See Dorofeev and Grant (2006) for a discussion of design and nonresponse weights and the complexity of weighting in general. If there are several weights, and assuming they are all known, how are they used? Only one weight is applied to each survey respondent, yet it is not uncommon for several weights to be calculated. Since only one can be applied, the individual weights are simply multiplied together to form that single weight.
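As a quick sketch, with hypothetical column names for the three weights, the single respondent-level weight is just their product:

```python
import pandas as pd

# Hypothetical respondent-level weights; the usable weight is their product
df = pd.DataFrame({
    'design_wt':      [1.2, 0.9, 1.1],
    'nonresponse_wt': [1.0, 1.3, 0.8],
    'poststrat_wt':   [1.1, 1.0, 1.2],
})
df['weight'] = df['design_wt'] * df['nonresponse_wt'] * df['poststrat_wt']
```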
2.8 Querying Data

It is not uncommon, before tabulations or visualizations are done, to query a data set to create subsets for a particular focus or discussion. Querying is the process of asking questions of the data and not just getting answers but also creating the subsets. For example, you might have a preelection poll of voters in a presidential election, and you want to know how only women will vote. You need to subset your entire data set of voters to create a subset of only women voters. From this subset, you could then create tabulations or visual displays as well as do advanced analyses. Similarly, you could conduct a customer satisfaction survey using a 5-point Likert Scale to measure satisfaction and want to understand those who indicated dissatisfaction. Consider the presidential preelection poll example I mentioned above. Suppose you have a survey DataFrame consisting of poll responses of potential voters from a number of voting locations around the country in the 2016 presidential election. Three queries are shown in Figs. 2.27, 2.28, and 2.29 to illustrate an increasing complexity of queries. The first is a simple query for female voters only, the second is for female voters who are 100% sure they will vote, and the third is for female voters who are 100% sure or extremely likely to vote.
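A minimal sketch of the three queries, using the Pandas query method with made-up column names and response labels rather than the exact setups in the figures:

```python
import pandas as pd

# Hypothetical poll data; a real survey DataFrame would have many more columns
df_poll = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Female'],
    'Likely': ['100% likely', 'Somewhat likely', 'Extremely likely', 'Not very likely'],
})

# Query 1: female voters only (cf. Fig. 2.27)
females = df_poll.query("Gender == 'Female'")

# Query 2: female voters who are 100% sure they will vote (cf. Fig. 2.28)
females_sure = df_poll.query("Gender == 'Female' and Likely == '100% likely'")

# Query 3: female voters who are 100% sure or extremely likely to vote (cf. Fig. 2.29)
females_likely = df_poll.query(
    "Gender == 'Female' and Likely in ['100% likely', 'Extremely likely']"
)
```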
Fig. 2.27 This is a query setup for female voters only
Fig. 2.28 This is a query setup for female voters who are 100% likely to vote
Fig. 2.29 This is a query setup for female voters who are 100% likely to vote or extremely likely to vote
Chapter 3
Shallow Survey Analysis
Contents
3.1 Frequency Summaries
  3.1.1 Ordinal-Based Summaries
  3.1.2 Nominal-Based Summaries
3.2 Basic Descriptive Statistics
3.3 Cross-Tabulations
3.4 Data Visualization
  3.4.1 Visuals Best Practice
  3.4.2 Data Visualization Background
  3.4.3 Pie Charts
  3.4.4 Bar Charts
  3.4.5 Other Charts and Graphs
3.5 Weighted Summaries: Crosstabs and Descriptive Statistics
Once you understand your data's structure, you can begin to analyze them for your Core Questions. Analysis usually begins by creating tabulations (the "tabs") and visualizations. I classify these as Shallow Analyses. They are shallow because they only skim the surface of your data, providing almost obvious results but not penetrating insight. Summaries are usually developed and presented as if they are detailed analyses, but they do not convey the essential and critical information contained in the data. Key decision-makers do not get the information they need to make their best decisions. If anything, Shallow Analysis raises more questions than it answers. In addition, those who develop Shallow Analyses pass the actual analyses on to their clients, who must decipher meaning, content, and messages from them. These are the responsibilities of the analysts, responsibilities met by Deep Analysis but left unaddressed by Shallow Analysis. What is covered by Shallow Analysis? The following are covered:
1. Frequency summaries
2. Basic descriptive statistics
3. Cross-tabulations
4. Simple visuals such as pie and bar charts
These four have their use in describing Surround Questions that form the foundation for Core Questions, but too often, the same four are applied to Core Questions and basically leave them unanswered. They have their purpose, but it is a limited one, and they should not be used as the end-all of survey data analysis.1
3.1 Frequency Summaries

I stated in Chap. 2 that most, but not all, questions in a survey are categorical. There are quantitative measures such as the average price paid for a product on the last purchase occasion, the average number of patients seen per week in a medical practice, or the number of times someone has voted in any election in the past year. There are other quantitative measures that are calculated, such as the age of a veteran based on the year of birth in the VA study. Quantitative statistical methods can be used to analyze them. Some measures, however, are controversial regarding whether or not they are quantitative. The Likert Scale is the prime example. See Brill (2008) for a brief discussion of the Likert Scale. There are two opposing viewpoints on the Likert Scale. The first is that it is strictly categorical, with survey respondents self-categorizing themselves as belonging to one scale group or another. For a 5-point Agree-Disagree question, for example, respondents pick one of the five scale points to describe themselves. They cannot pick anything between two points, nor can they pick a point below the first or above the last. In this view, the scale is not only categorical but also ordinal since the points have a definite order. If it runs from Strongly Disagree to Strongly Agree, then Strongly Disagree < Somewhat Disagree < Neither Agree Nor Disagree < Somewhat Agree < Strongly Agree. Only ordinal-type analysis can be done; arithmetic operations (e.g., addition, subtraction, multiplication, and division) cannot, which excludes calculating means and standard deviations. In addition, since the scales are ordinal, "one cannot assume that the distance between each point of the scale is equal, that is, the distance between 'very little' and 'a little' may not be the same as the distance between 'a lot' and 'a very great deal' on a rating scale. One could not say, for example, that, in a 5-point rating scale (1 = Strongly Disagree; 2 = Disagree; 3 = Neither Agree Nor Disagree; 4 = Agree; 5 = Strongly Agree) point 4 is in twice as much agreement as point 2, or that point 1 is in five times more disagreement than point 5." See Cohen et al. (2007, p. 502). You can only place responses in order. See Jamieson (2004) on some abuses of the Likert Scale. Also see Brill (2008), who notes that survey analysts often use the scale incorrectly.
1 I have seen many market research study analyses, both as the market research client/department head and the consultant advising about analysis methods, that have just one pie chart after another throughout a report as if they were all that were needed. This is Shallow Analysis to the extreme but yet indicative of what I am claiming.
What can you do if arithmetic operations are inappropriate? The recommended operations with ordinal data are calculating the median, the mode, and frequencies. From the frequencies, it is simple to calculate percentages, which are just normalized frequencies. Percentages are preferable since they have a common base (i.e., the sample size) and allow you to compare the distribution of responses in one survey to that in another; you cannot do this with frequencies since there is no common base. The second viewpoint is that there is a continuous but latent scale underlying the Likert Scale points. Not only is it continuous, but it is also an interval scale. Some have argued that increasing the number of scale points to 11 better approximates a continuous distribution, thus allowing the use of arithmetic operations. See Wu and Leung (2017) for a review of this point. Also see Leung (2011) and Hodge and Gillespie (2007).
3.1.1 Ordinal-Based Summaries

The basic tool to use with ordinal data is a frequency table that summarizes the frequencies for each level of the scale. You could create a simple table using the Pandas value_counts() method. This takes two Boolean arguments: sort and normalize.2 The sort argument, if set to True (the default), instructs value_counts to sort the results by the frequencies; otherwise, the results are left in no particular order. The normalize argument, if set to True (the default is False), divides each count by the sample size (i.e., the total of all frequencies) and expresses the result as a proportion. The result is returned as a Pandas series. You can easily create a frequency table using simple Python/Pandas statements. I illustrate one possibility in Fig. 3.1 for one of the VA healthcare battery of Likert questions. This figure shows an extension of the basic value_counts call to create the frequencies, the normalized frequencies (i.e., proportions), and their cumulative counterparts. The cumulative numbers are based on the cumsum( ) method. I placed these components into a DataFrame, which I then styled. A dictionary (named dt) has the questions associated with the variables, and the relevant question is used as a table caption. The Likert items have the categorical data type I discussed in Chap. 2 (see Fig. 2.16).3
2 It also has three other arguments: ascending, bins, and dropna. I find the two in the text to be the most useful. 3 If you are familiar with SAS, you should recognize the output in Fig. 3.1 as the SAS Proc Freq layout.
Fig. 3.1 Frequency summary table based on the value_counts method. The variable previously had a CategoricalDtype defined for the ordinal Likert items. See Fig. 2.16. A bar chart is overlaid on each column of the table using the bar styling argument
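A minimal sketch of this kind of frequency table, using a made-up Likert item rather than the VA data:

```python
import pandas as pd

# Made-up ordered Likert responses for one question
likert = pd.CategoricalDtype(
    ['Strongly Disagree', 'Somewhat Disagree', 'Neither Agree Nor Disagree',
     'Somewhat Agree', 'Strongly Agree'],
    ordered=True,
)
q1 = pd.Series(
    ['Somewhat Agree', 'Strongly Agree', 'Somewhat Disagree',
     'Somewhat Agree', 'Neither Agree Nor Disagree'],
    dtype=likert,
)

# Frequencies, proportions, and their cumulative counterparts
freq = q1.value_counts(sort=False)
prop = q1.value_counts(sort=False, normalize=True)
tbl = pd.DataFrame({
    'Frequency': freq,
    'Proportion': prop,
    'Cumulative Frequency': freq.cumsum(),
    'Cumulative Proportion': prop.cumsum(),
})
```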
3.1.2 Nominal-Based Summaries

It should be clear that if a categorical variable is nominal, then the cumulative columns in Fig. 3.1 do not make sense and, therefore, should be omitted. Figure 3.2 is an example of a question that asked if the veteran respondent has a healthcare provider. The possible responses are "Yes," "No," and "Don't Know," which are nominal.
3.2 Basic Descriptive Statistics

Responses to ratio-based questions can be summarized by the descriptive statistics taught in a basic statistics course. These include:
1. Mean or average
2. Percentiles
3. Standard deviation
4. Range
Fig. 3.2 Frequency summary table based on the value_counts method without the cumulative columns. The variable has a CategoricalDtype defined for the nominal items
These are easily obtained using the Pandas describe( ) method chained to a DataFrame. As an example, the yogurt study contains data on the average price paid for yogurt and the number of units (i.e., containers) of yogurt purchased. The DataFrame contains a large number of variables; I extracted only a few, as I show in Fig. 3.3. The describe( ) method returns the main summary measures for ratio data: count, mean, standard deviation, and the Five Number Summary. The Five Number Summary is a classic label for the five percentile measures: 0th percentile (i.e., minimum), 25th percentile (i.e., first quartile or Q1), 50th percentile (i.e., median or Q2), 75th percentile (i.e., third quartile or Q3), and the 100th percentile (i.e., the maximum). You can calculate the range (= Maximum − Minimum) and the Interquartile Range (IQR = Q3 − Q1). I illustrate the application of describe( ) to the data in Figs. 3.3 and 3.4. The describe( ) method can take an argument to specify the percentiles, such as percentiles = [ 0.05, 0.25, 0.75, 0.95 ]; the default is the Five Number Summary. You can calculate individual statistics for a variable in the DataFrame using a method chained to the DataFrame. I list several available statistical functions in Table 3.1. As an example of a method's use, the mean of a variable in a DataFrame can be calculated by chaining the mean( ) method to the DataFrame. I provide an example in Fig. 3.5.
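As a sketch, with made-up price and units values in place of the actual yogurt data:

```python
import pandas as pd

# Made-up price and units-purchased data
df = pd.DataFrame({
    'Price': [3.49, 2.99, 4.25, 3.10, 5.00, 2.75],
    'Units': [2, 1, 3, 2, 4, 1],
})

# Count, mean, standard deviation, and the Five Number Summary, transposed
summary = df.describe().T

# Custom percentiles instead of the default Five Number Summary
summary_pct = df.describe(percentiles=[0.05, 0.25, 0.75, 0.95]).T

# An individual statistic chained to the DataFrame
avg_price = df['Price'].mean()
```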
Fig. 3.3 This is a subset of the yogurt data with price and units purchased
Fig. 3.4 These are the descriptive statistics for the yogurt price and units purchased data in Fig. 3.3. The T chained to the describe( ) method is a function to transpose the display. Without this function, the variables are listed as the columns and the statistics as the rows
Table 3.1 This is a list of statistical methods in Pandas. See McKinney (2018, p. 160)

Function   Description
count      Number of non-NA observations
sum        Sum of values
mean       Mean of values
mad        Mean absolute deviation
median     Arithmetic median of values
min        Minimum
max        Maximum
mode       Mode
abs        Absolute value
prod       Product of values
std        Bessel-corrected sample standard deviation
var        Unbiased variance
sem        Standard error of the mean
skew       Sample skewness (3rd moment)
kurt       Sample kurtosis (4th moment)
quantile   Sample quantile (value at %)
cumsum     Cumulative sum
cumprod    Cumulative product
cummax     Cumulative maximum
cummin     Cumulative minimum
Fig. 3.5 This illustrates how to use the mean( ) chained to a DataFrame. This uses the price variable for the yogurt study
3.3 Cross-Tabulations

A cross-tabulation (a.k.a. the "tabs") is the most commonly used tool in the Shallow Analysis of survey data.4 It is greatly abused, overworked, and too simplistic. The crosstab itself is a long-used tool in statistical analysis, so its use in survey data analysis cannot be denigrated. The issue I have with its use for Shallow Analysis is that many practitioners do not go beyond a crosstab to examine or test relationships inside the tabs; they stop with the tabs themselves and avoid their Deep Analysis, the
4 I once had a client who told me, "Everything you want to know is in the tabs."
Fig. 3.6 This is a basic crosstab of frequencies for combinations of the levels or categories of two categorical variables. A CategoricalDtype was created to place the client's major competitor next to the client in the column listing. The brands are the rows and the segments are the columns. Pandas' crosstab function by default places the row and column labels in alphanumeric order. Notice the column labels
subject of the next chapter. The Deep Analysis of crosstabs in statistics deals with hypothesis testing, something eschewed by those who merely "look" at the tabs. I just want to introduce tabs in this chapter to show how they are created. Pandas has a crosstab function for creating a tab. It creates a DataFrame displayed as a table with rows and columns. The dimension of a table is always written as the number of rows by the number of columns, in that order. If r is the number of rows and c the number of columns, then a table dimension is r × c. I show how to create a crosstab in Fig. 3.6 for two categorical variables from the yogurt data. The tab is specified as "Brands" for rows and "Segments" for columns, with rows first followed by columns. A problem with Fig. 3.6 is that frequencies are shown. As I mentioned previously, frequencies reflect the sample; they are not generalizable to other samples. Proportions (or percents) are better for comparisons to other survey results because they reflect normalization to a common base. Pandas allows several types of normalizations for the crosstab function:
• By the sum of values (argument: "all" or True)
• By the sum of each row (argument: "index")
• By the sum of each column (argument: "columns")
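A minimal sketch of these calls, with made-up brand and segment responses in place of the yogurt data:

```python
import pandas as pd

# Made-up survey responses: brand purchased and market segment
df = pd.DataFrame({
    'Brand':   ['A', 'B', 'A', 'C', 'B', 'A'],
    'Segment': ['Value', 'Premium', 'Premium', 'Value', 'Value', 'Premium'],
})

# Frequencies: brands as rows, segments as columns
xtab = pd.crosstab(df.Brand, df.Segment)

# Proportions normalized by the grand total, with row and column marginals added
xtab_norm = pd.crosstab(df.Brand, df.Segment, normalize='all', margins=True)
```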
Fig. 3.7 This is an enhanced version of the crosstab in Fig. 3.6. Notice how the row and column marginal distributions each sum to 1.0. Each can be interpreted as a probability distribution
You can also normalize the margin values if "margins" is set to True. All tables have two margins: one for rows and one for columns. The row margin is the sum of the columns for each row, and the column margin is the sum of the rows for each column. If the tab has r rows, then the row margin has r elements. If the tab has c columns, then the column margin has c elements. If $f_{ij}$ is the frequency of the cell at the intersection of row $i$ and column $j$, $i = 1, 2, \ldots, r$ and $j = 1, 2, \ldots, c$, then $f_{i.} = \sum_{j=1}^{c} f_{ij}$, $f_{.j} = \sum_{i=1}^{r} f_{ij}$, and $f_{..} = \sum_{i=1}^{r} \sum_{j=1}^{c} f_{ij} = n$. Note that $f_{..} = \sum_{i=1}^{r} f_{i.} = \sum_{j=1}^{c} f_{.j} = n$. The margins show the distribution of the row and column labels. Figure 3.7 is the same crosstab as Fig. 3.6 but with normalization by the sum of all values. I also added row and column marginals to show how the proportions sum for the rows and columns. For this example, the row sums are the distribution of respondents' brand selection, and the column sums are the distribution of respondents by their segment assignment. I will look at those distributions in the next chapter. A variation on the basic crosstab is the use of a third variable to populate the cells of the tab rather than the frequencies (or proportions) for the two variables defining
Fig. 3.8 This is an enhanced version of the crosstab in Fig. 3.6 with the mean price as a value for each combination of segment and brand

Table 3.2 This is a partial listing of key parameters for the crosstab function. See the Pandas documentation for more details. Example usage: xtab = pd.crosstab( df.Segment, df.Brand, normalize = 'all', margins = True ). See McKinney (2018, p. 315)

Parameter   Setting
Index       First position; list of variables for the rows or index of the table
Columns     Second position; list of variables for the columns of the table
Values      List of variables to be aggregated
Aggfunc     Aggregation function as a list or dictionary; none is default
Margins     Display margins; false is default
Normalize   Normalize the entries; false is default
the table. For example, you might want to know the average price everyone paid for yogurt by their segment and brand purchased. This is a combination of categorical and quantitative analysis. You can use the Pandas crosstab function with a values and aggfunc argument. The aggfunc argument specifies how the values argument is handled for each combination of the rows and columns. The mean and sum are the most common options. I give an example in Fig. 3.8. I also provide the key parameters for the crosstab function in Table 3.2. An enhanced version of the Pandas crosstab function is the pivot_table method. This is very close in functionality to crosstab but with some major differences. The crosstab function is best for producing simple crosstabs, while pivot_table produces
Table 3.3 This is a partial listing of key parameters for the pivot_table function. See the Pandas documentation for more details. Example usage: pv_tbl = pd.pivot_table( df, index = 'Segments', columns = 'Brand', values = 'Price' ). See McKinney (2018, p. 315)

Parameter   Setting
DataFrame   First positional parameter
Index       List of variables for the rows or index of the table
Columns     Optional
Values      List of variables to be aggregated
Aggfunc     Aggregation function as a list or dictionary; mean is default
Margins     Display margins; false is default
these tabs, albeit with a slight difference, but with more flexibility for producing other layouts beyond the "tabs." The crosstab function requires arrays for the row and column arguments. The default aggregation is counting the number of entries in all cell combinations. Other aggregation methods can be specified, but the value to be aggregated must also be specified. The pivot_table method requires a DataFrame, and the default aggregation is the mean. The crosstab function takes the array arguments and internally converts them to a DataFrame, which means it does a little more processing than the pivot_table method. A recommendation: if your data are already in a DataFrame, which they most likely will be, use the pivot_table method. I provide the main arguments for pivot_table in Table 3.3. Unfortunately, unlike for crosstab, there is no normalization argument. You can, however, do your own normalization with a simple statement chained to the pivot_table command. Suppose you want row and column marginals. You would use the apply method on the axis to be marginalized with a lambda function as the argument.5 To get a row margin, use axis = 1, which applies the function row by row; for a column margin, use axis = 0, which applies it column by column. You could normalize the entire table using applymap, which applies a function to each element of the table. pivot_table, unlike crosstab, takes a minimum of one argument: index. This creates a one-way table with the index for the rows. I show a basic tab created with the pivot_table method in Fig. 3.9 and a variation using only the index argument for a one-way table in Fig. 3.10. The crosstab function is chained to the Pandas alias pd, and the pivot_table is chained to the DataFrame name.
5 You could, of course, define a marginal function and place it in the Best Practices section of your Jupyter notebook. This function would then be available whenever you need it.
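As a sketch of these points, with made-up data and names in place of the yogurt DataFrame:

```python
import pandas as pd

# Made-up responses with segment, brand, and price paid
df = pd.DataFrame({
    'Segment': ['Value', 'Premium', 'Value', 'Premium', 'Value'],
    'Brand':   ['A', 'A', 'B', 'B', 'A'],
    'Price':   [2.99, 3.49, 2.79, 3.59, 3.09],
})

# Mean price by segment and brand; the mean is pivot_table's default aggregation
pv_tbl = df.pivot_table(index='Segment', columns='Brand', values='Price')

# One-way table using only the index argument
one_way = df.pivot_table(index='Segment', values='Price')

# The crosstab equivalent, with values and aggfunc specified explicitly
xtab_price = pd.crosstab(df.Segment, df.Brand, values=df.Price, aggfunc='mean')

# Margins via apply: row by row with axis=1, column by column with axis=0
row_margin = pv_tbl.apply(lambda r: r.sum(), axis=1)
col_margin = pv_tbl.apply(lambda c: c.sum(), axis=0)
```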
Fig. 3.9 This is a basic crosstab like the one in Fig. 3.6 but based on the pivot_table method
Fig. 3.10 This is a one-way table based on the pivot_table method using only the index argument. Notice that this reproduced the row marginal distribution in Fig. 3.9. Other configurations are possible
3.4 Data Visualization

In addition to tabs, simple pie and bar graphs are the most common forms of Shallow Data Analysis tools used to analyze, as well as present, survey data. Both are used to display results for nominal or ordinal data such as simple Yes/No questions and Likert Scale questions. Since these types of questions dominate questionnaires, both Surround and Core Questions, their use is understandable. Their overuse, however, is an issue since they do not, and cannot, penetrate the data, especially for the Core Questions, to provide the insight and information key decision-makers require. The
reason is their almost single focus on one or two variables at a time so that they do not clearly show relationships. In addition, the data behind these two graphs are not subjected to statistical tests, so it is impossible to say if there is any statistical significance to differences among pie slices or bars. Some visual issues make it difficult at times to discern differences among pie slices and bar chart bars. As humans, we have difficulty seeing differences in angles and depths, although we are better at discerning differences in lengths. For instance, Hegarty (2011, p. 459) notes that "perception of angles is necessary to understand pie charts, perception of position along a common axis is necessary to understand bar charts, and position along nonaligned scales is necessary to compare corresponding elements in stacked bar charts." She further notes that based on work by Cleveland (1994), "perceiving position along a common scale was judged as the most accurate, followed by position along nonaligned scales, comparisons of line lengths, angles, areas, and volumes in that order. The ordering of the necessary perceptual tasks was used to predict the effectiveness of different types of graphs, for example, that bar charts are more effective than pie charts for presenting relative magnitudes because position along a common scale is a more accurate perceptual judgment than is angle." See Kosslyn (2006) and Peebles and Ali (2015) for some discussions. I will review these two basic graphs in this section along with some slight extensions. My objective is not to delve into the visualization issues any further than what I mentioned but to show how they can be and are used in survey analysis and how they can be developed in Python.
3.4.1 Visuals Best Practice

There is one Best Practice I recommend for data visualization: annotate your graphs. This should include the base and the question(s) used to create the plotted data. The base is the sample size, type, and composition of respondents. I usually define a base variable for the base information and question. A footer function displays the information at the bottom of the graph. The footer function is defined in the Best Practices section upfront in a Jupyter notebook. You will see many examples of its use in this book. You can see an example of this annotation in Fig. 3.12.
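One possible version of such a footer function, a sketch of my own built on Matplotlib's figure-level text rather than the book's actual definition:

```python
import matplotlib.pyplot as plt

def footer(fig, base, question):
    """Annotate a figure with the base and question at the bottom-left corner."""
    fig.text(0.01, 0.01, f'Base: {base}\nQuestion: {question}',
             ha='left', va='bottom', fontsize=8)

# Usage sketch with made-up data
base = 'all respondents; n = 100'
fig, ax = plt.subplots()
ax.bar(['Yes', 'No', "Don't Know"], [60, 30, 10])
footer(fig, base, 'Do you have a healthcare provider?')
```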
3.4.2 Data Visualization Background

Python has a number of data visualization packages, each with its own strengths and purpose. Pandas has built-in plotting methods that allow easy graphic rendering of DataFrame variables. I certainly recommend these methods. I also recommend Seaborn for more stylized and scientific visualization of data. Seaborn reads Pandas
DataFrames, so you do not have to do any special data processing to move from Pandas to Seaborn. Both Pandas and Seaborn have a common base for their graphing tools and functionalities. This is Matplotlib. This is a large and complex package with a submodule called pyplot that allows you to control and interact with any aspect of a graph.6 Because this capability is extensive, it is also a challenge to use. Pandas and Seaborn are interfaces to the Matplotlib library to make it easier for you to access its functionality in a more streamlined fashion than directly accessing Matplotlib. Even though Pandas and Seaborn are interfaces to Matplotlib's complexities, you still need to know about it and how graphs are produced so you can more effectively create and annotate them with titles and labels at a minimum. I will describe some background in this subsection. For more details, see the Matplotlib documentation online, VanderPlas (2017, Chapter 4), and Hunt (2019, Chapters 5 and 6). You import Matplotlib in the Best Practices section of your Jupyter notebook like any other package. Normally, however, the submodule pyplot is imported to interface with Matplotlib and to manage figures. The import statement is import matplotlib.pyplot as plt. The alias is conventional. Matplotlib is used to specify a plotting canvas where all elements of a graph are "painted." The canvas is referred to as a figure. It has a size with a default written as a tuple, (6.4, 4.8). The first element is the figure width in inches, and the second is the figure height, also in inches. You can, of course, change this figure size. You can create a figure space using fig = plt.figure( ), where fig is the name of the figure space. This will be a blank space. A graph or set of graphs is drawn inside a figure space. The graph is referred to as an axis. This is confusing because a graph per se has axes (note the spelling) such as the X-axis and the Y-axis. These are referred to as spines. If there are several graphs in a figure space, then there are several axes, each with different spines and separately named. The default number of axes is, of course, 1. You can explicitly define a figure space and the axes using fig, ax = plt.subplots( nrows = 1, ncols = 1 ), where fig is the name of the figure space and ax is the name of the axis inside the figure space. Notice that a graph is not produced. Only the area is set where the graph will eventually be drawn. In this example, I created only one axis in a grid that is one row by one column. Other examples are:
fig, ax = plt.subplots( ) for one axis in the figure (the default)
fig, axs = plt.subplots( 2, 2 ) for four axes in a 2 × 2 grid
6 pyplot was actually written to emulate a popular non-Python software: MATLAB. This is not really important to know for its use.
Fig. 3.11 This is the structure for two figures in Matplotlib terminology. Panel (a) on the left is a basic structure with one axis (ax) in the figure space. Panel (b) on the right is a structure for two axes (ax1 and ax2) in a (1 × 2) grid

Table 3.4 These are a few Matplotlib annotation commands

Graph component   Command     Example
Figure title      suptitle    fig.suptitle( figureTitle )
Axis title        set_title   ax.set_title( axisTitle ); ax1.set_title( axis1Title )
X-axis label      xlabel      ax.set_xlabel( X-label ); ax1.set_xlabel( X1-label )
Y-axis label      ylabel      ax.set_ylabel( Y-label ); ax1.set_ylabel( Y1-label )
fig, (ax1, ax2 ) = plt.subplots( 1, 2 ) for two axes in a 1 × 2 grid
For a 2 × 2 grid, you access the individual axes using axs[0, 0], axs[0, 1], axs[1, 0], and axs[1, 1]. Matplotlib documentation states that ax and axs are the preferred names for the axes for clarity. If you create a blank figure space using fig = plt.figure( ), you can subsequently add axes using fig.add_subplot( nrows, ncols, index ). I illustrate some figure structures in Fig. 3.11. You use the axis name to access parts of a graph such as a title, X and Y axis (i.e., spine) labels and tick marks, and so forth. You use the figure name to access the figure title. You do this using the axis name chained to the appropriate graph components. I list options in Table 3.4. There are Matplotlib functions to create scatter plots, pie charts, bar charts (horizontal and vertical), and much more. My recommendation is to use the Pandas and Seaborn graphing commands for the actual graphs. The Matplotlib annotation components I provide in Table 3.4 allow you to manipulate those graphs. I provide some examples in the next subsections.
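A short sketch pulling these pieces together, one figure space with a 1 × 2 grid of axes and the annotation commands from Table 3.4:

```python
import matplotlib.pyplot as plt

# One figure space holding two axes in a 1 x 2 grid
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))

fig.suptitle('Figure Title')        # title for the whole figure space
ax1.set_title('Left Axis Title')    # title for the first axis
ax1.set_xlabel('X1-label')
ax1.set_ylabel('Y1-label')
ax2.set_title('Right Axis Title')
ax2.set_xlabel('X2-label')
ax2.set_ylabel('Y2-label')
```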
3.4.3 Pie Charts

Pie charts are ubiquitous in survey analysis and reports. They are commonly used with nominal and ordinal data to show shares. I show some typical examples in Table 3.5. There are, of course, an infinite number of possibilities. A Pandas DataFrame plot method can be used to produce pie charts as well as other types of charts. The method is chained to the DataFrame. Its arguments are the variable to plot and a named argument "kind = " that indicates the kind of plot to create. The argument is "kind = pie" in this case. I list the available plot kinds in Table 3.6. There are other plotting arguments for title, legend, and so forth. I illustrate in Fig. 3.12 how to create a basic pie chart for the public opinion question regarding likelihood to vote. There are two items to note in this code snippet. First, the plot command ax = df_x.plot( y = 'vote', kind = 'pie', legend = False ) is assigned to a variable ax. You do not have to previously create ax, or fig for that matter, because the plot method does this for you behind the scenes. It is, nonetheless, available to you. This variable holds all the plotting information based on what was specified in the call statement. This includes the axes themselves, tick marks, labels, scaling, and so forth, everything needed to create and manipulate the plot. For example, in the next line beginning with ax.set in the snippet, a title and

Table 3.5 These are just a few illustrative examples of types of questions that are candidates for pie charts

Question               Question type   Possible focus
Gender                 Nominal         Surround
Income                 Ordinal         Surround
Segment                Nominal         Surround
Brand purchased        Nominal         Core
Voting intention       Nominal         Core
Political party        Nominal         Surround
Number patients seen   Ordinal         Core
Satisfaction           Ordinal         Core

Table 3.6 This is a list of plot kinds in Pandas. An example of a call to plot is df.plot( y = 'var', kind = 'pie' ). Notice the use of the single quotes

Graph type                        Kind keyword
Line (default)                    line
Vertical bar                      bar
Horizontal bar                    barh
Histogram                         hist
Boxplot                           box
Kernel density estimation         kde
kde plot (same as "kde" above)    density
Area                              area
Pie                               pie
Scatter                           scatter
Hexbin                            hexbin
[Pie chart for Fig. 3.12: "Likelihood to Vote," with slices for 100% likely, Extremely likely, Somewhat likely, and Not very likely. Base: all respondents; n = 34,520. Question: What is your likelihood to vote in the next presidential election?]
Fig. 3.12 This is a basic pie chart based on the Pandas plot method. Notice the annotation describing the sample size and base. It is Best Practice to annotate the base as often as possible
label are set.7 The second item to note is the annotation at the bottom of the chart. This is based on the variable “base” at the top of the code snippet and a footer( ) function that I defined in the Best Practices section of the Jupyter notebook. For another example, I illustrate how to create an age-gender distribution report for the yogurt data in Fig. 3.13, and then in Fig. 3.14, I create two pie charts for this distribution. I provide an alternative way to produce the two pie charts in Fig. 3.15.
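A minimal sketch along the lines of Fig. 3.12, using made-up proportions for the vote-likelihood question:

```python
import pandas as pd

# Made-up distribution of the likelihood-to-vote responses
df_x = pd.DataFrame(
    {'vote': [0.42, 0.31, 0.19, 0.08]},
    index=['100% likely', 'Extremely likely', 'Somewhat likely', 'Not very likely'],
)

ax = df_x.plot(y='vote', kind='pie', legend=False)
ax.set(title='Likelihood to Vote', ylabel='')   # blank y-axis label, as in Fig. 3.12
# the book's footer function (Sect. 3.4.1) would add the base annotation here
```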
3.4.4 Bar Charts

Pie charts are frequently discouraged despite the fact that they are equally frequently used in survey analysis. The reason is the angle issue I mentioned above. Tufte (1983), a leading visual display expert, quipped that the only thing worse than a pie chart is several of them. This may be extreme, but there is some basis for it. In fact, there is a view that bar charts are superior because we, as humans, are more adept at seeing lengths than angles. A bar chart can be viewed as an unrolled pie chart. See Paczkowski (2016), Few (2007), and Cleveland (1994).
7 The label is set to a blank in this case.
Fig. 3.13 This is the age-gender distribution for the yogurt data
Bar charts have an added advantage over pie charts: You can display other information on a bar chart that may not be practical to display on a pie chart. For example, you can display error bars at the top of each bar, a symbol representing an average or median value, or horizontal or vertical lines indicating a standard of some kind. These could, of course, overwhelm the bars themselves, making the chart cluttered and, therefore, diminishing its value. Nonetheless, a judicious use of other symbols can add more insight than the bars alone. Consider the Surround Question about gender in the yogurt survey. A simple bar chart can be used to display the gender distribution. See Fig. 3.16 for the calculation of values and Fig. 3.17 for a bar chart. A side-by-side bar chart (SBS) of the joint distribution of two variables is a variation of the bar chart. I show one in Fig. 3.19 for the joint distribution of age and gender that I produced above. The SBS bar chart was produced by first stacking the data I created in Fig. 3.13. Stacking means that the rows of the DataFrame are transposed, one row at a time. I show how to do this in Fig. 3.18. The stacked data are then used to produce the SBS bar chart in Fig. 3.19.
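A minimal sketch of the stack-and-plot steps, using made-up percentages and Seaborn's barplot rather than the exact code in Figs. 3.18 and 3.19:

```python
import pandas as pd
import seaborn as sns

# Made-up age-by-gender percentage distribution in a wide layout
df_dist = pd.DataFrame(
    {'Female': [30.0, 45.0, 25.0], 'Male': [25.0, 40.0, 35.0]},
    index=['18 to 35', '36 to 50', '51 and over'],
)
df_dist.index.name = 'Age'

# Stack: transpose the rows into a long layout, one row per age-gender pair
stacked = df_dist.stack().reset_index()
stacked.columns = ['Age', 'Gender', 'Percent']

# Side-by-side bars: one group per age category, one bar per gender
ax = sns.barplot(data=stacked, x='Age', y='Percent', hue='Gender')
```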
Fig. 3.14 These two pie charts are based on the age-gender distribution in Fig. 3.13
A bar chart can be plotted with the bars vertical or horizontal. My preference is usually horizontal bars because the tick labels for the bars are more readable; they sometimes run into each other with vertical bars. A bar chart can be added to a DataFrame display as a style. Pandas defines several styles, one of which is bar. To define a bar style, chain the bar method to the style accessor. Its arguments are the variable(s), the bar alignment, and their color. The variables are defined in a list; if omitted, then all variables will have a separate bar chart. The alignment is "left," "zero," or "mid"; "left" is the default. I recommend using bars as a Best Practice to highlight distributions for quicker analysis. You will see bar charts attached to DataFrame displays throughout this book.
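As a sketch, the bar styling of a frequency column could look like this (made-up counts):

```python
import pandas as pd

# Made-up frequency summary
tbl = pd.DataFrame(
    {'Frequency': [12, 45, 33, 10]},
    index=['Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree'],
)

# Chain the bar method to the style accessor; align can be 'left', 'zero', or 'mid'
styled = tbl.style.bar(subset=['Frequency'], align='left', color='lightblue')
```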
3.4.5 Other Charts and Graphs

I stated above that pie and bar charts are the most common visual displays for analyzing (and reporting) survey data because they are easy to produce and present and somewhat easy to interpret. There are times when you have to examine the
[Two pie charts for Fig. 3.15: the age distributions for males and females, with slices for 18 to 35, 36 to 40, 41 to 50, 51 to 60, 61 to 67, and 68 years old and over. Base (males): all male consumers, n = 653; base (females): all female consumers, n = 1347. Question: What is your age?]
Fig. 3.15 This shows an alternative development of the two pie charts in Fig. 3.14
distribution of a variable to determine its shape, that is, skewness. Histograms and boxplots are very effective for this. In other instances, a bar chart can be overlaid on a crosstab to show both the tabulations as numbers and a visual display of the relative magnitudes of those numbers to show the joint distribution of two variables. Figure 3.13 is a good example. There are other visual displays, however, that are equally effective in showing the joint distribution of two nominal or ordinal variables: the mosaic chart and the heatmap. I will discuss both in the following two subsections.
Fig. 3.16 This shows the yogurt consumers’ gender distribution
Fig. 3.17 This shows the yogurt consumers’ gender bar chart based on the data in Fig. 3.16
Fig. 3.18 This is the code to stack data for a SBS bar chart. It uses the data created in Fig. 3.13
Fig. 3.19 This is the SBS bar chart for the age-gender distribution. It uses the stacked data created in Fig. 3.18
3.4.5.1 Histograms and Boxplots for Distributions
Histograms are typically introduced in the first week of a basic statistics course. You are taught how they are constructed and, most importantly, how to interpret them. Their construction is not important here aside from the programming statements you need. What is important is their shape. Some are symmetric around a center point, usually the mean and/or median. Some have one peak or "hump" so they are unimodal; others are bimodal or multimodal. The modality is important because if, say, a distribution is bimodal, then the data are the result of two underlying distributions: two different data-generating processes. This is a complex topic that is discussed in Paczkowski (2016, 2022). Suffice it to say that the modality is a signature for a complex underlying process. Another signature is the distribution's skewness. Skewness refers to the elongation of a tail, either left or right, of the distribution. This points to outliers but also to a bias in the descriptive statistics so easily and routinely calculated and presented. If a distribution is right skewed, so the tail is elongated to the right, then the sample mean is pulled to the "high" end and is overstated; skewed to the left has the opposite effect. I show how to generate a histogram using the Pandas plot method in Fig. 3.20. The boxplot is introduced in a basic statistics class as a tool to summarize the distribution using robust statistics. These are statistics that are more resistant to outliers. Percentiles are used, typically 25% (first quartile or Q1), 50% (the median), and 75% (third quartile or Q3). The Interquartile Range (IQR) is IQR = Q3 − Q1. The distance Q3 + 1.5 × IQR defines an upper fence or barrier, similarly for Q1 − 1.5 × IQR. Data points outside the fences are outliers. Whiskers are drawn from Q3 to the largest data point that is less than or equal to the upper fence, similarly for a whisker below Q1. I provide a graphic description of a boxplot in Fig. 3.21 and a boxplot in Fig. 3.22 for the same data in Fig. 3.20.
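A minimal sketch of both plots, using a small made-up right-skewed series:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up right-skewed values
x = pd.Series([1.0, 1.1, 1.2, 1.2, 1.3, 1.5, 1.6, 1.9, 2.4, 3.5])

# Histogram and boxplot side by side using the Pandas plot method
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
x.plot(kind='hist', bins=5, ax=ax1, title='Histogram')
x.plot(kind='box', ax=ax2, title='Boxplot')
```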
3.4.5.2 Mosaic Charts
A mosaic chart is a graph of a crosstab of two nominal or ordinal variables but without the numbers. The numbers are replaced by colored or shaded blocks, the size of the blocks in proportion to the numbers in the cells of the crosstab. This has an advantage over the crosstab itself since a table of numbers becomes challenging to interpret once the table becomes large. A simple 2 × 2 table is easy to interpret; there are, after all, only four cells to examine. A 3 × 3 table is more challenging with 9 cells, and a 4 × 4 table is even more challenging with 16 cells. How do you begin to "see" relationships in these larger tables when all you see is numbers? One way, of course, is to color the cells. The drawback to this solution is that the colors could blend together and, if the table is large, the absolute number of colors could
Fig. 3.20 This is an example of a histogram. Notice that this is right skewed
overwhelm your capacity to discern patterns.8 A graph is, of course, simpler for "seeing" relationships. This is what the mosaic chart does. I illustrate creating a mosaic chart in Fig. 3.23 with the mosaic function (available in statsmodels' graphics module). This uses the DataFrame with the two variables for the crosstab (i.e., "Age" and "Gender") and creates a crosstab behind the scenes. You could create the crosstab explicitly and then use it directly. I show the code snippet for this in Fig. 3.24. The mosaic chart can handle multiple dimensions, not just the two I showed so far. For example, you could display the age and gender distribution by segments. I illustrate this in Fig. 3.25. Notice that although it is possible to create this more complex display, it does not mean it is more informative. This particular one is
8 This is an especially acute problem for people who are color challenged.
[Diagram for Fig. 3.21: anatomy of a boxplot, marking the median, the mean, Q1, Q3, the IQR, the lower and upper fences at 1.5 × IQR below Q1 and above Q3, and whiskers running to the smallest and largest data points inside the fences]
Fig. 3.21 Definitions of parts of a boxplot. Based on code from https://texample.net/tikz/examples/box-and-whisker-plot/, last accessed January 13, 2021. Source: Paczkowski (2022). Permission granted by Springer
Fig. 3.22 This is an example of a boxplot. Notice that this is right skewed reflecting the histogram in Fig. 3.20
Fig. 3.23 This mosaic chart was produced using just the DataFrame; a crosstab was produced by the function. The axes and figure space were set using the fig, ax = plt.subplots(1, 1) statement, which set a 1 × 1 plotting space, so that the annotation could be added. The gap argument sets a small gap between the cells of the mosaic
Fig. 3.24 This mosaic chart was produced using a crosstab produced by the Pandas crosstab function. The chart itself is not shown since it is the same as in Fig. 3.23
now crowded with overlapping labels, making it more challenging to see the central messages and, thus, negating the chart’s purpose. My recommendation is to use two, but no more than three, variables.
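A minimal sketch of such a mosaic chart, using the mosaic function from statsmodels' graphics module and made-up data:

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# Made-up age-by-gender responses
df = pd.DataFrame({
    'Age':    ['18 to 35', '36 to 50', '18 to 35', '51 and over', '36 to 50', '18 to 35'],
    'Gender': ['Female', 'Male', 'Male', 'Female', 'Female', 'Female'],
})

# The function builds the crosstab behind the scenes; gap separates the tiles
fig, ax = plt.subplots(1, 1)
mosaic(df, ['Age', 'Gender'], ax=ax, gap=0.02)
```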
Fig. 3.25 This mosaic chart uses three variables: age, gender, and marketing segment
3.4.5.3 Heatmaps
Suppose you have a matrix such as a correlation matrix or a crosstab table. You can display it by color coding each of its cells. The colors, however, are more effective in displaying relationships in the matrix if different shades or intensities are used to show membership in a range of values in the matrix. Cells with values in a lower range are filled with a low color intensity, and cells with values in a higher range are filled with a high color intensity. You can then quickly identify cells with low and high values based on the color intensity. The intensity can be interpreted as
Fig. 3.26 This heatmap uses the age and gender data. The argument cmap = “YlGnBu” specifies the color palette for the color map
“heat”—low intensity, low heat; high intensity, high heat. A scale is needed to aid interpretation of the “heat,” so this scale is a thermometer or color map. You can specify the colors for the color map using a color palette. Some available palettes are: • • • •
YlGnBu: Variations on yellow/green/blue Blues: Variations on blues BuPu: Variations on blue/purple Greens: Variations on greens
Many other palettes are available. See the Seaborn documentation for more. I show a heatmap for the age-gender distribution for the yogurt survey in Fig. 3.26.
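As a sketch, a heatmap of a made-up age-by-gender crosstab using the YlGnBu palette:

```python
import pandas as pd
import seaborn as sns

# Made-up age-by-gender crosstab
xtab = pd.DataFrame(
    {'Female': [120, 90, 60], 'Male': [80, 100, 70]},
    index=['18 to 35', '36 to 50', '51 and over'],
)

# Heatmap with the YlGnBu palette; annot writes the cell counts on the map
ax = sns.heatmap(xtab, cmap='YlGnBu', annot=True, fmt='d')
```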
3.5 Weighted Summaries: Crosstabs and Descriptive Statistics

You can apply weights to your crosstabs and descriptive statistics. The first task is to verify that the sum of the weights equals the population total. You can do this using either the Pandas sum method or statsmodels' DescrStatsW function, which returns a number of weighted properties. I list these in Table 3.7. I show the application of the sum_weights property in Fig. 3.27 along with a "brute force" way to sum the weights for the yogurt consumers. The population size was determined to be 240,780,262 yogurt consumers based on internal studies and various syndicated industry reports examined by the client. You can see that the sum of the weights equals this population total, as it should if the weights are correctly calculated.
Table 3.7 These are the properties of statsmodels' DescrStatsW function. Source: https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.DescrStatsW.html. Last accessed July 1, 2020

Property      Calculation
demeaned      Data with weighted mean subtracted
mean          Weighted mean of data
nobs          Alias for number of observations/cases, equal to sum of weights
std           Standard deviation with default degrees of freedom correction
std_mean      Standard deviation of weighted mean
sum           Weighted sum of data
sum_weights   Sum of weights
sumsquares    Weighted sum of squares of demeaned data
var           Variance with default degrees of freedom correction
Fig. 3.27 The sum of the weights can be checked using either method shown here
Fig. 3.28 Weighted descriptive statistics are defined in a dictionary, calculated based on this dictionary, and then the results are put into a multi-indexed DataFrame
Fig. 3.29 Weighted crosstabs of the yogurt data
Once the weights are checked, they can be used to calculate weighted summary statistics using the DescrStatsW properties in Table 3.7. I show an example in Fig. 3.28 for calculating weighted statistics. You can also get weighted crosstabs. I show in Fig. 3.29 how to apply weights for the yogurt data.
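A sketch of both calculations, with made-up prices, categories, and weights in place of the yogurt data:

```python
import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

# Made-up prices, categories, and raking weights
df = pd.DataFrame({
    'Price':   [2.99, 3.49, 2.79, 3.59, 3.09],
    'Brand':   ['A', 'A', 'B', 'B', 'A'],
    'Segment': ['Value', 'Premium', 'Value', 'Premium', 'Value'],
    'wt':      [1.2, 0.8, 1.5, 1.0, 0.9],
})

# Weighted descriptive statistics; sum_weights should equal the population total
stats = DescrStatsW(df['Price'], weights=df['wt'])
total_weight = stats.sum_weights
wtd_mean, wtd_std = stats.mean, stats.std

# Weighted crosstab: sum the weights within each brand-segment cell
wtd_xtab = pd.crosstab(df.Brand, df.Segment, values=df.wt, aggfunc='sum')
```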
Chapter 4
Beginning Deep Survey Analysis
Contents
4.1 Hypothesis Testing
  4.1.1 Hypothesis Testing Background
  4.1.2 Examples of Hypotheses
  4.1.3 A Formal Framework for Statistical Tests
  4.1.4 A Less Formal Framework for Statistical Tests
  4.1.5 Types of Tests to Use
4.2 Quantitative Data: Tests of Means
  4.2.1 Test of One Mean
  4.2.2 Test of Two Means for Two Populations
  4.2.3 Test of More Than Two Means
4.3 Categorical Data: Tests of Proportions
  4.3.1 Single Proportions
  4.3.2 Comparing Proportions: Two Independent Populations
  4.3.3 Comparing Proportions: Paired Populations
  4.3.4 Comparing Multiple Proportions
4.4 Advanced Tabulations
4.5 Advanced Visualization
  4.5.1 Extended Visualizations
  4.5.2 Geographic Maps
  4.5.3 Dynamic Graphs
Appendix
I have divided the analysis of survey data into Shallow Analysis and Deep Analysis. The former just skims the surface of all the data collected from a survey, highlighting only the minimum of findings with the simplest analysis tools. These tools, useful and informative in their own right, are only the first that should be used, not the only ones. They help you dig out some findings but leave much buried. I covered them and their use in Python in the previous chapter. Much that could really be said about the Core Questions is often left buried inside the data. Digging out more insightful, actionable, and useful information, rather than leaving it buried and useless, requires a different and more complex tool set.
You can go deeper into your survey data by hypothesis testing and modeling of key relationships, in both cases looking for insight beyond the obvious, insight not revealed by Shallow Analysis. This is Deep Analysis. Shallow Analysis is simple. Deep Analysis is complex and challenging with more possibilities. And it is more time-consuming. In this chapter, I will focus on three advanced Deep Analysis methods:
1. Hypothesis testing
2. Advanced tabulation
3. Advanced visualization
I will go to a deeper level in Chap. 5 where I discuss modeling. If you wish, you could stop with this chapter and feel confident that you have gone deep into your data. The next chapter will raise your confidence even higher.
4.1 Hypothesis Testing

Hypothesis testing is, paradoxically, both a much neglected and an overused and abused aspect of survey analysis. The neglect is on par with weighting survey data to represent the true target population. As I remarked about Shallow Analysis, simple means and proportions are often reported as if they reflect true means and proportions of attitudes, opinions, and interests in the target population. This belief, however, is far from the truth. They may not statistically reflect the population. By "statistically reflect," I mean that the long-run mean of an estimate of a population parameter equals the actual population parameter after allowing for random error in the data due to sampling issues and unknown causal variations in respondents. The long-run mean is the expected value. In like manner, a difference between two means or two proportions or the difference between one of these and a hypothesized value is often not tested, suggesting that any calculated difference is true and correct. Both are wrong. The data may not, and probably do not, reflect the target population, and the difference may or may not reflect the population parameter differences. My focus is on statistically significant differences since I covered weighting in the previous chapter.

I have seen many survey reports that are devoid of any statistical testing. Why this is the case is a mystery, but I can hazard two guesses. The first is that the survey analyst does not sufficiently understand the concept, and therefore, it is easier to omit tests. The second is that they believe the reader of a survey report, the client, is either uninterested in significance tests or is also uncomfortable with them. In either case, results are more often left untested than "stat tested." My observation about the lack of testing is, of course, not universal. There is a flip side. Many survey analysts do statistically test hypotheses looking for differences between, say, two proportions or test if one proportion differs statistically from a
hypothesized value. The issue is the overuse—the abuse—of testing to the point that there is a “cult of significance.” This “cult” characterization is mostly an academic issue, concerned with significance testing in academic studies and reported in professional journals. But there is also a cultlike belief in significance testing in pragmatic studies such as market research surveys and political polling that focuses on testing every finding, every calculation, every possible difference for “significance.” This belief is the opposite extreme from doing no testing. See Ziliak and McCloskey (2008) for an extensive discussion about how statistical testing may create more problems than it solves. Those at the cultlike extreme believe that anything that is insignificant must be completely ignored; it is of no value for decisions. They fail to distinguish between statistical significance and practical significance. Practical significance is the use of the information about a difference in an actual decision. Suppose you show that a difference in means of 0.3 units (whatever the units might be) is statistically significant. The question that has to be answered is, “Does this difference result in any change in operations or strategies, or tactics, or plans?” In other words, is there any impact on business decisions or voting decisions or policy decisions from knowing a difference is statistically significant? If the answer is no, then do not do a test and certainly do not report any results. See Daniel (1977), Rosen and DeMaria (2012), and Ellis and Steyn (2003) for discussions of practical versus statistical significance. I provide a high-level overview of hypothesis testing in the next subsection. This is background for those cases where practical significance is acknowledged and testing is between the two extremes.
4.1.1 Hypothesis Testing Background

A hypothesis is a conjecture, something you believe is true but need to prove. For example, you might conjecture that the mean price for any product in your product category sold through local grocery stores is $1.25. This is a statement about all price points in the market for that product. If μ represents the population mean price, then μ = $1.25. Given the large number of grocery stores, it is too costly to collect price data on sales in all of them. You have to infer this mean price based on sample data collected through a survey of grocery stores or shoppers at those stores. The price point you hypothesize is for the population, the survey data are a sample, and the process of going from the sample to the population is inference; the logical direction is not from the population to the sample. The statistical hypothesis statement is

$$H_O: \mu = \$1.25 \quad (4.1)$$
$$H_A: \mu \neq \$1.25 \quad (4.2)$$
where HO is the conjecture you want to demonstrate. Note that these statements are in terms of population parameters, not sample statistics. A parameter is an unknown, constant numeric characteristic of the population, and a statistic is a quantity calculated from sample data. This is a common textbook description of these two fundamental concepts. See, for example, Moore and Notz (2014). The statement HO is called the Null Hypothesis because you can rewrite it as HO : μ − $1.25 = 0, which is a null statement. The statement HA is called the Alternative Hypothesis. It determines the type of test you will do: one-sided or two-sided. If you write it as HA : μ ≠ $1.25, then it suggests a two-sided test because either HA : μ < $1.25 or HA : μ > $1.25 is true, not both. If you write it instead as HA : μ < $1.25 or HA : μ > $1.25, then a one-sided test is suggested. In some instances, the Alternative Hypothesis is the conjecture you believe is true. For example, a pricing manager could argue that the mean price for all products is greater than $1.25 so that the company should increase its price. The hypotheses are written as HO : μ = $1.25 and HA : μ > $1.25. The objective is to show that HA is correct.

The type of test is important for the interpretation of the p-value and, therefore, the test itself. A p-value is the area of the tail of a probability distribution cut off by the test statistic. Some distributions, such as the normal distribution and Student's t-distribution, have two tails, so which one is relevant for a particular test? The Alternative Hypothesis determines which one. More formally, the p-value is the probability, assuming the Null Hypothesis is true, of a test statistic at least as extreme as the observed value. If the Alternative Hypothesis is one-sided (i.e., with a "greater than" or "less than" sign), then there is only one p-value possible for the appropriate tail. If the Alternative Hypothesis is two-sided, then the p-value is the sum of both tails. This may conceptually be simple enough, but the implementation is otherwise because software does not know the Alternative Hypothesis and, therefore, which tail to use for the p-value. Most software typically reports only one p-value. If a normal or Student's t-distribution is relevant for a test, the p-value is calculated for the value of the test statistic and then doubled to reflect the two tails. Some software reports three p-values: one for each tail and one for both through doubling. If only one p-value is reported, then it must be divided by two if your Alternative Hypothesis is one-tailed. This can certainly be confusing to, if not ignored by, most casual data analysts. The confusion, however, is simple to clear up using the following guidelines:1

• Two-tail: You have the required value.
• Right-tail only: Reported p-value divided by 2.
• Left-tail only: One minus the reported p-value divided by 2.

This is moot, of course, if your software reports all three.
1 Based on: https://stats.stackexchange.com/questions/267192/doubling-or-halving-p-values-forone-vs-two-tailed-tests/267197. Last accessed July 29, 2020.
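To make these guidelines concrete, here is a minimal sketch using scipy. The t-statistic of 2.10 and 24 degrees-of-freedom are hypothetical values for illustration, not results from the yogurt study.

```python
from scipy import stats

# A hypothetical t-statistic and its degrees-of-freedom
t_stat, dof = 2.10, 24

# The two-sided p-value most software reports by default
p_two_sided = 2 * stats.t.sf(abs(t_stat), dof)

# Right-tail alternative (e.g., H_A: mu > mu_0): reported p-value divided by 2
p_right = p_two_sided / 2

# Left-tail alternative (e.g., H_A: mu < mu_0): one minus the halved value
p_left = 1 - p_two_sided / 2

# The same tail areas computed directly from the t-distribution
print(round(p_right, 4), round(stats.t.sf(t_stat, dof), 4))   # these match
print(round(p_left, 4), round(stats.t.cdf(t_stat, dof), 4))   # these match
```

Note that this sketch assumes a positive test statistic; with a negative statistic, the two tail formulas swap.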
In most instances, your conjecture involves a comparison of one group versus another so that one of the groups is better, larger, higher, stronger, and so forth, than the other. For example, you may conjecture that more women vote for a democratic presidential candidate than men. If pW is the population proportion of women who vote democratic and pM is the corresponding proportion of men, then the statistical version of this conjecture is

$$H_O: p_W = p_M \quad (4.3)$$
$$H_A: p_W > p_M. \quad (4.4)$$
In this case, the Alternative Hypothesis is the conjecture you want to prove is true. The tactic is to show, based on sample survey data, that the Null must be rejected so that the important statement, the Alternative, is shown to be correct. In this respect, this testing procedure is a statistical version of the reductio ad absurdum logic argument in which a statement is proven true by showing that its counterpart or complement is false. The Alternative Hypothesis is true, therefore, if the evidence leads you to reject the Null Hypothesis.

If you reject the Null Hypothesis so a difference is inferred, then the next logical question concerns the practical significance of the difference, the concept I introduced above. From a practical perspective, can you take any meaningful action based on this information? Are all differences, no matter how small or trivial, so important that action must be taken on them? Probably not. The costs of acting on a trivial difference, no matter how significant, must be balanced against the returns from that action.2 Is the mean difference in market prices of practical importance? Is a gender difference in voting predilection practically important for the next election? These are value judgments outside the scope of statistical testing. Yet those researchers who do, in fact, conduct statistical testing will test all possible relationships, all possible pairs of variables or factors, looking for significant differences regardless of how large or small and regardless of their practical importance. This is the abuse of statistical testing, which actually leads to other statistical issues I will discuss below. My recommendation is to use statistical testing of meaningful conjectures, those that, if significant results are found, have practical applications. A small observed difference does not have to be tested for significance3 because even if statistically significant, it has no practical meaning.
2 As a personal anecdote, I once had a client who wanted to know if a difference of one cent in the prices of two products was significant—the products were selling for about $10 each.
3 To be "stat tested" as many like to say.
4.1.2 Examples of Hypotheses

The number and types of hypotheses in survey research are infinite and broad to say the least. Here is an extremely short list meant for illustrative purposes only.

Consumer Studies
• The average market price for an 8 oz. container of yogurt is $1.25.
• There is no difference between male and female consumers in their purchase of product X.

Public Opinion/Voting Behavior
• There is no effect of social media on voting behavior.
• There is no difference in the mean age of voters/consumers/veterans/etc. by racial/ethnic groups.

Medical/Pharmaceutical
• The mean number of prescriptions written for illness X is Y.
• The mean number of patients seen per week in a medical practice is X.
4.1.3 A Formal Framework for Statistical Tests

Cox (2020) described a general definition of statistical testing and recommended a framework for doing testing. Suppose you formulate a hypothesis, designated as HO, for a Core Question. This is the Null Hypothesis. Since null means "nothing," "not there," or "zero," the Null Hypothesis states that there is no difference between the true parameter in the population and a hypothesized value for it. The hypothesis could be that the population proportion of voters voting for a democratic candidate is p0, the true mean hours people watch a cable network news channel is μ0, or the true mean amount shoppers spend on a product category is μ0. The hypothesis is tested using sample data and a measure that is the parameter's sample counterpart. This might be a sample mean, sample proportion, or difference in means or proportions.

Let y be sample data and t(y) the value of a statistic, called the test statistic, that is the measure of a population counterpart. The test statistic might be in standardized form. For example, t(y) could be the sample mean less its hypothesized value with this difference relative to the mean's standard error. If $\bar{Y} = 1/n \times \sum Y_i$, then $t(y) = (\bar{Y} - \mu_0)/se_{\bar{Y}}$ with μ0 the hypothesized population mean and $se_{\bar{Y}}$ the sample mean's standard error. The test statistic has random variability because the data have random variability. It has a probability distribution. Common distributions are the normal, Student's t, χ², and F. They are related as I briefly describe in this chapter's Appendix.
Let Y be a random variable for the population measure. For instance, Y could be the target population mean or proportion. Let t(Y) be a random variable (since it is a function of a random variable) for the test statistic for the target population. Then t(y) is an observed value that equals t(Y) if the Null Hypothesis is true. In this case, there should be no difference between t(y) and t(Y), except for random variation in the data. That is, t(y) − t(Y) = 0 on average. Large values of t(y) indicate that HO is incorrect and must be rejected in favor of HA. More formally, following Cox (2020), you can calculate

$$p = Pr(t(Y) \geq t(y) \mid H_O) \quad (4.5)$$
where p is the p-value. The use of the p-value is controversial. As Cox (2020, p. 3) states, "The p-value is an indicator of the consistency with (HO) . . . not a measure of how large or important any departure may be, and obviously not a probability that (HO) is correct." A small value indicates inconsistency, but this could be incorrect because of luck of the draw of the sample. Generally, the p-value is compared to a standard defined prior to the statistical test. The standard indicates the acceptable probability of making a wrong decision (i.e., rejecting HO when it is true; this is referred to as a Type I Error) based on data that are themselves subject to random variation. The standard is usually designated as α. The hypothesis testing rule is to reject HO if p-value < α. A conventionally agreed upon standard with most survey work is α = 0.05, but there is nothing sacred about this value.4
4.1.4 A Less Formal Framework for Statistical Tests

There are many statistical tests available for different types of data and situations. Regardless of data and situation, they all share a common framework that involves:

1. Specifying the Null and Alternative Hypotheses
2. Specifying the standard significance level, α (usually α = 0.05)
3. Specifying and calculating a test statistic based on the survey data
4. Comparing the p-value for the calculated test statistic to the α level
5. Interpreting the results and drawing a conclusion
I illustrate these steps in Fig. 4.1 and will provide some examples of them in the following subsections.
4 I once did some survey analysis work for a large food manufacturing company (to remain nameless) that used α = 0.20.
Fig. 4.1 These are the five steps you should follow for hypothesis testing
4.1.5 Types of Tests to Use

There are numerous statistical tests, many of which are most likely of little use with survey data because this type of data is mainly categorical. Nonetheless, I divide tests into two groups: quantitative and categorical. You will use tests in the latter group most often. Those in the first group are used with ratio or interval data. For example, you could have a Surround or Core Question on the number of units purchased in the last shopping occasion, or the number of patients seen per week, or the number of times voted in the last 10 elections. These are quantitative questions. Tests in the second group are used with a Surround or Core Question that categorizes people by demographic characteristics (e.g., age, income, education), shopping behavior (e.g., where shop, frequency of shopping, items purchased), political party affiliation (e.g., Democrat, Republican, or Independent), and so on. I flowcharted testing conditions and test types in Fig. 4.2 to aid your selection of a test and to act as a guide for the discussion in the following subsections. The flowchart is not comprehensive because of the large number of tests available, but it is extensive enough to be useful for most situations.
Fig. 4.2 This chart will help you decide the test to use for different objectives and conditions
[Flowchart omitted. Its left branch covers quantitative data: test one mean (t-test for small samples, Z-test for large samples), compare two means (independent populations with pooled or separate standard errors, or paired populations), and compare multiple means (Tukey's HSD test). Its right branch covers categorical data: test one proportion (Z-test), compare two proportions (χ²-test), analyze contingency tables (χ² tests of independence and homogeneity), and compare multiple proportions (Marascuilo Procedure).]
4.2 Quantitative Data: Tests of Means

I will first discuss tests for quantitative data with a focus on means: one mean against a standard, a comparison of two means, and comparisons of more than two means. There are thus two possible conjectures that form the Null Hypothesis:

1. The population mean equals a specific value.
2. The means for two or more groups are equal.

This section is, therefore, divided into three parts, one for each type of comparison. This is in alignment with the left branch of Fig. 4.2, so each of the following subsections follows the main blocks of that branch.
4.2.1 Test of One Mean

A test requires a test statistic. There are two possibilities; your choice depends on your data. The test statistics are the z-statistic and the Student's t-statistic. The formulas are

$$z = \frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \quad (4.6)$$

and

$$t = \frac{\bar{Y} - \mu_0}{s/\sqrt{n}} \sim t_{n-1} \quad (4.7)$$

where μ0 is the hypothesized value, σ is the population standard deviation, and s is the sample standard deviation. The n − 1 for the t-distribution is the degrees-of-freedom that determine the shape of the distribution. See the Appendix for a brief comment. The sample standard deviation is the square root of the sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}. \quad (4.8)$$
If the sample size is large, as a rule of thumb greater than 30, then a z-test should be used; otherwise, use the t-test. The reason for the switch from a z-test to a t-test is the convergence of the two distributions as the sample size approaches 30. After n = 30, there is practically no difference between the t-distribution and the normal distribution, which is the base for the z-test. I show a comparison between the normal distribution and the t-distribution in Fig. 4.3.

Fig. 4.3 This is a comparison between the standard normal distribution and the Student's t-distribution with two different degrees-of-freedom. The t-distribution with 30 degrees-of-freedom is almost identical to the standard normal curve

Consider the yogurt data for the first conjecture that the population mean price equals a specific value. The Null Hypothesis is that the mean market price of a yogurt container is $1.25. Whether it is higher or lower than this amount is unimportant, so this is a two-tailed test expressed as

$$H_O: \mu = \$1.25 \quad (4.9)$$
$$H_A: \mu \neq \$1.25. \quad (4.10)$$
You could use an unweighted or weighted sample depending on availability of weights. I drew a random sample of 25 observations to illustrate the t-test and show an unweighted and weighted test in Figs. 4.4 and 4.5, respectively. I illustrate in Figs. 4.6 and 4.7 how to conduct a z-test with unweighted and weighted data, respectively, using the full yogurt sample. There are, therefore, four cases:

1. Unweighted t-test
2. Weighted t-test
3. Unweighted z-test
4. Weighted z-test
Fig. 4.4 This illustrates how to conduct an unweighted t-test of a continuous variable
Fig. 4.5 This illustrates how to conduct a weighted t-test of a continuous variable
Fig. 4.6 This illustrates how to conduct an unweighted z-test of a continuous variable

Fig. 4.7 This illustrates how to conduct a weighted z-test of a continuous variable

Notice, first, that the degrees-of-freedom for the t-test is n − 1, so with n = 25, the degrees-of-freedom are 25 − 1 = 24 < 30. Also notice that the p-values for three of the four cases are less than α = 0.05, indicating the Null Hypothesis should be rejected. The fourth case is marginal. A t-test or z-test is appropriate since price is a continuous quantity.

I used the DescrStatsW function in the statsmodels package to do the four tests. This function requires instantiation before it can be used. Basically, you instantiate a class containing a function by specifying the class parameters and saving a version (i.e., an instance of it) in a variable. DescrStatsW has two parameters: One is the variable for the test, and the other is the weights. Weights are optional, but the variable is not. In my four examples, I saved the instantiated function in the variable calc. Nothing happened at this point; you merely have a copy, an instance, of the function saved in the variable, ready to be used. To use it, you have to call that variable and chain to it the test you want to do as I show in my examples. I use the ttest_mean or ztest_mean functions, each of which has its own parameter, the
hypothesized mean for the Null Hypothesis. The hypothesized mean is saved as the variable hm, so this is passed as an argument to the functions. Each function returns a tuple that contains the relevant test statistic value and its p-value, in that order, for the z-test and these two plus the degrees-of-freedom for the t-test. The first is a 2-tuple and the second is a 3-tuple. A tuple is a Python quantity (called a container) written with its elements bracketed by parentheses, such as (x, y) or (2, 6, 8). These are “packed.” If you use the setup I show in my examples, the tuples are “unpacked” with the elements given names so you can easily access them.
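A minimal sketch of this setup follows. The prices and weights are simulated stand-ins for the yogurt sample; the variable names hm and calc mirror the ones used in the figures.

```python
import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

# Simulated prices and weights standing in for the yogurt sample
rng = np.random.default_rng(42)
prices = rng.normal(loc=1.30, scale=0.15, size=25)
weights = rng.uniform(0.5, 1.5, size=25)

hm = 1.25  # hypothesized mean under the Null Hypothesis

# Unweighted t-test: instantiate DescrStatsW, then chain the test to it
calc = DescrStatsW(prices)
t_stat, p_value, dof = calc.ttest_mean(hm)     # returns a 3-tuple

# Weighted z-test: pass the weights when instantiating
calc_w = DescrStatsW(prices, weights=weights)
z_stat, p_value_z = calc_w.ztest_mean(hm)      # returns a 2-tuple

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, df = {dof:.0f}")
print(f"z = {z_stat:.3f}, p = {p_value_z:.4f}")
```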
4.2.2 Test of Two Means for Two Populations

Let me now discuss comparing means for two populations. I will use the words "population" and "groups" interchangeably to refer to the same concept. There are two possibilities for these populations: They could be independent of each other or dependent on each other. An example of the former is a population of buyers of a product and a population of non-buyers, a population that voted for a republican candidate and one that voted for a democratic candidate, or those who got a COVID-19 vaccination and those who did not. An example of a dependent case is a survey in which respondents are asked a question (e.g., would you vote for candidate X) and then are told something about the candidate's position on an issue they did not previously know, after which they are asked again if they would vote for X. As another example, someone who received two COVID vaccinations might be asked, after the first one, "How do you feel?" and then asked the same question after the second vaccination. These are before/after questions using the same respondent to elicit two responses to the same question. Other surveys may ask a husband a question and then ask his wife the same question. Because they are married, they may (certainly not always) respond similarly. The same holds for siblings, parents and their children, coworkers, neighbors, and so forth. Notice that these are all examples of paired objects: husbands/wives, parent/child, self/self, before/after. So the tests are paired samples tests. See Weiss (2005) for some examples.

Regardless of the case, you can calculate differences. It is just a matter of what differences. For the independence case, the differences are calculated at the level of the mean of each group. If Ȳ1 is the mean for the first group and Ȳ2 is the mean for the second group, then you simply calculate Ȳ1 − Ȳ2. However, if the groups are not independent, for example, for a before/after study, you simply calculate the difference between the first and second response for each respondent and then calculate the mean of these differences. That is, if $d_i = Y_{i1} - Y_{i2}$ is the difference between Y1 and Y2 for individual i, then the mean of the differences is $\bar{d} = 1/n \times \sum_{i=1}^{n} d_i$. A little algebra will show you that this mean is simply Ȳ1 − Ȳ2. While these differences are easy to calculate, the standard errors are another issue. I will consider the standard errors for each case separately. I will also discuss the relevant test statistic for each.
4.2.2.1 Standard Errors: Independent Populations
Independent populations imply no connection, no tie between them. Formally, the covariance between them is zero, so the correlation is zero. The correlation is the covariance divided by the product of the individual standard deviations. Consequently, there is no connection between the sample groups formed from the two populations. For example, the yogurt problem has two groups: buyers of the client's brand and buyers of other brands. The difference in the mean price paid is simply Ȳ1 − Ȳ2. This is part of the numerator for the test statistic. The numerator is always a function of the
means; in this case, it is the difference in the means. The other part of the numerator is the hypothesized value based on the Null Hypothesis. So the general form of the numerator is the sample value minus the hypothesized value. The denominator, however, is the standard error of this difference. The denominator is always a standard error. In this case, the standard error depends on knowledge of σ² for each population. This is easily handled using the unbiased sample variance. If you assume that the population variances are the same for the two groups, then you use a weighted average of the sample group variances, so you combine, or pool, the sample variances. The two sample variances have to be weighted by their respective sample sizes to correctly pool them. If $s_i^2$ is the sample variance for group i, i = 1, 2, and $n_i$ is the respective sample size, then the pooled sample standard deviation is

$$s_p = \sqrt{\frac{(n_1 - 1) \times s_1^2 + (n_2 - 1) \times s_2^2}{n_1 + n_2 - 2}} \quad (4.11)$$

The pooled standard deviation is the square root of the weighted average of the group variances with weights $(n_i - 1)/(n_1 + n_2 - 2)$, i = 1, 2. The degrees-of-freedom are based on summing the individual degrees-of-freedom: $(n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$. See Weiss (2005, p. 492). Also see Guenther (1964) for a discussion of (4.11), as well as this chapter's Appendix. The pooled t-statistic is

$$t = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{s_p \times \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}. \quad (4.12)$$

If the population variances are different, or you assume they are different, then use the sample variances without pooling (i.e., without weighting) so the t-statistic is

$$t = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}. \quad (4.13)$$
As an example, consider the average price paid for yogurt by two groups: buyers of the client's brand and non-buyers of the client's brand. In this case, the price point per se is unimportant. What is important is whether or not the mean price, whatever it is, is the same for these two groups. From a pricing strategy perspective, an insignificant difference in the prices (i.e., they are the same statistically) indicates that the yogurt market is highly competitive with little or no price competition.5 The hypotheses are:

$$H_O: \mu_{Buyers} = \mu_{non\text{-}Buyers} \quad (4.14)$$
$$H_A: \mu_{Buyers} \neq \mu_{non\text{-}Buyers}. \quad (4.15)$$

5 Economists refer to this as a perfectly competitive market. All firms in such a market are price takers, meaning they have no influence on the market price. Therefore, there is only one market price.

Fig. 4.8 This illustrates how to conduct an unweighted pooled t-test for comparing means of a continuous variable for two populations

Fig. 4.9 This illustrates how to conduct a weighted pooled t-test for comparing means of a continuous variable for two groups
I first selected a random sample of n = 25 as before for a simple t-test and then used the entire sample for both unweighted and weighted tests. I show the results in Figs. 4.8, 4.9, 4.10, and 4.11. Notice the vast difference in the p-values between the weighted and unweighted tests. The weighted are highly significant. For these tests, I used the CompareMeans module from the statsmodels package and instantiated it with the data for each population. I also used the from_data subfunction to tell CompareMeans to calculate the means from the two arguments. The instantiated function was saved as before. In this example, I did not have to pass any other arguments to it. It just had to be called with the appropriate test: ttest_ind or ztest_ind. Tuples are returned and unpacked as before.
Fig. 4.10 This illustrates how to conduct an unweighted z-test for comparing means of a continuous variable for two groups
Fig. 4.11 This illustrates how to conduct a weighted z-test of a continuous variable
There are several conditions that should be checked before the above tests can be applied in practice. A major one is the constancy of the variance of the data, or homoscedasticity. I did not explore this issue. A reference is Weiss (2005).
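Here is a minimal sketch of the unweighted comparison. The group sizes and simulated prices are assumptions standing in for the yogurt data.

```python
import numpy as np
from statsmodels.stats.weightstats import CompareMeans

# Simulated prices for two independent groups: buyers and non-buyers
rng = np.random.default_rng(42)
buyers = rng.normal(loc=1.35, scale=0.20, size=200)
non_buyers = rng.normal(loc=1.28, scale=0.25, size=250)

# Instantiate CompareMeans from the two data arrays
calc = CompareMeans.from_data(buyers, non_buyers)

# The pooled t-test assumes equal population variances (Eq. 4.12);
# usevar="unequal" corresponds to the unpooled version (Eq. 4.13)
t_stat, p_value, dof = calc.ttest_ind(usevar="pooled")
z_stat, p_value_z = calc.ztest_ind()

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, df = {dof:.0f}")
print(f"z = {z_stat:.3f}, p = {p_value_z:.4f}")
```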
4.2.2.2 Standard Errors: Dependent Populations
The standard error for paired samples is actually simple. Recall that the differences are calculated for each observation. That is, $d_i = Y_{i1} - Y_{i2}$. So you have a batch of numbers that are the $d_i$. You merely calculate the standard deviation of these differences and divide it by the square root of n. The standard deviation is $s_d = \sqrt{\sum_{i=1}^{n} (d_i - \bar{d})^2/(n-1)}$ with $\bar{d} = 1/n \sum_{i=1}^{n} d_i$. Then the standard error is $s_d/\sqrt{n}$.

As an example, consider an automobile manufacturer that wants to make a minor design change to one of its models. Its market research staff decides to survey several dealers, asking each one the average number of cars it sells each month
(even though the manufacturer knows these numbers). The questionnaire has a paragraph that describes the design change and then asks the dealers how many they expect to sell because of this change. Since these are averages, decimal numbers are acceptable. This is a before/after study design with each dealer paired with itself. The hypotheses are

$$H_O: \mu_1 = \mu_0 \quad (4.16)$$
$$H_A: \mu_1 > \mu_0 \quad (4.17)$$
where μ0 is the mean units sold before the dealers are told about the design change and μ1 is the mean units expected to be sold after the dealers learn about the change. The manufacturer expects the change to increase sales, so the Alternative Hypothesis is one-tailed. The implication for evaluating the p-value of a test is that the reported value must be halved. I show the data and setup for a test in Fig. 4.12.

Fig. 4.12 This is a paired t-test example. Notice that the p-value is divided by 2 for the one-sided Alternative Hypothesis

I use the scipy module stats that has a function for a paired test: ttest_rel. This function has two parameters, each as an
array or list: the before and after data in that order. The function returns the t-statistic and its p-value. But since the function does not know your Alternative Hypothesis, the p-value is for both the upper and lower tail of the t-distribution, hence the halving for this problem. Based on this adjusted p-value, the Null Hypothesis is rejected. Since the t-statistic is positive (4.574), you can conclude that expected sales will be greater because of the minor design change.
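A minimal sketch of this paired test follows, with made-up dealer numbers rather than the data behind Fig. 4.12. The argument order determines the sign of the t-statistic, so I pass the expected "after" figures first so that a positive t corresponds to an increase.

```python
from scipy import stats

# Hypothetical dealer responses: average monthly units sold before hearing
# about the design change and units expected after hearing about it
before = [52.0, 47.5, 60.0, 38.5, 55.0, 42.0, 49.5, 58.0, 44.0, 51.5]
after  = [54.5, 49.0, 61.5, 40.0, 56.0, 44.5, 50.0, 60.5, 45.0, 53.0]

# Paired t-test on the respondent-level differences
t_stat, p_two_sided = stats.ttest_rel(after, before)

# The reported p-value is two-sided, so halve it for the one-sided
# Alternative Hypothesis H_A: mu_1 > mu_0
p_one_sided = p_two_sided / 2

print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.4f}")
```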
4.2.3 Test of More Than Two Means

I will shift to the VA survey for the next discussion. Recall that respondents were not explicitly asked their age, just the year they were born. Their age, however, is easily calculated by subtracting the year of birth from the year of the survey: 2010. A missing value analysis, which I show in Fig. 4.13, indicates that 206 respondents omitted their year of birth so their age could not be calculated. Also, there are 153 vets whose service branch could not be determined. These records were dropped for the subsequent analysis. The net sample size is n = 8362. This report is generated with a function I wrote using the sidetable package.

Fig. 4.13 This is a missing value analysis for the age variable

I show the distribution of the age, net of the missing data, in Fig. 4.14.

Fig. 4.14 The age of the veterans responding to the VA survey was calculated. The distribution is somewhat normal with a slight left skewness

This distribution is somewhat normal although there may be a slight left skewness. I tested this using the skewtest function in the scipy.stats package. The test statistic is −18.3097. A symmetric distribution has a skewness value of zero. The direction of skewness for a nonzero test statistic is indicated by the sign: negative is left skewed,
and positive is right skewed. This test statistic value indicates left skewness; the p-value is 0.0000, indicating that this test statistic for age skewness is significant.

I calculated the mean age of the vets by their service branch and show this in Fig. 4.15. You can see that there is some variation in the mean age, as should be expected. It appears that the mean age of the Marine Corp. vets is less than the other branches, while the "Other Services" mean age is definitely higher.

Fig. 4.15 The mean age of the veterans by their service branch

You can test if the mean age varies by the service branch using an omnibus F-test from an analysis of variance (ANOVA) table. An ANOVA table gives you a framework for analyzing the total variance in a data set. It is constructed by estimating a linear regression model (OLS) and then using the estimated model object with the anova_lm function. I illustrate this in Fig. 4.16. At this point, it is not important for you to know or understand OLS. Just view it as a mechanism to do calculations for an ANOVA analysis. I will, however, discuss OLS in Chap. 5.

Fig. 4.16 The ANOVA table assists you in constructing the omnibus F-test for age of the veterans by their service branch. The table shows the decomposition of the total sum of squares and the F-statistic based on the component parts. Notice that F = 30.385172 = 6294.225452/207.147931

The ANOVA table components are based on the decomposition of the total sum of squares (SST), a measure of the total variation in a one-way table. The SST is calculated as $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$, the sum of the squared deviations from the
sample mean. This should look familiar. It is the numerator of the sample variance: $s^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2/(n-1)$. The n − 1 is the degrees-of-freedom and is necessary so that $E(s^2) = \sigma^2$. See this chapter's Appendix for this point. So the sample variance is really the average SST.

A one-way table has a single categorical variable and a quantitative variable. A categorical variable has levels that group the quantitative variable. In this example, the variable Service Branch is categorical with seven levels, and the quantitative variable is age. A table is one-way because of the one categorical variable that classifies the quantitative measure. You can contrast this with a two-way table that has two categorical variables and one quantitative variable. Higher-level tables are certainly possible so an ANOVA table can become quite complex, allowing extensive hypothesis testing. A one-way table is, of course, the simplest with only one hypothesis. A one-way table for the Service Branch looks like the one in Table 4.1.

Table 4.1 This is a typical one-way table layout. This example is for the military service branch problem that has seven branches. The categorical variable is the branch with seven levels. The levels are the columns and the vets' age is the measure. Each row is a vet respondent to the VA survey. The column means are indicated using "dot notation," which uses a dot to indicate what was summed or averaged. So X̄.1 indicates the mean is for column 1 over all vets. The overall mean is X̄.. The layout is similar to that in Guenther (1964)

                     Air Force  Army    Coast Guard  Marine Corp.  Multiple  Navy    Other
                     Age_11     Age_12  Age_13       Age_14        Age_15    Age_16  Age_17
                     Age_21     Age_22  Age_23       Age_24        Age_25    Age_26  Age_27
                     ...        ...     ...          ...           ...       ...     ...
                     Age_n1     Age_n2  Age_n3       Age_n4        Age_n5    Age_n6  Age_n7
Sample sizes         n_1        n_2     n_3          n_4           n_5       n_6     n_7
True means           μ.1        μ.2     μ.3          μ.4           μ.5       μ.6     μ.7
Estimated means      X̄.1        X̄.2     X̄.3          X̄.4           X̄.5       X̄.6     X̄.7
True variances       σ².1       σ².2    σ².3         σ².4          σ².5      σ².6    σ².7
Estimated variances  s².1       s².2    s².3         s².4          s².5      s².6    s².7

The question is, "Does the measure vary from one level of the categorical variable to the next beyond random noise?" The "random noise" part is important because all data have a random component that may mislead you to believe that the variation is real when it may be just this random noise. The Null Hypothesis is that there is no difference among the level (i.e., group) means disregarding the random noise. The Alternative Hypothesis is that there is at least one difference. These two hypotheses are expressed as

$$H_O: \mu_{.1} = \mu_{.2} = \ldots = \mu_{.L} = \mu \quad (4.18)$$
$$H_A: \exists 1\, \mu_{.i} \neq \mu \quad (4.19)$$
where μ.i is the mean for level or group i, i = 1, 2, . . . , L, and μ is an overall mean and the symbol “∃1” means “at least 1.” Clearly, if μ.1 = . . . = μ.L , then
there is only one mean that is μ. An ANOVA is an approach to test this conjecture. I use dot notation to indicate which subscript is operated on either by summation or averaging. So μ.j is the average of all n observations for level j. To understand ANOVA, consider an observation Yij on a quantitative measure for respondent i for level j of the categorical variable. For our problem, the measure is age and the levels are service branches. You can write Yij = μ + (μ.j − μ) + (Yij − μ.j). Notice that two terms cancel, so this is equivalent to Yij = Yij, a mere tautology. You may think nothing is accomplished by this simple operation, but a great deal actually is gained. In this expression, μ is the overall mean such that

$$\mu = \frac{1}{N} \sum_{j=1}^{L} n_j \times \mu_{.j} \quad (4.20)$$
$$= \frac{n}{N} \sum_{j=1}^{L} \mu_{.j} \quad (4.21)$$

if $n_j = n, \forall j$, where ∀j means "for all j," N is the population size, and nj is the size of level j of the categorical variable with L levels. The sample size is the same for each categorical level, so the samples are balanced. This uses $\mu_{.j} = \frac{1}{n_j} \sum_{i=1}^{n_j} Y_{ij}$ as the level j population mean. The difference (μ.j − μ) is the deviation of the jth level's mean from the overall mean, μ. The difference (Yij − μ.j) is the deviation of the ith observation for level j from the level j mean.

Notice from (4.21) that $N/n \times \mu = \sum_{j=1}^{L} \mu_{.j}$. But $N = \sum_{j=1}^{L} n_j = \sum_{j=1}^{L} n = L \times n$ if $n_j = n, \forall j$. Then $(L \times n)/n \times \mu = L \times \mu = \sum_{j=1}^{L} \mu_{.j}$. Now define $\beta_j = \mu_{.j} - \mu$. You now have

$$\sum_{j=1}^{L} \beta_j = \sum_{j=1}^{L} (\mu_{.j} - \mu) \quad (4.22)$$
$$= \sum_{j=1}^{L} \mu_{.j} - L \times \mu \quad (4.23)$$
$$= L \times \mu - L \times \mu \quad (4.24)$$
$$= 0 \quad (4.25)$$

so the sum of the deviations from the overall mean is zero. This is actually consistent with a standard result involving deviations from a mean. In general, the sum of deviations from a mean is always zero.6
6 It is easy to show that for $\bar{X} = 1/n \sum X_i$, $\sum (X_i - \bar{X}) = 0$.
Table 4.2 These are two examples of effect coding. In Panel (a), there are two levels: L = 2. In Panel (b), there are three levels: L = 3. In both panels, the base is L

(a)                          (b)
Observation  Level  X        Observation  Level  XM   XH
1            L      −1       1            L      −1   −1
2            L      −1       2            M       1    0
3            H       1       3            H       0    1
4            L      −1       4            L      −1   −1
5            H       1       5            H       0    1
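The coding in Table 4.2 can be generated directly with pandas. Here is a minimal sketch for Panel (b), assuming a hypothetical three-level factor with L as the base level; the column names X_M and X_H are my own labels.

```python
import pandas as pd

# Hypothetical observations matching Panel (b) of Table 4.2
df = pd.DataFrame({"level": ["L", "M", "H", "L", "H"]})

# Effects coding: the examined level gets 1, the base level (L) gets -1,
# and every other level gets 0; only L - 1 = 2 coded columns are needed
base = "L"
for lvl in ["M", "H"]:
    df[f"X_{lvl}"] = df["level"].apply(
        lambda x: 1 if x == lvl else (-1 if x == base else 0)
    )

print(df)
# level: L, M, H, L, H
# X_M:  -1, 1, 0, -1, 0
# X_H:  -1, 0, 1, -1, 1
```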
In addition to defining $\beta_j$, you can also define $\epsilon_{ij} = Y_{ij} - \mu_{.j}$. If $Y_{ij} \sim N(\mu_{.j}, \sigma^2)$, then $\epsilon_{ij} \sim N(0, \sigma^2)$ by the Reproductive Property of Normals and the facts that $E(Y_{ij} - \mu_{.j}) = E(Y_{ij}) - \mu_{.j} = 0$ and $V(Y_{ij} - \mu_{.j}) = \sigma^2$. The Reproductive Property of Normals says that a linear function of a normally distributed random variable is also normal. See Paczkowski (2020) and Dudewicz and Mishra (1988).

You can now write $Y_{ij} = \mu + \beta_j + \epsilon_{ij}$, $\sum_{j=1}^{L} \beta_j = 0$, $\epsilon_{ij} \sim N(0, \sigma^2)$. You can expand this slightly by noting that $Y_{ij} = \mu + \beta_j + \epsilon_{ij}$ is written as $Y_{ij} = \mu + \beta_j \times X_{ij} + \epsilon_{ij}$ with

$$X_{ij} = \begin{cases} 1 & \text{if observation } i \text{ is in the } j\text{th level} \\ -1 & \text{if observation } i \text{ is in the base level.} \end{cases}$$

This coding for $X_{ij}$ is called effects coding, and $X_{ij}$ is an effects variable. If L = 2, the base level is the first level in alphanumeric order.7 Since $X_{ij}$ is 1 or −1, there is no need to show it; it is implicit. Nonetheless, it is there. See Paczkowski (2018) for a thorough discussion of effects coding. With the inclusion of $X_{ij}$, however, it is clear that $Y_{ij} = \mu + \beta_j \times X_{ij} + \epsilon_{ij}$ is a regression model, which I discuss in more detail in Chap. 5. If L > 2, the −1 still codes the base level, 1 codes the level you are examining, and all other levels are coded as 0 (i.e., they are not relevant). Incidentally, there is a restriction on the number of effects coded variables you can create and use: it is one less than the number of levels, L − 1. This is done to avoid a numeric problem with the OLS estimation that is the foundation of ANOVA calculations. This problem is called multicollinearity. This is beyond the scope of this book. See Paczkowski (2022) for a discussion in the Business Data Analytics context. Also, see Gujarati (2003) and Greene (2003). I show an example of effects coding in Table 4.2.

In a one-way table, the quantitative measure is a dependent variable, and the effects coded categorical variable is an independent variable. A model relating the two is

$$Y_{ij} = \mu + \beta_j \times X_{ij} + \epsilon_{ij} \quad (4.26)$$

7 It can actually be any level. As you will see, however, the first level is dropped by statsmodels.
where $Y_{ij}$ is the value of the measure for observation i for group j, μ is the overall mean, $\beta_j$ is the effect of group j, and $\epsilon_{ij}$ is a noise disturbance term. For the Null Hypothesis in (4.18), this means $\beta_j = 0, \forall j$. The hypotheses are

$$H_O: \beta_j = 0, \forall j \quad (4.27)$$
$$H_A: \exists 1\, \beta_j \neq 0. \quad (4.28)$$

You need the best, unbiased estimates of μ and $\mu_{.j}$ to estimate this model since $\beta_j = \mu_{.j} - \mu$. Unbiased means that on average, you get the correct answer. These estimates are given by $\bar{Y}_{..}$ for μ and $\bar{Y}_{.j}$ for $\mu_{.j}$. Using these values, you have $Y_{ij} = \bar{Y}_{..} + (\bar{Y}_{.j} - \bar{Y}_{..}) + (Y_{ij} - \bar{Y}_{.j})$, which is similar to what I wrote above. This can be rewritten as $Y_{ij} - \bar{Y}_{..} = (\bar{Y}_{.j} - \bar{Y}_{..}) + (Y_{ij} - \bar{Y}_{.j})$. Squaring both sides and summing terms8 gives

$$\underbrace{\sum_{j=1}^{L}\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_{..})^2}_{SST} = \underbrace{\sum_{j=1}^{L}\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_{.j})^2}_{SSW} + \underbrace{\sum_{j=1}^{L} n_j \times (\bar{Y}_{.j} - \bar{Y}_{..})^2}_{SSA} \quad (4.29)$$

8 The cross-product term cancels after summing terms.

SST is the overall or total variation in the data as I stated above. It is the total variation around the overall mean of the 1-D table. SSW is the within-group variation around the group mean; it is the within-group sum of squares. The SSA is the variation of each group's mean around the population mean showing how the group means vary around this mean. It is called the among-group sum of squares. This is an important identity in analysis of variance studies: SST = SSW + SSA.

The SSW is the weighted average of the column variances with weights equal to the column degrees-of-freedom, which is $n_j - 1$. From basic statistics, $s_{.j}^2 = \sum_i (Y_{ij} - \bar{Y}_{.j})^2/(n_{.j} - 1)$, so $(n_{.j} - 1) \times s_{.j}^2 = \sum_i (Y_{ij} - \bar{Y}_{.j})^2$. You can show that

$$E\left(\frac{SSW}{N - L}\right) = \sigma^2.$$

The SSA is the variability of the group means. If these group means are the same, then $\bar{Y}_{.1} = \bar{Y}_{.2} = \ldots = \bar{Y}_{.L} = \bar{Y}_{..}$ and SSA = 0. You can also show that

$$E\left(\frac{SSA}{L - 1}\right) = \sigma^2 + \frac{\sum_{j=1}^{L} n_j \times (\mu_j - \mu)^2}{L - 1}.$$

You can now define an F-statistic for testing the Null Hypothesis:

$$F_{C, L-1, N-L} = \frac{SSA/(L-1)}{SSW/(N-L)} \sim F_{1-\alpha, L-1, N-L} \quad (4.30)$$

if the Null Hypothesis is true. The "C" in the left-hand side subscript indicates that this is calculated from sample data. The term on the right-hand side of the ∼ is the theoretical value from the F-distribution at the 1 − α level with L − 1 and N − L degrees-of-freedom. In particular, notice that

$$E(F_{C, L-1, N-L}) = \frac{E\left(SSA/(L-1)\right)}{E\left(SSW/(N-L)\right)} \quad (4.31)$$
$$= \frac{\sigma^2 + \frac{\sum_{j=1}^{L} n_j \times (\mu_j - \mu)^2}{L-1}}{\sigma^2} \quad (4.32)$$

using the results I stated above. The numerator on the right-hand side is the estimate of the expected value of SSA and is called the Mean Square Among Groups (MSA). The denominator is the estimate of the expected value of SSW and is called the Mean Square Within Groups (MSW). It is also called the Mean Square Error (MSE). The F-statistic is then

$$F_{C, L-1, N-L} = \frac{MSA}{MSW} \sim F_{1-\alpha, L-1, N-L} \quad (4.33)$$

If the means are equal, SSA = 0 and the F-statistic equals 1. If the means are not equal, SSA > 0 and the F-statistic exceeds 1. The F-statistic's p-value tells you how much greater than 1 is significant and, therefore, if the Null Hypothesis should be rejected or not rejected.

What happened to $\beta_j$ and $\epsilon_{ij}$? Using their definitions, you can see that SSA and SSW can be written in terms of $\beta_j$ and $\epsilon_{ij}$. Since the expression containing $\beta_j$ and $\epsilon_{ij}$ is an OLS model, you use OLS to estimate $\beta_j$ and $\epsilon_{ij}$ and then use the estimates to calculate the required sum of squares. I will discuss OLS in Chap. 5. The ANOVA table summarizes the sum of squares, the degrees-of-freedom, the mean squares, and the calculated F-statistic. I show the general structure in Table 4.3.

Table 4.3 This is the general structure of an ANOVA table. The degrees-of-freedom (df) and the sums of squares are additive: (L − 1) + (N − L) = N − 1 and SST = SSA + SSW

Source of variation   df      Sum of squares   Mean squares          F_C
Among groups          L − 1   SSA              MSA = SSA/(L − 1)     MSA/MSW
Within groups         N − L   SSW              MSW = SSW/(N − L)
Total                 N − 1   SST              MST = SST/(N − 1)

What does this have to do with the question of the equality of mean ages across the military service branches? A simple OLS model can be estimated with the age of each respondent as the dependent variable and their military branch as the
independent variable. Since the branch is categorical, effects coding is used. I show the ANOVA table, constructed from the regression results, in Fig. 4.16. You can see that the F-statistic is 30.39 and its p-value is 0.0000. Since the Null Hypothesis states that there is no effect of the military branches, that is, that the vet's mean age does not statistically vary by service branch, this hypothesis must be rejected at any α level. Evidence supports the conjecture that age varies by military branch, so calculating a simple summary mean age would not provide a reasonable view of vet ages.

To briefly recap, the F-test from the ANOVA tests if the means of the ages are the same across the branches. This is the Null Hypothesis. Having rejected this, you need to further check the source of the difference. Clearly, if you do not reject the Null Hypothesis that the means are the same, then you do not have to go any further. Knowing that the means are different, however, is tantamount to knowing that the branches have an effect on the age.9 Otherwise, it does not matter which branch you study; the mean age is the same. Which branch has the different mean age?

The almost instinctive analysis plan you might develop to find where the difference lies is to test the difference in means for each pair of branches looking for a significant difference in at least one pair. These are the t-tests for two independent populations I discussed above. For this problem, there are $\binom{7}{2} = 21$ pairs.10 You might reason that you should expect 5% of these to be significant, on average, with a probability of rejecting the Null Hypothesis at 5%. Then you should expect at least one significant pair (= 21 × 0.05). Is this correct?

To check this reasoning, consider a modified version of the Null Hypothesis, one that contains only three means: $H_O: \mu_A = \mu_N = \mu_M$ for the Army, Navy, and Marine Corps, respectively. I will use this simplification strictly for pedagogical purposes. As observed by Paczkowski (2016), this modified Null Hypothesis is tantamount to three separate Null Hypotheses reflecting the three pairs from $\binom{3}{2}$:
(4.34)
HO2 : μA = μM
(4.35)
HO3 : μN = μM
(4.36)
You reject the modified Null Hypothesis if you reject any of these. Specifically, if you reject HO1 , then HO is rejected; if you reject HO2 , then HO is rejected; finally, if you reject HO3 , then HO is rejected. You have to reject at least one to reject HO . Clearly, you do not reject HO if you do not reject all three. The
9 Of course, the military branch does not determine your age. But the age distribution varies by branch is the main point. 7! 10 7 = = 21. 2 2! × 5!
140
4 Beginning Deep Survey Analysis
probability of rejecting at least one is, based on elementary probability theory, Pr(Rejecting At Least One) = 1 − Pr(Not Rejecting All). The probability of not rejecting all three Null Hypotheses is the probability of not rejecting the first hypothesis and not rejecting the second hypothesis and not rejecting the third hypothesis. By elementary probability theory again, Pr(Not Rejecting All) = Pr(Not Rejecting HO1) × Pr(Not Rejecting HO2) × Pr(Not Rejecting HO3). If α = 0.05 is the probability of rejecting a Null Hypothesis, that is, it is the probability of making a mistake and incorrectly rejecting a Null, then 1 − α = 0.95 is the probability of making a correct decision, which is not rejecting a Null Hypothesis. Therefore, the probability of not rejecting a modified Null Hypothesis is 1 − α, and so Pr(Not Rejecting All) is $(1 - \alpha)^3$ and Pr(Rejecting At Least One) = $1 - (1 - \alpha)^3$. For α = 0.05, this evaluates to 0.143. This is the probability of incorrectly rejecting at least one of the modified Null Hypotheses and, so, incorrectly rejecting the original Null Hypothesis. This probability is almost three times the probability you set for making a mistake, which is 0.05. Your chances of making the wrong decision for the modified Null Hypothesis are much higher than what you might plan. In general terms,

$$Pr(\text{Incorrect Decision}) = 1 - (1 - \alpha)^k \quad (4.37)$$
where $k = \binom{h}{2}$ is the number of pair-wise tests for h components of the full Null Hypothesis. This probability rises as a function of h. See Fig. 4.17. If k = 1 for only one test, then (4.37) simplifies to Pr(Incorrect Decision) = α.
Fig. 4.17 The probability of an incorrect decision rises exponentially as a function of the number of tests
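Equation (4.37) is easy to evaluate directly; here is a quick check for the seven-branch problem with α = 0.05.

```python
from math import comb

alpha = 0.05
h = 7                        # components of the full Null Hypothesis
k = comb(h, 2)               # 21 pair-wise tests
fwer = 1 - (1 - alpha) ** k  # family-wise error rate from Eq. 4.37

print(k, round(fwer, 3))     # 21 0.659
```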
The α is sometimes referred to as the comparison-wise error rate and $1 - (1 - \alpha)^k$ as the family-wise error rate (FWER). It should be clear that with a sufficient number of tests, your chances of finding a significant difference are very high; the error rate is inflated. For our problem, the full Null Hypothesis has h = 7 components and k = 21 tests. The family-wise error rate with 21 potential pair-wise tests is $0.659 (= 1 - 0.95^{21})$. The probability of making an incorrect decision is 0.659. The significance level is effectively inflated. In fact, at this inflated level, you can expect to have at least 14 (≈ 21 × 0.659) significant pairs, not 1.

How do you handle this inflation situation? The classic way is to use Tukey's Honestly Significant Difference (HSD) test to compare all pairs of options for a statistical difference but control the significance level to be your desired level, usually 0.05. See Paczkowski (2016) for a discussion of this test. I show results for Tukey's HSD test in Figs. 4.18 and 4.19. You can see in Fig. 4.18 that the family-wise error rate is constrained to 0.05, so the p-value for each pair-wise comparison is adjusted.

Fig. 4.18 This is a partial listing of the results from the Tukey HSD test comparing the mean age of vets by the branch of service. There are 21 comparisons. The column labeled "reject" indicates if the Null Hypothesis of no difference between groups should be rejected. If true, then you reject the Null for that group pair

Fig. 4.19 This graph shows 95% confidence intervals for the mean differences in Fig. 4.18. The vertical dashed line highlights the Marine Corp. group and clearly shows that for this group, the mean is significantly lower than the other groups

From Fig. 4.19, you can see that the mean age for the Marine Corp.
vets is statistically lower than for the other services. The heatmap of the p-values I show in Fig. 4.20 emphasizes the statistical significance of the difference between the Marine Corp. mean age and that of each of the other branches.

Fig. 4.20 This heatmap of the p-values from Tukey's HSD test emphasizes that the Marine Corp. mean age is statistically different from the other branches

The procedural implication of this discussion is that when you have more than two means to compare, you must:

1. Check the ANOVA result.
2. If you reject the Null Hypothesis, check Tukey's HSD test to identify where the difference is.

This is clearly more involved than if you have one or two means. I sketch this two-step workflow in code below.
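Here is a minimal sketch of that workflow. The simulated age and branch data merely stand in for the VA survey, and C(branch, Sum) requests effects coding in the statsmodels formula.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated stand-in for the VA data: age by service branch
rng = np.random.default_rng(0)
branches = ["Air Force", "Army", "Coast Guard", "Marine Corp.",
            "Multiple", "Navy", "Other"]
df = pd.DataFrame({"branch": rng.choice(branches, size=1000)})
df["age"] = rng.normal(60, 12, size=len(df)) - 5 * (df["branch"] == "Marine Corp.")

# Step 1: omnibus F-test via an OLS model with effects (Sum) coding
model = smf.ols("age ~ C(branch, Sum)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# Step 2: if the Null is rejected, locate the differences with Tukey's HSD,
# which holds the family-wise error rate at alpha
tukey = pairwise_tukeyhsd(endog=df["age"], groups=df["branch"], alpha=0.05)
print(tukey.summary())
```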
4.3 Categorical Data: Tests of Proportions

I will discuss tests for proportions in this section. These extend what I covered in the previous section but with some obvious twists for the nature of proportions. Regarding the testing tree in Fig. 4.2, I will now move down the right branch.

You can test proportions for either one or two samples using the z-test or, for a two-sided alternative, the chi-square test. The two tests will give the same answer,
that is, the same p-values, since the chi-square is the square of a standardized normal random variable. The squaring of the standardized normal has no impact on the p-values. Refer to the Appendix for this last point.
4.3.1 Single Proportions

The VA survey has a question that asked respondents if they ever served in combat or a war zone.11 You might expect that everyone in the military would serve or be deployed to a war zone or experience combat at least once in their career. There are many specialists in the military, however, who never see combat. For example, those responsible for logistics are less likely to see combat or be directly in a war zone although they might be in a combat area. The logistics personnel are referred to as the "tail" of a "beast" that is the military. Those who see combat or are in a combat zone are the "teeth." There is a ratio of those in the teeth to the tail: the tooth-to-tail ratio (T3R). See McGrath (2007) for details as well as the brief article "Tooth-to-tail ratio" in Wikipedia.12 You might hypothesize, as an example, that two-thirds of the military are in a support function and one-third in a combat role.
11 QA7: “Did you ever serve in a combat or war zone?” There is a clarifying statement: “Persons serving in a combat or war zone usually receive combat zone tax exclusion, imminent danger pay, or hostile fire pay.” 12 Source: https://en.wikipedia.org/wiki/Tooth-to-tail_ratio. Last accessed September 24, 2020.
Therefore, T3R = 0.33/0.67 = 0.49, or almost one-half. You just have to test one of these proportions. The Null and Alternative Hypotheses are:

$$H_O: p_0 = 0.33$$
$$H_A: p_0 \neq 0.33$$

where $p_0$ is the hypothesized proportion in the combat section of the military (i.e., the "teeth"). The proportion in the "tail" is $1 - p_0 = 0.67$. You can use the data from the VA survey question to test this conjecture. This is a one-sample test. The test statistic is

$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 \times (1 - p_0)}{n}}} \quad (4.38)$$
where p̂ is the sample proportion and Z ∼ N(0, 1). A chi-square test statistic is just $\chi_1^2 = Z^2$. The subscript "1" for the χ² is the number of degrees-of-freedom. This is a function of the number of Z random variables involved; in this case, it is only one.

A missing value report in Fig. 4.21 for this question shows 192 missing values. These have to be deleted before a test is done. I show results for both the z-test and χ² test, net of the missing values, in Fig. 4.22. I used the proportions_ztest and proportions_chisquare functions in the statsmodels stats module.

Fig. 4.21 This is a missing value report for QA7

Fig. 4.22 These are the statistical test results for QA7 using the hypothesis statements in 4.3.1. Both tests suggest rejecting the Null Hypothesis that the support staff is two-thirds of the armed forces. The positive sign for the Z-statistic further suggests that the observed proportion is greater than two-thirds
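A minimal sketch of these two calls follows, with assumed counts in place of the actual QA7 tallies. Whether the z and χ² p-values match exactly depends on which variance estimate the z-test uses; I simply pass the defaults here.

```python
from statsmodels.stats.proportion import proportions_ztest, proportions_chisquare

# Assumed counts standing in for QA7: "yes" responses and non-missing respondents
count, nobs = 5900, 8362
p0 = 0.33  # hypothesized proportion under the Null Hypothesis

# One-sample z-test of a proportion (two-sided by default)
z_stat, p_value = proportions_ztest(count, nobs, value=p0)

# Chi-square version; the first two returned values are the statistic and p-value
res = proportions_chisquare(count, nobs, value=p0)
chi2_stat, p_value_chi2 = res[0], res[1]

print(f"z = {z_stat:.3f}, p = {p_value:.4g}")
print(f"chi2 = {chi2_stat:.3f}, p = {p_value_chi2:.4g}")
```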
4.3.2 Comparing Proportions: Two Independent Populations

You can compare two population proportions in a manner similar to how you compared two population means. As before, you have a case of two independent populations with proportions p_1 and p_2. The z-statistic is

Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{p_1 \times (1 - p_1)}{n_1} + \frac{p_2 \times (1 - p_2)}{n_2}}}.    (4.39)
Unfortunately, you generally do not know the population proportions for the standard errors, which is why you are doing a survey. In (4.38), you hypothesized p_0. Now, you do not have the hypothesized values for p_1 and p_2, just a hypothesized relationship. You can, however, estimate these proportions using sample information. Pool the data for the two samples representing the two populations to get

\hat{p}_p = \frac{Y_1 + Y_2}{n_1 + n_2},    (4.40)
Fig. 4.21 This is a missing value report for QA7
Fig. 4.22 These are the statistical test results for QA7 using the hypothesis statements in 4.3.1. Both tests suggest rejecting the Null Hypothesis that the support staff is two-thirds of the armed forces. The positive sign for the Z-statistic further suggests that the observed proportion is greater than two-thirds
substitute this for the population proportions, and use

Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\hat{p}_p \times (1 - \hat{p}_p) \times \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}.    (4.41)
You use the same z-test in statsmodels that I described above. The difference is that the counts are the number of successes in the respective samples. The function assumes that these counts are the number of successes in each independent sample. The total number of observations, of course, should be consistent with these counts of successes. The function operates the same way as the z-test for comparing means if the data are encoded as 0/1 since the mean of a 0/1 variable is just the proportion.
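A minimal sketch of the two-sample call, with hypothetical success counts and sample sizes standing in for real survey tallies, is:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts of successes and observations for two independent samples
successes = [420, 380]   # number of "Yes" responses in each sample
n_obs = [1000, 950]      # sample sizes

# Two-sample z-test of H_O: p1 = p2; statsmodels pools the proportions internally
z_stat, p_value = proportions_ztest(count=successes, nobs=n_obs)
print(f'z = {z_stat:.3f}, p-value = {p_value:.4f}')
```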
4.3.3 Comparing Proportions: Paired Populations

Suppose you have two categorical random variables, each at two levels and each for two paired populations, for example, buy/not buy a product by husbands and wives, like/not like a form of music by siblings, and so forth. From a conceptual viewpoint, this is just the discrete case of the means comparison I discussed above. You could create a two-way crosstab as I discussed earlier. The question here is whether the difference between the proportions in the table is statistically significant. For this, a χ² test of significance can be used. The classic test is the McNemar Test. This is for a 2 × 2 table of two nominal variables, so it is a test of a simple, basic crosstab. Suppose the table looks like the stylized one I provide in Table 4.4. There are two nominal variables: X and Y, each with just two levels, say "low" and "high." The cell and marginal counts are as shown. The research question is, "Are the corresponding marginal probabilities the same?" That is, does p_a + p_b = p_a + p_c and p_c + p_d = p_b + p_d? You can see that terms cancel so that the question reduces to whether or not p_b = p_c. The hypotheses are

H_O: p_b = p_c
H_A: p_b ≠ p_c

This is called marginal homogeneity. See Agresti (2002) for a discussion.

Table 4.4 This is a stylized crosstab for two nominal variables: X and Y. The sample size is a + b + c + d = n

            Y low    Y high    Total
X low       a        b         a + b
X high      c        d         c + d
Total       a + c    b + d     n
You can easily do a McNemar Test using the researchpy package. You would create a crosstab using the researchpy crosstab function with the argument test = "mcnemar". For example, if rpy is the researchpy alias, you would use rpy.crosstab( rowVar, columnVar, test = "mcnemar" ). This will return the crosstab along with a report of McNemar's chi-square, its p-value, and Cramer's phi statistic. Cramer's phi is the square root of the McNemar chi-square divided by the sample size: φ = √(χ²/n). This is a measure of association and varies between 0 and 1, so it is like a correlation coefficient. I provide an example application in Chap. 6 when I discuss Net Promoter Score analysis. You install researchpy using pip install researchpy or conda install -c researchpy researchpy.
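A sketch of the call, using hypothetical paired 0/1 variables for, say, husbands' and wives' purchase answers, might look like the following; the DataFrame and column names are assumptions.

```python
import researchpy as rpy

# Hypothetical paired responses: husbands' and wives' buy/not buy answers
table, results = rpy.crosstab(df['husband_buy'], df['wife_buy'], test='mcnemar')
print(table)     # the 2 x 2 crosstab
print(results)   # McNemar's chi-square, its p-value, and Cramer's phi
```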
4.3.4 Comparing Multiple Proportions

There is no ANOVA table for proportions because ANOVA requires a continuous dependent variable. Consider a typical survey question that asks respondents to select from multiple possible answers, but select all answers that apply to them. This is a Choose all that Apply (CATA) question, also referred to as "Select all that apply" or "Choose n of m responses (where n ≤ m)." The VA survey has a CATA question, QC1a: "What are the reasons you haven't applied for any VA disability benefits?" There are 13 CATA options. This is a follow-up to a previous question, QC1: "Have you ever applied for VA disability compensation benefits?" If they answered No, then they were asked QC1a. The complication for analyzing this CATA question is that you have to take into account the response to QC1 because it determines the base for the CATA question. It is useful to do an attribution of differences of the sample size for the CATA analysis that will follow. This attribution shows how the total sample of vets was allocated to yield the final frequencies that will be used. First, the total number of respondents is 8710, of which 1953 were ineligible to answer the CATA question because they answered Yes to QC1. The CATA questions, QC1a, address why they never applied, so only those who responded No to QC1 could answer them. The number eligible for the CATA questions is 6757. However, of this amount, 116 did not answer QC1 at all yet are "counted" in the total sample for the CATA questions when they should not be, since their eligibility is unknown. They need to be dropped, leaving 6641 respondents. Of this total "eligible" for the CATA questions, 4062 answered Yes to the first CATA option, "Don't have a service connected disability," 1902 answered No, and 677 did not provide an answer. A missing value report for QC1 is shown in Fig. 4.23. The code for another version is shown in Fig. 4.24 with the visual in Fig. 4.25. Table 4.5 summarizes this attribution of differences. A tabular display of the 13 CATA options in Fig. 4.27 shows, for each option, the frequency count, the rate of responses, and the share of responses. The frequency is the number of positive or Yes responses. The rate is the frequency divided by the net sample size. The net sample size is the total sample eligible to answer the CATA questions less those who had missing responses (i.e., 677), or n = 5964. The rate is
Fig. 4.23 This is a missing value report for QC1
the proportion of the net sample that selected each option. The share is the frequency divided by the total responses, which is the sum of the frequencies over all the options. You can view each respondent as having one vote for each option, so each person got 13 votes to cast for this example. They can cast 0 or 1 of them for each option. They cannot cast more than one vote per option. The frequency is the total votes cast for each option, and the total responses is the total votes cast over all options. The code to create the summary table for the CATA vet question is in Fig. 4.26. The CATA report in Fig. 4.27 is informative but only partially because it does not tell you about the statistical differences in the response patterns among the options. A response pattern is the set of responses—Yes and No—for an option across all respondents. Each option has its own response pattern. The question is, "Is there a difference among the response patterns?" If the data are real numbers, called floating-point numbers (or, simply, floats) in Python, then an ANOVA test could be done to determine if there is a statistical difference among the factors. The DataFrame is converted from wide form to long form with two variables: the floating-point numbers as one variable and the option labels as a second. An ANOVA F-test would indicate if there is a statistical difference. I show an example in Fig. 4.16. The data for the 13 options are stacked using the Pandas melt function, which creates two variables: variable and value, where "variable" is categorical. The F-statistic is very large and its p-value is very small, so the Null Hypothesis is rejected. The Null Hypothesis is that there is no difference among the levels of the option variable.
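A sketch of this stacking and F-test is shown below, assuming cata_cols is a hypothetical list holding the 13 QC1a option columns coded 0/1; it is not the exact code behind the figures.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Stack the 13 option columns into long form: 'variable' (the option) and 'value' (0/1)
long_df = pd.melt(df[cata_cols])

# One-way ANOVA of the 0/1 values on the categorical option variable
model = ols('value ~ C(variable)', data=long_df).fit()
print(sm.stats.anova_lm(model, typ=1))   # F-test for differences among the options
```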
Fig. 4.24 This is code to generate a missing value report for QC1. The summary table and pie chart from this code are shown in Fig. 4.25
Cochran's Q test is an alternative to the F-test. This procedure looks for differences among a series of "treatments" for blocks of measures when those treatments are binary coded as 0 and 1 values. The treatments in this case are the CATA questions. The blocks are the respondents themselves.13 The measures are the 0/1 response values, so Cochran's Q test is applicable. This is a chi-square test that is a more general version of the McNemar Test for two nominal variables that I discussed in Sect. 4.3.3.
13 In the Design of Experiments literature, a treatment is an experimental condition placed on an object that will be measured. The measure is the effect of that treatment. The objects may be grouped into blocks designed to be homogeneous to remove any nuisance factors that might influence the responses to the treatments. Only the effect of the treatments is desired. In the survey context, the treatments are the CATA questions, and the blocks are the respondents themselves. See Box et al. (1978) for a discussion of experimental designs.
Fig. 4.25 This is the missing value summary table and pie chart from the code shown in Fig. 4.24

Table 4.5 This is the attribution of differences for the VA CATA question: QC1

Category              Count
Total                 8710
Ineligible            −1953
Eligible              6757
Missing               −116
Non-missing           6641
No answer             −677
QC1a respondents      5964
Yes responses         4062
No responses          +1902
                      -----
                      5964
Cochran's Q test has three assumptions:14
1. The number of blocks is "large."
2. The blocks are randomly selected from the population of all possible blocks.
3. The measures for the treatments are coded 0 or 1.

14 See the article "Cochran's Q test" at https://en.wikipedia.org/wiki/Cochran%27s_Q_test. Last accessed September 30, 2020.
Fig. 4.26 This is the code to create the CATA summary table shown in Fig. 4.27
These assumptions are satisfied for our problem. Each vet is a block, and clearly, the number of blocks is large. The number of vets (i.e., blocks) is a sample of the population of vets. Finally, the values are coded as 0 and 1. The Null Hypothesis is that there are no differences in the proportions measured "positive" among the treatments versus the Alternative Hypothesis that there is at least one proportion that is different from the others. This is just the proportion counterpart to the Null and Alternative Hypotheses for means I discussed above. Figure 4.28 has the proportions for the CATA question. I provide the setup for Cochran's Q test and the result in Fig. 4.29. The p-value is less than 0.05, so you can conclude that there is at least one difference. The next step is to find which pairs of questions, the treatments, are different. This is the same issue as for the means. However, Tukey's HSD procedure is inappropriate in this case because, as I already noted, you have proportions and not means. The Marascuilo Procedure is appropriate. See Marascuilo (1964) and Marascuilo and McSweeney (1967). This is a series of chi-square tests applied to each pair of treatments. Since there are 13 CATA questions, there are 78 (= \binom{13}{2}) pairs. This is a lot. Nonetheless, the procedure can be used to identify the significant pairs for further analysis.
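Before turning to those pair-wise comparisons, here is a sketch of the kind of Cochran's Q setup shown in Fig. 4.29. The test is available in statsmodels (in its contingency_tables module in recent versions); the cata_cols list is again a hypothetical set of the 0/1 option columns.

```python
from statsmodels.stats.contingency_tables import cochrans_q

# Rows are respondents (the blocks); columns are the CATA options (the treatments)
X = df[cata_cols].dropna().to_numpy()
result = cochrans_q(X)
print(f'Q = {result.statistic:.3f}, p-value = {result.pvalue:.4f}')
```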
Fig. 4.27 This is the summary table resulting from the code in Fig. 4.26
The steps for the Marascuilo Procedure are:
1. Create a list of all pair-wise combinations of the options. If n is the number of options, then the number of pairs is \binom{n}{2}. This is how I found the 78 pairs. The list of pairs is found using the combinations function in the itertools package, which has a number of functions designed to be highly efficient in terms of speed and memory use when used for iterative situations. They are described as iterator "building blocks" in the Python documentation.15 This means they can be used in those situations when you have to iterate over a series of operations. One itertool useful for this application is combinations, which creates the actual combinations, not just the count of combinations.
2. Calculate the critical chi-square value using the number of options less one as the degrees-of-freedom; the degrees-of-freedom is c − 1 where c is the number of factors involved.
3. Calculate the proportion for each option and then the difference in proportions for each pair-wise combination. The proportions are simply the column means. The mean of a 0/1 encoded variable is the proportion of 1 values.
4. Do a chi-square test of each pair-wise combination. Save the p-values.
5. Summarize the test results.
I illustrate these steps in Fig. 4.30 and summarize the results in Fig. 4.31 but only for a few pair-wise combinations since there are 78 pairs. The Python code for creating the summary is shown in the figure. Since this output is long, your interest
15 See https://docs.python.org/3/library/itertools.html#itertools.count. Last accessed October 1, 2020.
Fig. 4.28 This is a proportion summary table for the CATA question, QC1a
may be better served by creating a shorter summary, perhaps focusing on the top five significant pairs. The criterion for the top five might be the absolute difference in the proportions. I illustrate how this is done in Fig. 4.32.
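The sketch below implements one common formulation of the Marascuilo Procedure, comparing each pair-wise difference in proportions to a critical range built from the chi-square critical value. It is not the exact code in Figs. 4.30 to 4.32, and the DataFrame and column list are hypothetical.

```python
from itertools import combinations
import numpy as np
import pandas as pd
from scipy.stats import chi2

X = df[cata_cols].dropna()
n = X.shape[0]
props = X.mean()                                  # column means = option proportions
chi2_crit = chi2.ppf(0.95, len(cata_cols) - 1)    # critical value with c - 1 df

rows = []
for a, b in combinations(cata_cols, 2):           # all 78 pairs for 13 options
    diff = abs(props[a] - props[b])
    crit_range = np.sqrt(chi2_crit) * np.sqrt(props[a] * (1 - props[a]) / n +
                                              props[b] * (1 - props[b]) / n)
    rows.append({'pair': f'{a} vs {b}', 'abs_diff': diff,
                 'critical_range': crit_range, 'significant': diff > crit_range})

summary = pd.DataFrame(rows)
print(summary.nlargest(5, 'abs_diff'))            # top five by absolute difference
```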
4.4 Advanced Tabulations A crosstab, as I noted before, is the most used, and probably abused, tool in survey analysis. A tab per se, however, is limited to simple tabulations. More complex tabulations may be required although they may not necessarily be more informative—complex does not imply informative and, in fact, may obfuscate information and insight. Pivoting a table, basically reshaping or reorganizing its rows and columns, can be informative and insightful. Such a table is simply referred to as a pivot table.
Fig. 4.29 This is the summary table of results for Cochran's Q test for the CATA Question QC1a
As an example of a pivot table, suppose you want to examine the proportion of veterans who ever enrolled in VA healthcare by their military branch and gender. The question is QE1, “Have you ever been enrolled in VA healthcare?”. This is a Core Question, and the military branch and gender are Surround Questions that help to clarify it. The Core Question can easily be examined using a pivot table. Before doing so, however, you need to examine the data for this question. A survey respondent could choose one option from among three: “Yes,” “No,” and “Don’t Know.” The last, most likely, was included to reflect people’s tendency to enroll in programs but then later forget they did so. For whatever reason this was included, I
Fig. 4.30 This is the Python code for the Marascuilo Procedure for the CATA question QC1a
view it as a nuisance option. The distribution I provide in Fig. 4.33 shows that 6.6% of the sample selected it. I believe this is small and does not help to understand the core issue: enrollment in the VA healthcare. I dropped this option and at the same
Fig. 4.31 This is a summary of the results for the Marascuilo Procedure for the CATA question QC1a. The report was truncated for this display because the output is quite long due to the number of pairs
time recoded the other two options: “No” = 0 and “Yes” = 1.16 The recoding “No” = 0 is used to simplify calculations: Recall that the mean of 0/1 values is the sample proportion. In this case, the mean is the sample proportion that ever enrolled in the VA healthcare. I show the cleaning and recoding in Fig. 4.34. I first made a copy of the DataFrame using the DataFrame’s copy( ) method because I did not want to lose the original data. Then I used its replace method with a dictionary as the argument. The dictionary has one key, which is the column to be recoded (i.e., everEnrolled). The value for the key is another dictionary that has the old values as keys and the recodes as the values. Notice that “Don’t Know” is recoded as a Numpy NaN using np.nan. This allowed me to use the dropna method to drop all respondents with a “Don’t Know” response. This type of recoding is very common in survey analysis.
16 The original data had "Yes" = 1, "No" = 2, and "Don't Know" = 3.
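A compact sketch of this recoding, following the everEnrolled example and the codes in the footnote (it is not the exact code in Fig. 4.34), is:

```python
import numpy as np

df_clean = df.copy()                                               # keep the original data intact
df_clean = df_clean.replace({'everEnrolled': {2: 0, 3: np.nan}})   # 'No' (2) -> 0, 'Don't Know' (3) -> NaN
df_clean = df_clean.dropna(subset=['everEnrolled'])                # drop the 'Don't Know' respondents
```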
Fig. 4.32 This is an abbreviated summary of the results for the Marascuilo Procedure for the CATA question QC1a. Just the top 10 pairs in terms of the absolute difference in the proportions are shown. The Pandas nlargest function is used
Fig. 4.33 This is the frequency distribution for the enrollment question, QE1. The cumulative columns usually included with the distribution were omitted since accumulation does not make sense in this case
The pivot table, based on the clean data, is shown in Fig. 4.35. The arguments to the pivot_table function are the DataFrame, a list that identifies the index or row labels for the table, and the variable to be aggregated in the table. The list for the index consists of the “branch” and “gender” variable names indicating that a MultiIndex will be needed. A MultiIndex shows how the rows of the DataFrame
Fig. 4.34 This is the frequency distribution for the enrollment question, QE1, after the “Don’t Know” responses were deleted and data were recoded
are divided into layers, in this case branch and gender layers. Basically, a Cartesian Product of branch and gender is created.17 The Pandas pivot_table function is very powerful and flexible. It is an option that you definitely have to further explore for your analyses.
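A sketch of the call just described is shown below. The branch, gender, and everEnrolled column names follow the text's example, but this is not necessarily the exact code behind Fig. 4.35.

```python
import pandas as pd

# The mean of the 0/1 enrollment variable within each branch/gender cell is the
# proportion that ever enrolled in VA healthcare
pvt = pd.pivot_table(df_clean, values='everEnrolled',
                     index=['branch', 'gender'], aggfunc='mean')
print(pvt)
```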
4.5 Advanced Visualization

In addition to creating advanced tabulations using pivot tables, you can also use advanced data visualizations to further analyze your survey data. These include:
1. Extended visualization to show multiple variables and how they might be changed. These include, but are not limited to:
(a) Grouped elements
(b) Facet (lattice) graphs
17 “In mathematics, specifically set theory, the Cartesian product of two sets A and B, denoted A × B is the set of all ordered pairs (a, b) where a is in A and b is in B.” Source: Wikipedia article “Cartesian product”: https://en.wikipedia.org/wiki/Cartesian_product. Last accessed on October 2, 2020. For this problem, the collection of branches is one set, and the collection of gender is another.
Fig. 4.35 This is the pivot table for the enrollment question, QE1, after the “Don’t Know” responses were deleted and data were recoded
(c) Use of hues and markers to highlight subgroups
(d) Dynamic graphs using plotly, a dynamic graphing package available for Python
2. Geospatial maps for portraying geographic distributions
I will discuss each of these in the following subsections.
4.5.1 Extended Visualizations

A simple way to extend a visual is to create groups of objects such as bars or boxplots. The groups are based on a categorical variable. I prefer to use boxplots in this case because I have found them to be very informative and revealing about distributions among the groups. I show a grouped boxplot for the vets' age distribution in Fig. 4.36. Notice the pattern. Compare the information you glean from this figure against Fig. 4.14.
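A grouped boxplot like the one in Fig. 4.36 can be sketched with Seaborn; the DataFrame and column names below are hypothetical stand-ins for the VA data.

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(9, 4))
sns.boxplot(x='branch', y='age', data=df, ax=ax)   # one box per military branch
ax.set_xlabel('Military Branch')
ax.set_ylabel('Age')
plt.show()
```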
Fig. 4.36 This is a grouped boxplot of the vets’ age distribution
In many instances, standard visuals (i.e., pie and bar charts) are extended by making them three-dimensional (3-D) and using colorful palettes. The sole aim of using 3-D charts is the infographic impact a third dimension offers beyond what a “flat” 2-D visual offers. They are simply more dramatic. The third dimension, however, is difficult to interpret and comprehend since we, as human beings, have perspective problems. For example, we have difficulty distinguishing between angle sizes of slices in a 3-D pie chart when the slices fade away from us. This is on top of the difficulties we have distinguishing between angles of slices in a 2-D pie chart; the third dimension compounds, if not accentuates, our physical visual shortcoming. Bright colors usually add little to nothing to the visual display but are used because the palettes are available.18 Our depth perception problems do not imply that 3-D graphs should not be used for data exploration and analysis. Twisting and turning a 2-D graph so that you look at it from different angles using dynamic graphing tools can reveal more patterns and anomalies than you originally expected or saw. This twisting and turning hold for 3-D graphs as well. The 3-D graphs can often be rotated to highlight different perspectives of the data, but this may result in exposing some hidden bars while hiding others that were originally visible. These visuals (3-D and bright palettes) are,
18 See, for example, comments by N. Robbins at https://www.forbes.com/sites/naomirobbins/2015/03/19/color-problems-with-figures-from-the-jerusalem-post/?sh=21fd52f71c7f. Last accessed December 20, 2020. Also see Few (2008).
Fig. 4.37 This is a 3-D bar chart of Question E1 of the VA survey
nonetheless, more for infographics and presentations, not for scientific examination and analysis of data, whether survey data or not, with the goal of extracting Rich Information from the data. The use of 3-D and color palettes tends to obfuscate rather than expose information in data. To illustrate the obfuscation, consider a 3-D bar chart for two categorical variables and one quantitative variable. Using the VA data, the categorical variables are the military branch and gender. The quantitative variable is the percent of respondents who indicated in Question E1 ("Have you ever been enrolled in VA healthcare?") that they have enrolled at some time in VA healthcare. The 3-D bar chart is shown in Fig. 4.37. I show the code to generate this chart in this chapter's appendix because it is long. This bar chart is typical of what will appear in a presentation to clients or upper management. Unfortunately, it does not clearly and unambiguously display Rich Information. Notice that some bars are hidden or obscured by other bars. For example, look at the second row of bars for Male respondents. The next to last bar for the Army respondents is barely perceptible. Notice also that it is impossible
to clearly identify the heights of the bars of the Female respondents because they are placed too far from the Z-axis. Finally, it is impossible to distinguish between the height of the bar for Female Coast Guard veterans and their Male counterparts because the bars are separated to give the depth perception. Not only can you not tell the response rate for these two groups, but you also cannot tell if they have the same rate or different rates. The pivot table display in Fig. 4.35 is more informative and provides Rich Information in a clear and concise way. For the scientific examination of survey data, you could use facet graphs to emphasize relations across several variables in your data set. Facet graphs are also referred to as panel graphs or trellis graphs. See Robbins (2010) for some discussion. The idea is that one large graph can be split into several smaller graphs, usually with the conditions that the axes are scaled alike and that they are placed next to each other (vertically or horizontally). The like-scaling is important because this simplifies comparisons. I show one possible configuration of a facet plot in Fig. 4.38 that uses the same data I presented in Figs. 4.35 and 4.37. This involves using the grid capability in the Seaborn graphics package.
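A minimal sketch of such a facet display, using Seaborn's catplot to build the grid (the column names are the hypothetical ones used above, and this is not the exact code behind Fig. 4.38), is:

```python
import seaborn as sns

# One panel per gender; the bar height is the mean of the 0/1 enrollment variable,
# i.e., the proportion ever enrolled, by branch, with like-scaled axes across panels
g = sns.catplot(x='branch', y='everEnrolled', col='gender',
                data=df_clean, kind='bar')
g.set_axis_labels('Military Branch', 'Proportion Ever Enrolled')
```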
4.5.2 Geographic Maps

An increasingly important tool is a geographic map illustrating where respondents live, work, shop, vacation, and so forth. This geography will most likely be part of the Surround Question set. These maps could then be used like any other Surround Question, or they could be used for the deeper analysis of the Core Questions. For the former, you could simply display locations as part of the profiling of respondents, no different than summarizing gender, income, and education distributions with pie and bar charts. You could, however, subset your geographic data by other Surround Questions (e.g., gender, age, education) and drill down on Core Questions. I illustrate one example of a geographic map with the San Francisco airport customer satisfaction study. There was a question that asked respondents for their country of origin, and then for those who originated in the United States, it further asked for their home state. I view this as a set of Surround Questions for profiling purposes. I imported the satisfaction data into a Pandas DataFrame, keeping the country of origin and state. Then I subsetted that data to get just the respondents who came from the United States. After a simple examination of the data, I found that Guam was included, but also, and more importantly, most of the people came from California. This should not be surprising since the airport is in California. I decided that I needed to filter out those people from Guam as well as those from California who would distort any analysis. I used the query method for this. I provide the code in Fig. 4.39. Once the data were subsetted, I then calculated the percent of respondents by state. California was not included, so the base is all respondents from the remaining 49 states. I used the value_counts method with the normalize = True argument,
Fig. 4.38 This is a faceted version of a bar chart of Question E1 of the VA survey. I created a grid of one row and two columns. The code snippet indicates that creating this is more efficient than creating the 3-D bar chart in Fig. 4.37. You can also easily change the grid display to create alternative views
which I also multiplied by 100 to get percents. I put the percents along with the state codes into a DataFrame and renamed the columns to "State" and "Percent." You can see how I did this in Fig. 4.39. Once I had the data I wanted, I then created a color-coded map of the United States using the plotly package. You can install plotly using pip install plotly or conda install -c plotly plotly. You then import the graph_objects module using import plotly.graph_objects as go where go is the alias for graph_objects. I created the geographic map using the code I provide in Fig. 4.40. I show the final map
Fig. 4.39 This illustrates how the data are prepared for the geographic map of the US state of origin of the San Francisco airport customer satisfaction respondents. Notice how the query method is used to select US residents and filter out Guam and California
Fig. 4.40 This illustrates the setup to produce the geographic map of the US state of origin of the San Francisco airport customer satisfaction respondents
Fig. 4.41 This is the geographic map of the US state of origin of the San Francisco airport customer satisfaction respondents. Notice that Texas and Washington state are the top states of origin
in Fig. 4.41. What stands out immediately is that many of the non-California respondents come from Texas and Washington state, followed by Oregon, New York, Florida, and Illinois.
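A sketch of the kind of setup shown in Fig. 4.40, assuming a DataFrame named state_df with the "State" and "Percent" columns built as described, is:

```python
import plotly.graph_objects as go

fig = go.Figure(data=go.Choropleth(
    locations=state_df['State'],      # two-character state codes
    z=state_df['Percent'],            # value used for the color coding
    locationmode='USA-states',
    colorscale='Blues',
    colorbar_title='Percent'
))
fig.update_layout(title_text='Home State of US Respondents', geo_scope='usa')
fig.show()
```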
4.5.3 Dynamic Graphs A shortcoming of the data visualization tools I discussed and presented so far is that they are a static view. This means they show just an image that you cannot interact with to examine different views of the same data without changing code. A dynamic view of your data gives you a more flexible and powerful way to extract information from your data. A dynamic set of tools is also time-saving since you do not have to change code, but instead, you change views by dragging and clicking on the image. I summarize these points in Fig. 4.42. The map in Fig. 4.41 is actually dynamic although that is not obvious on a printed page. You can resize and drag the map using a mouse as well as place the mouse cursor over any state on the map to see the state two-character code (e.g., NJ for New Jersey) and the percent of respondents from that state.
Fig. 4.42 This is a high-level summary of the differences between a static and dynamic visualization of data: a static view is non-interactive and rigid, offering just snapshots with no drill-downs, while a dynamic view is interactive and flexible, with drill-downs, links, and changeable views
Appendix This appendix provides brief overviews and, maybe, refresher material on some key statistical concepts.
Refresher on Expected Values

The expected value of a random variable is just a weighted average of the values of the random variable. The weights are the probabilities of seeing a particular value of the random variable. Although the averaging is written differently depending on whether the random variable is discrete or continuous, the interpretation is the same in either case. If Y is a discrete random variable, then

E(Y) = \sum_{-\infty}^{+\infty} y_i \times p(y_i)

where p(y) = Pr(Y = y), the probability that Y = y, is a probability function such that

0 \le p(y) \le 1 \quad \text{and} \quad \sum p(y) = 1.

So E(Y) is just a weighted average. This is the expected value of Y.

If Y is a continuous random variable, then

E(Y) = \int_{-\infty}^{+\infty} y \, f(y) \, dy

where

\int_{-\infty}^{+\infty} f(y) \, dy = 1.

The function, f(y), is the probability density function of Y at y. It is not the probability of Y = y, which is zero. It is easy to show the following:
1. E(aX) = a × E(X) where a is a constant.
2. E(aX + b) = a × E(X) + b where a and b are constants.
3. V(aX) = a² × V(X) where V(·) is the variance defined as V(X) = E[X − E(X)]².
4. V(aX + b) = a² × V(X).

It is also easy to show, although I will not do it here, that the expected value of a linear function of random variables is linear. That is, for two random variables, X and Y, and for c_i a constant, then

E(c_1 × X + c_2 × Y) = c_1 × E(X) + c_2 × E(Y).

You can also show that V(Y_1 ± Y_2) = V(Y_1) + V(Y_2) if Y_1 and Y_2 are independent. If they are not independent, then V(Y_1 ± Y_2) = V(Y_1) + V(Y_2) ± 2 × COV(Y_1, Y_2) where COV(Y_1, Y_2) is the covariance between the two random variables. For a random sample, Y_1 and Y_2 are independent.
Expected Value and Standard Error of the Mean

You can now show that if Y_i, i = 1, 2, . . . , n, are independent and identically distributed (commonly abbreviated as iid) random variables, with a mean E(Y) = μ and variance V(Y) = E(Y − μ)² = σ², then

E(\bar{Y}) = \frac{1}{n} \times \sum_{i=1}^{n} E(Y_i) = \frac{1}{n} \times \sum_{i=1}^{n} \mu = \frac{1}{n} \times n \times \mu = \mu

and

V(\bar{Y}) = \frac{1}{n^2} \times \sum_{i=1}^{n} V(Y_i) = \frac{1}{n^2} \times \sum_{i=1}^{n} \sigma^2 = \frac{1}{n^2} \times n \times \sigma^2 = \frac{\sigma^2}{n}.

This last result can be extended. Suppose you have two independent random variables, Y_1 ∼ N(μ_1, σ_1²) and Y_2 ∼ N(μ_2, σ_2²). Then

V(\bar{Y}_1 + \bar{Y}_2) = \frac{1}{n_1^2} \times \sum_{i=1}^{n_1} V(Y_{i1}) + \frac{1}{n_2^2} \times \sum_{i=1}^{n_2} V(Y_{i2}) = \frac{1}{n_1^2} \times \sum_{i=1}^{n_1} \sigma_1^2 + \frac{1}{n_2^2} \times \sum_{i=1}^{n_2} \sigma_2^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.

This last result is used when two means are compared.
Deviations from the Mean

Two very important results about means are:
1. \sum_{i=1}^{n} (Y_i − \bar{Y}) = 0.
2. E(\bar{Y} − μ) = 0.

The first is for a sample; the second is for a population. Regardless, both imply that a function of deviations from the mean is zero. To show the first, simply note that

\sum_{i=1}^{n} (Y_i − \bar{Y}) = \sum_{i=1}^{n} Y_i − n \times \bar{Y} = \sum_{i=1}^{n} Y_i − n \times \frac{1}{n} \sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} Y_i − \sum_{i=1}^{n} Y_i = 0.

The second uses the result I showed above that E(\bar{Y}) = μ. Using this,

E(\bar{Y} − μ) = E\left( \frac{1}{n} \sum_{i=1}^{n} Y_i \right) − μ = \frac{1}{n} \times \sum_{i=1}^{n} E(Y_i) − μ = \frac{1}{n} \times n \times μ − μ = 0.
Some Relationships Among Probability Distributions

There are several distributions that are often used in survey analyses:
1. Normal (or Gaussian) distribution
2. χ² distribution
3. Student's t-distribution
4. F-distribution
These are applicable for continuous random variables. They are all closely related, as you will see.
Normal Distribution

The normal distribution is the basic distribution; other distributions are based on it. The Normal's probability density function (pdf) is

f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y - \mu)^2}{2\sigma^2}}

where μ and σ² are two population parameters. A succinct notation is Y ∼ N(μ, σ²). This distribution has several important properties:
1. All normal distributions are symmetric about the mean μ.
2. The area under an entire normal curve traced by the pdf formula is 1.0.
3. The height (i.e., density) of a normal curve is positive for all y. That is, f(y) > 0 ∀y.
4. The limit of f(y) as y goes to positive infinity is 0, and the limit of f(y) as y goes to negative infinity is 0. That is, \lim_{y \to \infty} f(y) = 0 and \lim_{y \to -\infty} f(y) = 0.
5. The height of any normal curve is maximized at y = μ.
6. The placement and shape of a normal curve depends on its mean μ and standard deviation σ, respectively.
7. A linear combination of normal random variables is normally distributed. This is the Reproductive Property of Normals.

A standardized normal random variable is Z = (Y − μ)/σ for Y ∼ N(μ, σ²). This can be rewritten as a linear function: Z = (1/σ) × Y − μ/σ. Therefore, Z is normally distributed by the Reproductive Property of Normals. Also, E(Z) = E(Y)/σ − μ/σ = 0 and V(Z) = (1/σ²) × σ² = 1. So, Z ∼ N(0, 1). I show a graph of the standardized normal in Fig. 4.43.
Chi-Square Distribution

If Z ∼ N(0, 1), then Z² ∼ χ²_1 where the "1" is one degree-of-freedom. Some properties of the χ² distribution are:
1. The sum of n χ² random variables is also χ² with n degrees-of-freedom: \sum_{i=1}^{n} Z_i^2 \sim \chi_n^2.
2. The mean of the χ²_n is n and the variance is 2n.
3. The χ²_n approaches the normal distribution as n → ∞.

I show a graph of the χ² pdf for 5 degrees-of-freedom in Fig. 4.44.

Student's t-Distribution

The ratio of two random variables, where the numerator is N(0, 1) and the denominator is the square root of a χ² random variable with ν degrees-of-freedom divided by the degrees-of-freedom, follows a t-distribution with ν degrees-of-freedom.
Fig. 4.43 This is the standardized normal pdf
\frac{Z}{\sqrt{\chi_\nu^2 / \nu}} \sim t_\nu
I show a graph of the Student’s t pdf for 23 degrees-of-freedom in Fig. 4.45.
F-Distribution The ratio of a χ 2 random variable with ν1 degrees-of-freedom to a χ 2 random variable with ν2 degrees-of-freedom, each divided by its degrees-of-freedom, follows an F-distribution with ν1 and ν2 degrees-of-freedom.
Fig. 4.44 This is the χ² pdf for 5 degrees-of-freedom. The shape changes as the degrees-of-freedom change
\frac{\chi_{\nu_1}^2 / \nu_1}{\chi_{\nu_2}^2 / \nu_2} \sim F_{\nu_1, \nu_2}

Note that the F_{1, \nu_2} is t². You can see this from the definition of a t with ν degrees-of-freedom: \frac{Z}{\sqrt{\chi_\nu^2 / \nu}} \sim t_\nu.

I show a graph of the F-distribution pdf for 3 degrees-of-freedom in the numerator and 15 degrees-of-freedom in the denominator in Fig. 4.46.
Equivalence of the F and t Tests for Two Populations

Guenther (1964, p. 46) shows that when there are two independent populations, the F-test and the t-test are related. In particular, he shows that

F_{1, n_1 + n_2 - 2} = t_{n_1 + n_2 - 2}^2.    (4.42)
Fig. 4.45 This is the Student's t pdf for 23 degrees-of-freedom. The shape changes as the degrees-of-freedom change

Fig. 4.46 This is the F-distribution pdf for 3 degrees-of-freedom in the numerator and 15 degrees-of-freedom in the denominator. The shape changes as the degrees-of-freedom change
Fig. 4.47 This is the Python code I used to create Fig. 4.37
Part of this demonstration is showing that the denominator of the F-statistic is

\frac{SSW}{n_1 + n_2 - 2} = \frac{(n_1 - 1) \times s_1^2 + (n_2 - 1) \times s_2^2}{n_1 + n_2 - 2} \times \left( \frac{1}{n_1} + \frac{1}{n_2} \right)    (4.43)
which is the result I stated in (4.11) and (4.12).
Code for Fig. 4.37 The code to generate the 3-D bar chart in Fig. 4.37 is shown here in Fig. 4.47. This is longer than previous code because more steps are involved. In particular, you have to define the plotting coordinates for each bar in addition to the height of each bar. The coordinates are the X and Y plotting positions, and the base of the Z-dimension in the X − Y plane. This base is just 0 for each bar. Not only are these coordinates needed, but the widths of the bars are also needed. These are indicated in the code as dx and dy. The height of each bar is the Z position in a X − Y − Z three-
dimensional plot. This is indicated by dz. The data for the graph comes from the pivot table in Fig. 4.35. I did some processing of this data, which I also show in Fig. 4.47. Notice, incidentally, that I make extensive use of list comprehensions to simplify list creations.
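As a rough, self-contained sketch of these mechanics (the heights are made-up numbers, not the survey values, and this is not the code in Fig. 4.47), a 3-D bar chart needs the X and Y positions, the Z base, the bar widths, and the bar heights:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (needed on older matplotlib versions)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')

x = np.tile(np.arange(3), 2)             # X plotting positions (e.g., three branches)
y = np.repeat([0, 1], 3)                 # Y plotting positions (e.g., two genders)
z = np.zeros(6)                          # base of each bar in the X-Y plane
dx = dy = 0.5 * np.ones(6)               # bar widths
dz = np.array([0.55, 0.48, 0.60, 0.52, 0.45, 0.58])   # hypothetical bar heights

ax.bar3d(x, y, z, dx, dy, dz)
ax.set_xlabel('Branch'); ax.set_ylabel('Gender'); ax.set_zlabel('Proportion Enrolled')
plt.show()
```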
Chapter 5
Advanced Deep Survey Analysis: The Regression Family
Contents
5.1 The Regression Family and Link Functions
5.2 The Identity Link: Introduction to OLS Regression
    5.2.1 OLS Regression Background
    5.2.2 The Classical Assumptions
    5.2.3 Example of Application
    5.2.4 Steps for Estimating an OLS Regression
    5.2.5 Predicting with the OLS Model
5.3 The Logit Link: Introduction to Logistic Regression
    5.3.1 Logistic Regression Background
    5.3.2 Example of Application
    5.3.3 Steps for Estimating a Logistic Regression
    5.3.4 Predicting with the Logistic Regression Model
5.4 The Poisson Link: Introduction to Poisson Regression
    5.4.1 Poisson Regression Background
    5.4.2 Example of Application
    5.4.3 Steps for Estimating a Poisson Regression
    5.4.4 Predicting with the Poisson Regression Model
Appendix
I will discuss some advanced analysis methods in this chapter. Specifically, I will discuss modeling survey responses using linear regression for continuous variable responses, logistic regression for binary variable responses, and Poisson regression for count responses. The latter two are particularly important and relevant for survey data analysis because many survey Core Questions have discrete, primarily binary and count, responses such as “Will you vote in the next presidential election?”, “Do you shop for jewelry online?”, and “How many times have you seen your doctor?” Logistic regression leads to a form of analysis called key driver analysis (KDA) which seeks the key factors that drive or determine a Core Question. This is a common form of analysis in, for example, customer satisfaction studies where
knowing the degree of customer satisfaction is not enough for directing business policy decisions. What determines or “drives” the satisfaction scores is equally important, if not more so, because the drivers can then be used as policy tools.
5.1 The Regression Family and Link Functions Linear regression, using the ordinary least squares (OLS) approach, is the form of regression analysis most people are exposed to in an introductory statistics course. Regression per se is concerned with fitting a function to data with the goal of extracting mean trends in a target variable (i.e., a dependent variable) as well as identifying the factors (i.e., independent variables or features) that cause or drive the target and its patterns. The patterns could be direction (increasing or decreasing), degree of linearity (linear or curvilinear), and spread (increasing, decreasing, or constant variance). These are explanatory goals. There is also a predictive goal so that a model estimated from data can inform someone of outcomes likely in the future or under different conditions or circumstances. The function I referred to in the previous paragraph is, more formally, a function or transformation of the mean of the target that equals a linear combination of the independent variables. The mean is the expected value. Basically, there is a link function that links this mean to this linear combination. For OLS, the link function is simple: it is the identity function. You can view the identity function as a simple 1–1 mapping or transformation of the expected value of the target to the linear combination of the independent variables. It is 1–1 because, by the nature of the regression model, the expected value of the target is already identically equal to the linear combination, so there really is no functional transformation needed. There are many link functions, three of which I list in Table 5.1. The identity link is just one of them. I will review the linear regression based on OLS and the identity link in the next section followed by the logit link. The Poisson link is less commonly used in survey analysis which is a mistake because many survey Core Questions are about counts. I will, nonetheless, develop the background for the Poisson regression in this chapter. See Montgomery et al. (2012) and Cameron and Trivedi (2005) for detailed developments of the Poisson link. The link functions lead to a family of methods for fitting data. This family is the generalized linear model (GLM) family which is very broad since there are many link functions. Each link function defines a “cousin” in the family, so OLS is just one “cousin.” This is a broader perspective of regression than what is taught in a basic statistics course because it points to a commonality across a range of what appears to be different methods. The methods are related, and the link functions provide that commonality. One form or another of the target variable often, but not always, appears in a questionnaire. The continuous target I mention in Table 5.1 could result from Core Questions such as:
Table 5.1 This is a partial listing of link functions

Target variable type    Link function
Continuous              Identity
Discrete (binary)       Logit
Discrete (count)        Log
• How long on average have you pursued (e.g., hours spent watching TV on a typical evening in the past month)?
• How much did you spend on average on your last shopping occasion on (e.g., average price of yogurt)?
• What proportion of your income did you spend this past year on (e.g., jewelry)?

The binary target could be just a simple Yes/No response to a Core Question such as:
• Did you vote in the last presidential election?
• Did you see an ad for (e.g., product, service, political) on TV last week?
• Did you buy (e.g., jewelry) as a gift this past holiday season?

And finally, the count target could result from Core Questions such as:
• How many times did you vote in the last five elections?
• How many patients did your medical practice treat this past week?
• How many times do you shop online in a typical week?
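As a preview of how these links can be specified in Python, statsmodels' GLM interface accepts a family object (each with a default link) for each target type. The formulas and column names below are hypothetical, not from the case studies.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Identity link for a continuous target (equivalent to OLS)
continuous_fit = smf.glm('avg_spend ~ age + income', data=df,
                         family=sm.families.Gaussian()).fit()

# Logit link for a binary target
binary_fit = smf.glm('voted ~ age + income', data=df,
                     family=sm.families.Binomial()).fit()

# Log link for a count target
count_fit = smf.glm('n_visits ~ age + income', data=df,
                    family=sm.families.Poisson()).fit()
print(count_fit.summary())
```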
5.2 The Identity Link: Introduction to OLS Regression OLS regression is the first, and often the only, regression approach introduced in a basic statistics class. It is actually one of the many ways to estimate a regression model and perhaps the simplest to explain and develop in an introductory course. This explains why it is the first (and almost always the only) approach taught. The objective is often to tell you the fundamental concepts, so you are aware of them without overwhelming you with the complexities of regression, let alone the notion of a regression family. To do penetrating survey data analysis, however, to address the Core Questions, you must be familiar with regression concepts, not just be aware. I will introduce some of the complexities in this section which will enable you to use this tool to address a Core Question involving a continuous target. To do this, I will use the yogurt case study which asked a Core Question about the average number of yogurts purchased and the average price paid for those yogurts. See Paczkowski (2016) for a similar analysis.
5.2.1 OLS Regression Background

The objective for regression analysis is to fit a line (linear or curvilinear) to a collection of n data points or observations. One variable in the collection is designated as the dependent or target variable to be explained, and the others are the independent or feature variables used for the explanation. There is an assumed relationship between the two, usually linear. If Y_i is the ith target observation and X_i is the corresponding independent variable observation, then the linear relationship is expressed as Y_i = β_0 + β_1 × X_i where β_0 is the intercept or constant term and β_1 is the slope. This is a straight line. There is actually another factor added to this equation to account for unknown and unknowable factors that cause any observation to randomly deviate from the straight line. This is a disturbance term written as ε_i. It is assumed to be drawn from a normal distribution with mean zero and variance σ². The reason for the zero mean is that, on average, this factor should not be a key driver of the dependent variable; it is basically a nuisance factor that must still be accounted for. The full model is then

Y_i = β_0 + β_1 × X_i + ε_i    (5.1)
ε_i ∼ N(0, σ²)    (5.2)

for i = 1, 2, . . . , n. The intercept and slope must be estimated from your data. These estimates are written as β̂_0 and β̂_1. Once they are known, you can estimate Y_i as Ŷ_i = β̂_0 + β̂_1 × X_i. The key to the estimations is an estimate of the disturbance term called a residual or error. The residual is e_i = Y_i − Ŷ_i. You derive the estimates using a sum of squares of the residuals called the error sum of squares (SSE), which is \sum_{i=1}^{n} (Y_i − Ŷ_i)². In particular, you derive the formulas for the two parameters by minimizing SSE with respect to the two parameters. This method is called OLS because you are finding the minimum (i.e., least) value for the error sum of squares.
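In Python, the minimization is handled for you. A minimal sketch with the statsmodels formula interface, using hypothetical yogurt-survey column names, is:

```python
import statsmodels.formula.api as smf

# Regress average units purchased on average price paid (hypothetical column names)
ols_fit = smf.ols('units ~ price', data=yogurt_df).fit()
print(ols_fit.summary())   # estimated beta_0 and beta_1, standard errors, t-tests, and fit statistics
```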
5.2.2 The Classical Assumptions

There are a series of assumptions, which I refer to as the Classical Assumptions, for OLS regressions. These are basically special case assumptions in the sense that they are the simplest set needed to get fundamental regression results. The assumptions are:

Normally distributed    ε_i ∼ N, ∀i
Mean zero               E(ε_i) = 0, ∀i
Homoskedasticity        V(ε_i) = σ², ∀i
Independence I          COV(ε_i, ε_j) = 0, ∀i ≠ j
Independence II         COV(ε_i, X_i) = 0, ∀i
Linearity               Model is linear in the parameters
Fixed X                 X is fixed in repeated samples
Continuous Y            Y is continuous
See Paczkowski (2022, Chapter 6) for a discussion of these assumptions. Also see Kmenta (1971). These assumptions lead to the very important Gauss-Markov Theorem:

Gauss-Markov Theorem:1 Under the Classical Assumptions of the linear regression model, the estimators β̂_0 and β̂_1 have the smallest variance of all linear and unbiased estimators of β_0 and β_1. They are the best linear unbiased estimators (BLUE) of β_0 and β_1.

It is this theorem that gives us a sense of confidence in using the OLS method to estimate the parameters of a linear model. See Greene (2003) for a proof.
5.2.3 Example of Application

Assume that a Core Question for a consumer yogurt survey is the price elasticity of yogurt. This measure is used by economists and business pricing specialists to gauge the responsiveness of the number of units sold of a product (i.e., the quantity demanded) to a change in the product's price. It is used to develop a pricing strategy or respond to a competitor's price or marketing action. The elasticity is defined as the percentage change in units sold divided by the percentage change in price, or

\eta_P^Q = \frac{\%\Delta Q}{\%\Delta P} = \frac{\Delta Q}{\Delta P} \times \frac{P}{Q}

Basically, X "pushes" Ŷ away from Ȳ. Let me introduce two models. The first is the restricted model that does not have an explanatory variable, just a constant term. It is restricted because β_1 = 0 since there is no X. The second model is the unrestricted model that has an explanatory variable. The key question is: "Which model, restricted or unrestricted, is better?" The ANOVA helps you answer this question. You already know the three sums of squares from above. Now define the mean squares, which are the sums of squares divided by their respective degrees-of-freedom. The mean square for regression is MSR = SSR/df_SSR, and the mean square for error is MSE = SSE/df_SSE, where df_SSR = 1 and df_SSE = n − 2. Clearly, df_SSR + df_SSE = n − 1 = df_SST. These mean squares are really measures of variance. Recall from elementary statistics that the sample variance is s² = Σ(Y_i − Ȳ)²/(n − 1), which equals SST/(n − 1). I will now define the F-statistic as

F_C = \frac{MSR}{MSE} \sim F_{1, n-2}    (5.31)
So, F_C is a ratio of two measures of variance. Consider the expected values of the numerator and denominator of F_C. It can be shown that E(SSR/1) = σ² + β_1² × S_XX and E(SSE/(n − 2)) = σ², where S_XX = Σ(X_i − X̄)². Therefore, if H_0: β_1 = 0 is true, then both SSE/(n − 2) and SSR estimate σ² in an unbiased way, and, hence, you should expect the ratio F_C to be close to 1, on average. If H_0 is not true, then SSR is greater than σ² on average, and you should expect F_C to be larger than 1. You use F_C as a test statistic and reject H_0 when F_C is too large. "Too large" means large relative to a standard set by the F-distribution with 1 and n − 2 degrees-of-freedom. The F-statistic, F_C, is the test statistic used to compare the two models: the restricted model and the unrestricted model. The F_C statistic tells you which one is better. The hypotheses are:

H_0: Restricted model is better, i.e., β_1 = 0    (5.32)
H_A: Unrestricted model is better, i.e., β_1 ≠ 0    (5.33)

Table 5.2 This is the general structure of an ANOVA table. The degrees-of-freedom (df) and the sums of squares are additive: 1 + (n − 2) = n − 1 and SSR + SSE = SST

Source of variation   df      Sum of squares   Mean squares          F_C
Regression            1       SSR              MSR = SSR/1           MSR/MSE
Residual              n − 2   SSE              MSE = SSE/(n − 2)
Total                 n − 1   SST              MST = SST/(n − 1)
Notice that HA is not concerned with whether or not the parameter is > 0 or < 0, but only with it being 0 or not. There is a p-value associated with FC just as there is one for all test statistics. The decision rule is simple: reject H0 if p-value < 0.05; do not reject otherwise. The ANOVA table summarizes the sum of squares, the degrees-of-freedom, the mean squares, and the calculated F-statistic. I show the general structure in Table 5.2.
ANOVA Conjecture

I stated in the text when I derived the fundamental ANOVA result, almost as a conjecture, that \sum (\hat{Y}_i - \bar{Y}) \times (Y_i - \hat{Y}_i) = 0. This can be shown as follows. Consider the first term:

\hat{Y}_i - \bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_i - \bar{Y} = \bar{Y} - \hat{\beta}_1 \bar{X} + \hat{\beta}_1 X_i - \bar{Y} = \hat{\beta}_1 (X_i - \bar{X})

for a simple OLS model with one explanatory variable. Now consider the other term:

Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i = Y_i - \bar{Y} + \hat{\beta}_1 \bar{X} - \hat{\beta}_1 X_i = (Y_i - \bar{Y}) - \hat{\beta}_1 (X_i - \bar{X})

Collecting terms, summing, and simplifying using the formula for \hat{\beta}_1 proves the conjecture. The formula for \hat{\beta}_1 is

\hat{\beta}_1 = \frac{\sum (X_i - \bar{X}) \times (Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}.
Odds-Ratio Algebra

The odds-ratio derivation in (5.25) used a trick involving the natural log: x = e^{ln(x)}. To see that this is correct, use the Taylor series expansion, which is

f(x) = \sum_{i=0}^{\infty} f^{(i)}(a) \times \frac{(x - a)^i}{i!}

where f^{(i)}(a) is the ith derivative of the function evaluated at a (note that f^{(0)}(a) is just the original function evaluated at a) and i! is i factorial. For the odds ratio, I used f(x) = e^{ln(x)}. First note that for Z = c^x where c is a constant, then dZ/dx = c^x × ln(c). For our problem, c = e so ln(e) = 1. Now use this with the first two terms of the Taylor series expansion: e^{ln(x)} = e^{ln(a)} + e^{ln(a)} × (x − a). Let a = 1 so ln(a) = 0. Then,

e^{ln(x)} = e^{ln(1)} + e^{ln(1)} × (x − 1) = 1 + x − 1 = x.

This is why I used p_1/(1 − p_1) = e^{ln(p_1/(1 − p_1))}.
Elasticities from Logs

Assume that your model is ln(Y) = β_0 + β_1 × ln(X). Then

\frac{1}{Y} \times dY = \beta_1 \times \frac{1}{X} \times dX.

Therefore,

\frac{dY}{Y} \times \frac{X}{dX} = \beta_1 = \eta_X.
Other OLS Output

The statsmodels output reports the Log-Likelihood, AIC, and BIC. The Log-Likelihood is the maximum value of the likelihood function. This is used as a factor in the calculation of AIC and BIC, both of which show the amount of information left in the data. The AIC is calculated as AIC = −2 × ln(L) + 2 × p, and BIC is calculated as BIC = −2 × ln(L) + p × ln(n), where L is the maximum likelihood value. See Paczkowski (2022) and Montgomery et al. (2012) for discussion of these measures. The statsmodels output has a few other statistics in the Diagnostic Section:
• The Jarque-Bera test for normality of the residuals. It tests whether the residuals have a skewness and kurtosis matching a normal distribution. The skewness and kurtosis are reported in this section. The statsmodels documentation notes that "this test only works for a large enough number of data samples (>2000) as the test statistic asymptotically has a Chi-squared distribution with 2 degrees of freedom."
• The omnibus test is another test of normality.
• The condition number is a check for multicollinearity. I did not discuss this problem in this book. See Paczkowski (2022) for a discussion. Also see Montgomery et al. (2012, p. 298).
Chapter 6
Sample of Specialized Survey Analyses
Contents
6.1 Conjoint Analysis
    6.1.1 Case Study
    6.1.2 Analysis Steps
    6.1.3 Creating the Design Matrix
    6.1.4 Fielding the Conjoint Study
    6.1.5 Estimating a Conjoint Model
    6.1.6 Attribute Importance Analysis
6.2 Net Promoter Score
6.3 Correspondence Analysis
6.4 Text Analysis
There are many specialized analyses for Core Questions. I will summarize a few in this chapter to highlight the possibilities with Python. These specialized analyses are:
1. Conjoint analysis;
2. Net promoter score analysis;
3. Correspondence analysis; and
4. Text analysis.
These examples only skim the surface of the complexity of analysis for each topic. They are not meant to show all the intricacies of, say, conjoint analysis or net promoter score analysis; each would require a separate volume. The intent is just to illustrate possibilities and how addressing them can be handled.
6.1 Conjoint Analysis In this first example, I will discuss conjoint analysis in the context of two objectives: estimate price elasticities and identify key features for a new product. I will show how to specify a model, create a conjoint design, and analyze and interpret estimation results. I will use a case study throughout this section. For a detailed discussion of this approach, see Paczkowski (2018) for pricing analysis and Paczkowski (2020) for new product development.
6.1.1 Case Study

A watch manufacturer wants to develop and price a new men's fitness watch, but she can only market one watch. There are four features, factors, or attributes that define a watch. These are:
• Price: $149.99, $179.99, $229.99.
• Compatibility: Android, iOS, Windows.
• Measure: Calories, distance, heart rate.
• Rain/splash proof: Yes, no.
There are 54 possible watches based on all combinations of these attribute levels: 54 = 3 × 3 × 3 × 2. The Core Question is: "What is the best watch to sell and at what price?" This is a compound pricing and new product development question, and it forms the Core Question for a survey. A model is needed to answer it. The model is a conjoint model, one member of a family of choice models. Other members of the choice family could be used to answer the Core Question, but conjoint is the simplest to illustrate possibilities. I will only discuss the conjoint member of the choice family. Discussion of the other members of the family would take us too far from this chapter's intent, which is only to highlight special analysis cases.
6.1.2 Analysis Steps

There are several steps you should follow to do a conjoint analysis. These are:
1. Identify the product's attributes and their levels.
2. Create a design matrix of the attributes and their levels.
3. Field the study and collect responses.
4. Create a data matrix for estimation and estimate a model.
5. Analyze the results.
The first step was completed by the watch manufacturer's management and marketing teams. The last step, analyzing the results, involves calculating attribute importances, which are the importances of each attribute, not its levels, in determining the overall preference for a product. For this watch example, another calculation is required: the price elasticities for input into a pricing strategy. See Paczkowski (2018) for a thorough discussion of the use of elasticities in pricing strategies. The middle steps are the heart of conjoint analysis, and this is where I will focus my discussion.
6.1.3 Creating the Design Matrix

The model is for the total preference or total utility for combinations of attributes of a product concept. Total utility is composed of pieces called part-worths, each part-worth measuring the contribution of each attribute's level. The goal is to estimate these part-worth utilities enabling the calculation of total utility for each of the 54 watches. Ordinary least squares (OLS) regression can be used for this estimation. The issue is the matrix of the data. This is called a design matrix. The design matrix is just the X matrix for the OLS regression that will be used for estimation of the part-worth utilities. Unlike the regression modeling examples I used in the previous chapter where the independent variables were part of the Surround Questions (and maybe part of some Core Questions), the design matrix in this problem is strictly a function of the attributes and their levels. When I say it is a function, I mean the design matrix is specified by combinations of −1, 0, and 1 values reflecting the levels of the attributes. Each combination is interpreted as a product (a watch in this case). The combinations are based on experimental design concepts. This coding is the effects coding I discussed previously.

The design matrix is created by first determining its size, that is, the number of rows and columns. The columns are set by the attributes and their levels and the rows by the number of parameters that have to be estimated. The rows are sometimes called "runs" in the experimental design literature. The number of parameters reflects the number of levels for each attribute less one to avoid the dummy variable trap. Table 6.1 shows the number of levels for each attribute for the watches and the number of parameters that have to be estimated. I used the pyDOE2 package, which has a function named gsd (Generalized Subset Designs) for factors at more than two levels, to generate a design matrix with a minimum of eight runs. You can install this package using either pip install pyDOE2 or conda install -c conda-forge pydoe2. Since the watches have 54 combinations, I cannot generate a useful design matrix of 54 runs. That would be too large because, remember, you can interpret a run as a product. So, there are 54 possible products, which is not a large number but is still inconvenient. I decided to use a one-third subset (called a one-third fraction) of 54, or 18 runs. The gsd function parameter is a list of the levels for the attributes and the desired fraction (i.e., one-third). I show the setup in Fig. 6.1.
Table 6.1 This is an attribution of the parameters for the watch case study. The final minimum number of runs is the number of parameters plus one for the constant in a linear model. This is a minimum; you can certainly use more than the amount shown

Attribute              Number of levels   Parameters needed
Compatibility          3                  2
Measure                3                  2
Price                  3                  2
Rain/splash proof      2                  1
Subtotal               11                 7
Constant                                  1
Minimum runs needed                       8
Fig. 6.1 This is the setup for the design matrix generation for 18 runs
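The setup in Fig. 6.1 is not reproduced here, but a minimal sketch of the design-generation call might look like the following. The attribute (column) names and their order are assumptions, not the book's exact code.

```python
# Minimal sketch of the design generation (cf. Fig. 6.1); column names are assumptions.
import pandas as pd
from pyDOE2 import gsd

levels = [3, 3, 3, 2]        # Price, Compatibility, Measure, Rain/splash proof
reduction = 3                # one-third fraction: 54 / 3 = 18 runs
design = gsd(levels, reduction)

df_design = pd.DataFrame(design, columns=['price', 'compat', 'measure', 'rain'])
print(df_design.shape)       # expect (18, 4); values are level codes starting at 0
```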
I converted the design matrix into a DataFrame for convenience as I show in Fig. 6.2. I then recoded the DataFrame's values to match the descriptive words for each attribute's levels and show this in Fig. 6.3. This final DataFrame has 18 rows or 18 runs. Each run is a product that will be shown to a consumer in a questionnaire.
6.1.4 Fielding the Conjoint Study

A conjoint task is shown to a consumer as part of a larger questionnaire. The task consists of asking each consumer to evaluate a product, one product at a time, and then indicate the likelihood of their buying that product. A 0–10 likelihood-to-purchase scale is often used. I show an example in Fig. 6.4.
Fig. 6.2 The design matrix in a DataFrame. The values correspond to the levels for each attribute beginning at 0 since Python is zero-based
Fig. 6.3 The recoded design matrix in a DataFrame
Fig. 6.4 This is an example of a conjoint card presented to a consumer
The presented product is said to be on a "card." This is not an actual card since physical cards are not used anymore; it is just an old expression still widely used. Each consumer is shown a series of cards, one card per run, which equates to one card per product. For the watches case study, there are 18 runs or products; therefore, each consumer sees 18 cards.
6.1.5 Estimating a Conjoint Model The collected response data are imported into a Pandas DataFrame for analysis. The sample size for this case study is 385 consumers. Since each one saw 18 cards, the total size of the DataFrame is 6930 rows (6930 = 385 × 18). The columns are the response as an integer (0–10) and the effects-coded attribute levels which is the design matrix. Surround Question data, such as demographics, are also in the DataFrame. An OLS model was estimated using the natural log of the responses as the dependent variable, the natural log of price, and the effects-coded attributes. The natural logs were used for two reasons. First, the price elasticity is needed for pricing the product. Econometric theory shows that when the natural logs are used in a regression model, the estimated coefficient for the log price term is the elasticity. Refer back to Chap. 5 for this point. Second, in most if not all consumer demand studies, logs are used for sales and price variables because they normalize distributions that might be skewed. Skewness is not an issue here, but I wanted to stay close to accepted approaches. See Paczkowski (2018) for examples of the use of logs in demand studies. The effects coding was done using the C( ) function that I used earlier to handle a categorical variable, that is, to encode it to a series of dummy variables. The same function can be used for effects coding but with one additional argument: a sum indicator that itself has an argument which is the base level for the summation. The base level I used in all cases was the one that made the most analytical sense. The keyword “sum” indicates that the estimated coefficients for the relevant categorical variable sum to zero. The coefficients are the estimated ones plus the
base which is omitted as for dummy encoding. As an example, for “Compatibility,” the effects coding is done using the statement C( compat, Sum( ’Windows’ ) ) where “Windows” is the base level. An OLS model was set up using the same general statements as before. I show the setup for this case study in Fig. 6.5. The regression results are also shown. The results are interpreted just the way the OLS results were earlier interpreted. The only difference is that the model is set because the independent variables are defined by the experimental design that leads to the design matrix. You could, perhaps, add some demographic variables, such as gender or education, to the model to get estimated parameters by different groups. You cannot, however, change the attributes and their levels for two reasons. First, these attributes and levels were defined by management or the client for the study and are integral parts of the proposed product. Second, these are what the respondents saw so their responses are conditioned on these settings. Changing any of the attributes or levels (i.e., dropping a level or attribute) is tantamount to changing the conditions after the fact. The estimated coefficient for the log of price is, as I mentioned above, the price elasticity. Figure 6.5 shows that this is −1.3. The watches are slightly price elastic. This should make intuitive sense since there are many different types of watches in the market which makes the market very competitive. Even though these are fitness watches, the competitive aspect of the market is still there.
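The code behind Fig. 6.5 is not reproduced here; a minimal sketch of an equivalent setup follows. The DataFrame name, the column names, and the base levels other than "Windows" are assumptions.

```python
# Minimal sketch of the estimation behind Fig. 6.5; names other than Sum('Windows')
# are assumptions, not the book's exact code.
import numpy as np
import statsmodels.formula.api as smf

formula = (
    "np.log(response) ~ np.log(price)"
    " + C(compat, Sum('Windows'))"    # effects coding with 'Windows' as the base level
    " + C(measure, Sum('Calories'))"  # assumed base level
    " + C(rain, Sum('No'))"           # assumed base level
)
res = smf.ols(formula, data=df_conjoint).fit()
print(res.summary())
# The coefficient on np.log(price) is the price elasticity (about -1.3 in Fig. 6.5).
```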
6.1.6 Attribute Importance Analysis

A major output of a conjoint analysis is the importance ranking of the attributes. In a new product development context, this helps product and pricing managers focus on attributes that are the key drivers for consumer acceptance of the new product. Not all attributes are or should be expected to be equally important to consumers. Which one is the most important? The importance question is answered by calculating the maximum and minimum total utility for the product based on the estimated coefficients from the OLS estimation. The minimum is determined by setting each attribute to its lowest level, this lowest level being the one with the minimum estimated coefficient. Likewise, the maximum utility is found by setting each attribute to its highest level based on the estimated coefficients. The difference between the two utilities is the range: maximum minus minimum. This implies that the difference equals the sum of the ranges in levels for each attribute. Dividing the attribute ranges by the utility range gives you, in proportion terms (or percents if you further multiply by 100), the contribution to the utility range of each attribute. These proportions are the attribute importances.

I found the importance in two steps. First, as I show in Fig. 6.6, I retrieved the estimated part-worths and placed them in a DataFrame. To retrieve the part-worths, which are the estimated coefficients of the regression model, I used the params method attached to the regression model.
Fig. 6.5 This is the OLS setup and estimation results for the watches case study
Fig. 6.6 The estimated part-worths are retrieved from the regression output and placed into a DataFrame
The constant is not used to calculate importances, so the constant, the first coefficient, is omitted. Then, second, I used the DataFrame to calculate the importance of each attribute as I show in Fig. 6.7. This is a longer code snippet that involves using list comprehensions and the Pandas groupby function. The results are also in the figure. Notice that price is the most important attribute followed by measure, but the difference in the percent importance is very large. This indicates that price is the driving factor, which should be expected because of the elastic demand and the competitive watch market.
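The snippet in Fig. 6.7 is not reproduced here; a hedged sketch of the same calculation follows. The part-worth DataFrame df_pw and its columns are assumptions.

```python
# Hedged sketch of the importance calculation (cf. Fig. 6.7). df_pw is an assumed
# DataFrame with one row per attribute level: columns 'attribute' and 'partworth'.
ranges = df_pw.groupby('attribute')['partworth'].agg(lambda x: x.max() - x.min())
importance = 100 * ranges / ranges.sum()   # percent contribution to the utility range
print(importance.sort_values(ascending=False))
```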
6.2 Net Promoter Score I already covered customer satisfaction surveys in Chap. 5. In that case, I looked at a key driver analysis (KDA) for satisfaction. Many satisfaction questionnaires have, in addition to the overall satisfaction question, an additional question about recommending the product or service. This question is usually on an 11-point likelihood-to-recommend Likert scale. For example, the San Francisco Airport survey included a recommendation question: “On a scale of 0 to 10, how likely is it that you would recommend SFO to a friend or colleague?” The 0 point represents “Not at all Likely” and the 10 represents “Extremely Likely.” This is used to calculate a net promoter score (NPS). The idea behind NPS is that people can be divided into three groups based on their likelihood-to-recommend rating. Those in the top-two box (9 and 10 on the 11-point scale) are the ones who will tell other people about the product or service. For the SFO survey, these would be the people who would tell others to use the San
Fig. 6.7 This is the code snippet and results for calculating the attribute importances from the estimated part-worths
Francisco Airport; they would promote the airport and are classified as "Promoters." Anyone with a 7 or 8 rating would also tell others to use the product or service, but they would not be as strong an advocate as those in the top-two box. These people would be passive promoters and are labeled "Passive." Anyone with a rating less than 7 is considered to be a detractor: they would not recommend the product or service, and certainly those with a rating in the bottom-three box would be strong detractors. The entire group with a rating below 7 is labeled "Detractors." These divisions into three groups are not universally used; you can certainly use whatever divisions you want. These are, however, the most common that I have seen in applications. I show some data in Fig. 6.8. The likelihood-to-recommend variable is labeled "NETPRO" because this is the way it was labeled in the questionnaire. Notice that the ratings are in the data column.
Fig. 6.8 This is a display of the first five records of the SFO likelihood-to-recommend data. The likelihood variable is “NETPRO” which is how it was labeled in the questionnaire
Fig. 6.9 This shows how the SFO likelihood-to-recommend data can be recoded to the three NPS labels
It would be useful to recode these to the three promoter labels and add this as a new variable to the DataFrame. I show in Fig. 6.9 how this recoding can be done using a list comprehension. The NPS is often presented either in a small tabular form or as a simple bar chart. But more can be done with it.
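A minimal sketch of the recoding in Fig. 6.9 follows. The DataFrame name df_sfo is an assumption; NETPRO is the rating column named in the text.

```python
# Hedged sketch of the recoding (cf. Fig. 6.9); the DataFrame name is an assumption.
def nps_label(rating):
    if rating >= 9:
        return 'Promoter'      # top-two box: 9, 10
    elif rating >= 7:
        return 'Passive'       # 7, 8
    return 'Detractor'         # below 7

df_sfo['nps_group'] = [nps_label(r) for r in df_sfo['NETPRO']]

# NPS = % Promoters - % Detractors
shares = df_sfo['nps_group'].value_counts(normalize=True) * 100
nps = shares.get('Promoter', 0) - shares.get('Detractor', 0)
print(round(nps, 1))
```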
As an example, the SFO questionnaire had Surround Questions on the respondents' demographics. One is, of course, gender. So, a question might be: "Does the NPS differ by gender?" This can be checked using the cross-tab analyses I discussed in prior chapters or by estimating a logit model and calculating the odds of someone being a promoter given they are male vs being a promoter given they are female. I will not show either of these because the basic methodology and modeling setup have already been shown. There is, however, another form of analysis that can be done that I did not show before. This is a decision tree analysis.

A decision tree is a way to determine (i.e., decide on) or model the key drivers for a dependent variable that is a Core Question. It is called a decision tree because the output is in the form of a tree, albeit inverted, but a tree diagram nonetheless. As an inverted tree, the root is at the top, and the branches flow downward rather than upward as with a real tree. It is a model because it allows you to determine what is important for explaining the dependent variable. This approach differs from, but is related to, the regression approach I discussed before. The two are related by how the regression is done. OLS explains the dependent variable by fitting the best straight line to the data where the line has an intercept and slope. That line is the one that satisfies an optimality criterion, which happens to be the minimum of the sum of squares of the residuals. The decision tree explains the dependent variable by fitting a series of constants to the independent variable data. The constants divide the space of the independent variables into regions. The best tree is the one that divides the space in the most optimal manner, optimal based on some criteria. The criteria are more complicated than a sum of squares, but there are criteria nonetheless. A decision tree is a way to do regressions. See Beck (2008) for an interesting discussion of this perspective. Recall that a regression could have a continuous dependent variable or a discrete one. In the former case, you use OLS; in the latter, you use logistic regression. But you still have a regression, just two members of the same family, cousins, if you wish. A decision tree is another member of the regression family, but unlike the other two approaches, this one handles either a continuous or discrete dependent variable, with some modifications, of course. When the dependent variable is continuous, the tree is called a regression tree; when discrete, it is called a categorical tree. See Paczkowski (2022) for some detail about decision trees.

For this approach, you have to identify some features as the most likely to account for the net promoter scores. Suppose you limit these to the respondents' gender, how safe they feel at the airport, how long they have been using the airport for their travels, and whether they went through the TSA pre-check security line. These scales have to be recoded and then label encoded because the decision tree algorithm does not handle strings, only numbers. The length-of-use question has five levels:
1. Less than 1 year
2. 1–5 years
3. 6–10 years
4. 10+ years
5. Blank/multiple responses
I recoded this to three levels: 5 years or less, more than 5 years, and missing if blank or multiple responses. The safety question is a five-point Likert scale question with "5" = "Extremely Safe"; there is also a 0 for missing response. I recoded this question to top-two box, bottom-three box, and missing. Finally, the TSA question has five levels:
1. Yes
2. No
3. Don't know
4. Did not go through security at SFO
5. Blank/multiple responses
I recoded this to three levels: Yes, no, and missing. After the recoding, the tree is fit as before in Chap. 5. I show all this in Fig. 6.10 and then the final grown tree in Fig. 6.11. See Paczkowski (2022) for a discussion on interpreting decision trees.

You can go one step further. There are typically two measures in a customer satisfaction study: overall satisfaction and likelihood to recommend. The NPS is based on the second. A good question is: "What is the relationship between satisfaction and promotion?" You can explore this question using the airport satisfaction study.
Fig. 6.10 This is the setup for the NPS decision tree
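The setup in Fig. 6.10 is not reproduced; a sketch of a comparable scikit-learn fit follows. The recoded feature names, the label encoding, and the tree settings (e.g., the maximum depth) are assumptions, not the book's exact code.

```python
# Hedged sketch of the NPS decision tree (cf. Figs. 6.10-6.11); names are assumptions.
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree

features = ['gender', 'safety_rec', 'years_rec', 'tsa_rec']
X = df_sfo[features].apply(LabelEncoder().fit_transform)   # trees need numbers, not strings
y = df_sfo['nps_group']

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)
plot_tree(tree, feature_names=features, class_names=sorted(y.unique()), filled=True)
```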
Fig. 6.11 This is the NPS decision tree
Fig. 6.12 This is the code snippet to import the two variables: satisfaction and likelihood to recommend
First, I imported the satisfaction and likelihood-to-recommend scores and deleted all records with a missing value. I also recoded both scores to strings to make interpretation easier. I show both steps in Figs. 6.12 and 6.13, respectively. I then created a cross-tab and ran a McNemar test for marginal homogeneity. The Null Hypothesis is that the two marginal probabilities are the same, that is,

$$H_O: p_b = p_c$$
$$H_A: p_b \neq p_c$$
Fig. 6.13 This is the code snippet to recode the two variables: satisfaction and likelihood to recommend. Missing values are deleted
Fig. 6.14 This is the code snippet to test for marginal homogeneity for satisfaction vs promoter
as I discussed in Chap. 4. I show the test results in Fig. 6.14. You can see that the Null Hypothesis is rejected, so there is a difference between the two scores. In addition, Cramer's phi statistic indicates moderate association between the two variables. Notice, based on what I mentioned in Chap. 4 that $\phi = \sqrt{\chi^2/n}$, that $\phi = \sqrt{269/2625}$.
Fig. 6.15 This is a Venn diagram of those respondents who are satisfied and promoters. The numbers agree with those in the cross-tab in Fig. 6.14. The circles are very close because the intersection is large
You can also create a Venn diagram of those who are T2B satisfied and those who are T3B promoters. The diagram is in Fig. 6.15. I created two lists of the index numbers using a list comprehension with the enumerate function. This function returns the objects in a list along with their index numbers. The two lists are one for those satisfied and one for promoters. The diagram is based on the matplotlib_venn package which you can install using pip install matplotlib-venn or conda install -c conda-forge matplotlib-venn.
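A minimal sketch of the Venn diagram step follows. The DataFrame, the column names, and the exact T2B/T3B cutoffs are assumptions, not the book's code.

```python
# Hedged sketch of the Venn diagram (cf. Fig. 6.15); names and cutoffs are assumptions.
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

satisfied = [i for i, x in enumerate(df_sat['satisfaction']) if x >= 4]  # top-two box
promoters = [i for i, x in enumerate(df_sat['NETPRO']) if x >= 8]        # top-three box

venn2([set(satisfied), set(promoters)], set_labels=('Satisfied (T2B)', 'Promoters (T3B)'))
plt.show()
```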
6.3 Correspondence Analysis I discussed cross-tabs in Chap. 3 as a major tool for survey analysis. I then slightly expanded the table concept in Chap. 4 with pivot tables which provide more flexibility allowing you to create different arrangements of a table. The pivot table allows you to view your tabular data in different ways, perhaps allowing you to see new patterns and relationships. The problem with tables is their size. The smallest practical table, a 2 × 2, can be informative just by visual inspection. Statistical tests such as the McNemar test can be used to test for relationships. Its small size allows you to just look at it to see relationships. Some tables, however, are large which makes it more challenging, if not impossible to literally see relationships. Statistical tests help, but they are limited since the average is over all the data. A visual
display, preferably in two dimensions, would handle this problem. The question is how to create the display. The answer is a methodology called correspondence analysis (CA). See Greenacre (2007), Greenacre (1984), and Jobson (1992) for discussions about CA. Correspondence analysis is based on the singular value decomposition (SVD) of a cross-tab. See Paczkowski (2020) and Jobson (1992) for discussions of SVD. This produces three matrices that, when multiplied, yield the original cross-tab. The three matrices are:
1. Left: for the rows of the cross-tab;
2. Right: for the columns of the cross-tab; and
3. Middle: the singular values that provide information about the variance in the cross-tab.

The left and right matrices give plotting coordinates when combined with the singular values. The singular values are in ranked descending order and show the importance of the dimensions: small values mean a dimension can be ignored; usually, the first two are used. Note that:
1. Squared singular values are the inertias of the cross-tab; and
2. The sum of the inertias is the total inertia, or variance, of the cross-tab.

This total inertia is $\chi^2/n$, so there is a connection between correspondence analysis and the chi-square analysis of the cross-tab. See Greenacre (2007) and Greenacre (1984) for this connection.

The yogurt study had a Core Question regarding the brand last purchased. The final DataFrame has a column named Segment which is based on a prior segmentation study that classified people by four marketing segments: Dessert Lover, Health Fanatic, Health-Conscious, and Normal Food Consumer. These segment labels were attached to each survey respondent. Although not a Surround Question, the segment designation nonetheless serves the same function; it provides extra classification information. I show a cross-tab of brand by segment in Fig. 6.16. A correspondence analysis function in the prince package was used. Prince is described as "a library for doing factor analysis. This includes a variety of methods including principal component analysis (PCA) and correspondence analysis (CA)."1 This can be installed using pip install prince or conda install -c bioconda prince. The function in prince is CA, which is instantiated with the number of components to extract, the number of iterations to find a solution, and a random seed. Once instantiated, you can fit a cross-tab and plot the results from the fit.
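A minimal sketch of that sequence follows. The cross-tab name is an assumption, and the method names follow the prince releases current when the book was written; newer releases use slightly different plotting calls.

```python
# Hedged sketch of the correspondence analysis (cf. Figs. 6.16-6.17); ct is an assumed
# brand-by-segment cross-tab (a DataFrame of counts).
import prince

ca = prince.CA(n_components=2, n_iter=10, random_state=42)
ca = ca.fit(ct)
ax = ca.plot_coordinates(ct, show_row_labels=True, show_col_labels=True)
```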
1 See https://pypi.org/project/prince/. Last accessed January 30, 2021.
Fig. 6.16 This is the cross-tab of brand purchased by segment assignment. It is the foundation for the correspondence analysis
I show the cross-tab in Fig. 6.16 and the plot in Fig. 6.17. The plot actually consists of two plots, one on top of the other. This is called a biplot and is often referred to as a map. The first plot is a scatterplot of the rows of the cross-tab. The plotting coordinates are in the columns of the left matrix produced by the SVD.2 The second plot is a scatterplot of the columns of the cross-tab based on the columns of the right matrix from the SVD. The columns are referred to as dimensions. The maximum number of dimensions is the minimum of the number of rows of the table less one and the number of columns of the table less one, that is, min(r − 1, c − 1) where r is the number of rows and c is the number of columns. See Clausen (1998). Not all the dimensions are needed. Typically, just the first two are used. Each one accounts for a proportion of the total variance of the table. The cumulative inertia tells you the amount. As a rule of thumb, retain those dimensions that account for more than 70% of the variance. See Higgs (1991) and Sourial et al. (2010). The axes of the plot are the dimensions formed by the plotting coordinates. The amount of the variance explained is indicated for each axis (each dimension). From Fig. 6.17, you can see that the first dimension, which is the abscissa, accounts for 66.7% of the variance of the cross-tab. The second dimension, which is the ordinate, accounts for 25.6%. Together, these two dimensions account for 92.3% of all the variance. I show a summary table in Fig. 6.18. The axes have labels denoting the dimension used on that axis. These have to be interpreted since they are derived quantities without any meaning otherwise. You can see in the map that there are two extreme points in the vertical dimension that are polar opposites: Dessert Lovers and Health Fanatics. A reasonable interpretation of the vertical axis is a contrast in views of foods and perhaps in diet. The horizontal axis also has two extremes but for the brands: E and G on one end and D on the other.
2 There is actually more to calculating the coordinates, but this brief description will suffice.
[Figure: "Correspondence Map Yogurt Study," a biplot of the brands and segments plotted on Component 0 (66.69% inertia) and Component 1 (25.61% inertia).]
Fig. 6.17 This is the correspondence map of brand purchased by segment assignment for the yogurt study
Not knowing anything about these three brands, the best that could be said is that the horizontal dimension is a contrast between what these brands represent. You should also notice that the client's brand, the sponsor of this (albeit fictional) study, is close to (in the space of) health-conscious consumers, either the fanatics or just those who are watching what they eat for their health.
Fig. 6.18 This is a summary of the brand purchased by segment assignment correspondence analysis
Also notice that the brand considered to be the major competitor to the client's brand is more closely associated with normal food consumers, while the client's brand is more health oriented. The marketing and strategic question is: "Why is the major competitive brand viewed as the major competitor since they are not in the same marketing space?"
6.4 Text Analysis

A common feature in almost all questionnaires is a verbatim response. This is in the form of an "Others: Please Specify" option to a question for which the questionnaire writer lists all possible response options he/she can think of, but leaves one open, the verbatim, to capture any option not listed. Sometimes, rather than being the ubiquitous "Others: Please Specify" catch-all, a verbatim question may be added to elicit clarifying responses to another question. It may not be sufficient to know that someone is satisfied or not based on a Likert scale question, but why they are satisfied or not may be more insightful.

As an example, the Toronto City Council was in the process of considering whether or not to allow a new casino in Toronto and where it should be located. Starting in November 2012, it conducted many and varied consultations with the public and key stakeholders in the city.
Fig. 6.19 These are the first five records of the Toronto casino data. Only the data for the Likert scale and verbatim question are used
One of the methods of public consultation was a "Casino Feedback Form," a survey, distributed online and in person. The Council collected 17,780 responses.3 I will use two questions for my example:
1. Please indicate . . . how you feel about having a new casino in Toronto (5-point Likert scale).
2. What are your main reasons for this rating? (verbatim response)

The final sample size is n = 17,766. Fourteen records were deleted for unknown reasons. I show the first five records for the data in Fig. 6.19 and a missing value report in Fig. 6.20. Notice that the verbatim question, named Reason, is missing a lot of data. The first issue is how to handle the missing data. The easiest approach is to simply delete those records with missing values. The Pandas dropna() method with the inplace = True argument is used. This is written as df_casino.dropna( inplace = True ) where df_casino is the name of the DataFrame.

The first task you must complete before any analysis is done with text data is to clean your data. This involves removing any leading white spaces and punctuation marks. You can do this in two steps. First, use the Pandas str accessor and its methods. An accessor is a method to access the contents of a DataFrame to operate on an attribute of the DataFrame. There are four accessors:
1. dt for accessing and operating on dates and times;
2. cat for accessing and operating on categorical data;
3. str for accessing and operating on string (i.e., object) data; and
4. sparse for accessing and operating on sparse data.
3 See https://www.r-bloggers.com/do-torontonians-want-a-new-casino-survey-analysis-part1/. The data and questionnaire can be found at https://www.toronto.ca/city-government/dataresearch-maps/open-data/open-data-catalogue/#16257dc8-9f8d-5ad2-4116-49a0832287ef.
Fig. 6.20 This is a missing value report for the Toronto casino data
The accessor str has a method strip that can be called to remove leading and trailing white spaces. If a DataFrame has an attribute (i.e., a column) named myString which is a string (i.e., object in Pandas terminology), then its features can be operated on by str. These are chained together as one call: df.myString.str.strip(). The accessor str has the same string manipulation methods as the Python str function. A short list of useful methods is:
1. strip,
2. upper,
3. lower,
4. replace, and
5. find.
Punctuation marks are removed using a regular expression. Regular expressions are very powerful text manipulation methods, but the language, which is small, is arcane and difficult to read and write. They are based on a series of metacharacters that form a pattern to be matched in a string. There are only a few metacharacters you will use in your work:
1. . (a single dot or period): matches any character;
2. * (asterisk): matches 0 or more occurrences of the preceding character;
3. + (plus): matches 1 or more occurrences of the preceding character;
4. ? (question mark): matches 0 or 1 occurrence of the preceding character;
5. \d matches all digits: [0-9] (note the backslash);
6. \w matches any alphanumeric character and the underscore: [a-zA-Z0-9_] (note the backslash);
7. \s matches a white space: [ \t\n\r\f\v] (note the backslash), where:
   • \t is a horizontal tab character,
   • \n is a line feed character,
   • \r is a carriage return character,
   • \f is a form feed character, and
   • \v is a vertical tab character;
8. \W matches any non-word character (note the backslash); and
9. \b is a word boundary (note the backslash). Example: \ba matches a if it is at a boundary: all is a match; ball is not a match. The actual boundary rule is complex and a little confusing.

There is also a collection notation and two positional metacharacters. A collection is indicated by a set of square brackets, [ ], which means to include the content. For example, [a-zA-Z0-9] means to include lowercase a-z, uppercase A-Z, and any digit 0-9. The positional metacharacters are:
1. ^ (caret/circumflex/hat): look at the beginning of a string, BUT inside [ ] it says to negate the contents; and
2. $: look at the end of a string.

Regular expression capabilities are made available to you when you import the Python re package: use import re. An alias is not necessary since the package name is so short. The punctuation marks can be removed by applying a lambda function to each record of the cleaned data. The lambda function contains a regular expression that locates the negation of all alphanumeric characters and white spaces. This negation leaves just the punctuation marks. When a match occurs, they are replaced by nothing, so they are effectively removed. I show the code cleaning all the data in Figs. 6.21 and 6.22.

A simple analysis involves just determining the length of each verbatim response, that is, the number of characters comprising the verbatim. The characters are all alphanumeric characters as well as white spaces. The length can be found using the len function applied to each record and then adding that length as a new variable to the DataFrame. I show the code for this in Fig. 6.23.
Fig. 6.21 This is the code to remove white space using the str accessor
Fig. 6.22 This is the code to remove punctuation marks
Fig. 6.23 This is the code to calculate the length of each verbatim response
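The code in Figs. 6.21-6.23 is not reproduced here; a minimal sketch of the same cleaning and length steps follows. The DataFrame (df_casino) and the Reason column are named in the text; the new length column name is an assumption.

```python
# Hedged sketch of the cleaning and length steps (cf. Figs. 6.21-6.23).
import re

# remove leading and trailing white space with the str accessor
df_casino['Reason'] = df_casino['Reason'].str.strip()

# remove punctuation: negate alphanumerics and white space, replace matches with nothing
df_casino['Reason'] = df_casino['Reason'].apply(lambda s: re.sub(r'[^\w\s]', '', s))

# length of each verbatim response, added as a new variable (assumed column name)
df_casino['length'] = df_casino['Reason'].apply(len)
```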
You can now explore the lengths of the verbatim responses. You might conjecture that people who are against the casinos might write long responses, and in fact, the more they are against the casinos, the longer their response. A first graphical tool is, of course, the histogram. I show this for the length variable in Fig. 6.24. Notice that the histogram is skewed right, suggesting support for the conjecture. You could follow up on this observation with a skewness test, but the visual evidence is clear. However, the histogram does not tell you how these verbatim lengths vary by the Likert scale ratings. Are the lengths longer for those who are more against the casinos than for those who are not, which is what the conjecture implies? This can be checked using boxplots of the lengths by the rating scale. I show this in Fig. 6.25. This display is more informative because it shows how the lengths increase as people become more against the casino. If you look closely at the distribution pattern, you will see that the pattern is a slight negative exponential as the ratings move from Strongly in Favor to Strongly Oppose, indicating that those who are opposed are not just strongly opposed but perhaps vehemently opposed, judging by the lengths of their comments.
Fig. 6.24 This is the histogram of the length of each verbatim response. Notice the right skewness
Fig. 6.25 These are the boxplots of the length of each verbatim response by the Likert scale ratings
Fig. 6.26 This is the word cloud for the verbatim responses
A visual that is more infographics than scientific data visualization is the word cloud. A word cloud shows the keywords in some container of words, the importance of the words indicated by their size relative to all other words. The basis for the word cloud is a data matrix, the data container, called the document-term matrix (DTM). This is a matrix with documents for the rows and the words or terms or tokens (all three used interchangeably) as the columns. The documents are the verbatim responses, and the terms are the words extracted from each document. Not all words in a verbatim can or should be used. Some words, such as "the," "and," and "a," are meaningless and can be deleted. These are stop words. Also, abbreviations, contractions, foreign words, and phrases need to be handled, sometimes also being deleted. For the final culled-down list of words, the count of each word's occurrences in each document is recorded in the DTM. The frequencies are then weighted to reflect the importance of each word in each document. The weights are called the term frequency-inverse document frequencies (tf-idf). This final weighted DTM is used to create a word cloud. I show a word cloud for the casino survey in Fig. 6.26. For more information about DTMs and tf-idf, see Paczkowski (2020, Chapter 2) and Sarkar (2016).
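A hedged sketch of a tf-idf weighted DTM and word cloud pipeline follows; this is a generic illustration, not the book's exact code.

```python
# Hedged sketch of a weighted DTM and word cloud (cf. Fig. 6.26).
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud

vec = TfidfVectorizer(stop_words='english')        # drops common stop words
dtm = vec.fit_transform(df_casino['Reason'])       # documents x terms, tf-idf weights

# aggregate each term's weight across documents
# (use vec.get_feature_names() on older scikit-learn releases)
weights = dict(zip(vec.get_feature_names_out(), dtm.sum(axis=0).A1))
wc = WordCloud(background_color='white').generate_from_frequencies(weights)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
```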
Although I advocate word clouds for infographics and management/client presentations, not for scientific data analysis, they can, nonetheless, be informative. The word cloud in Fig. 6.26 indicates that there is a concern about jobs, addictions, social, people, and money. Other words such as crime, traffic, and problems also stand out. An interpretation might be that Toronto residents are concerned about the possible negative impacts of casinos, the negative impacts being effects on jobs, crime, and addictions, probably gambling addictions since casinos are gambling businesses.

Other forms of text analysis are possible:
• Hierarchical clustering of documents to group documents for drill down on who wrote them; and
• Latent topic analysis to identify topics. The topics are basically latent messages that can be extracted.

For both of these forms of analysis, see Paczkowski (2020, Chapter 2) and Sarkar (2016).
Chapter 7
Complex Surveys
Contents

7.1 Complex Sample Survey Estimation Effects  239
7.2 Sample Size Calculation  240
7.3 Parameter Estimation  241
7.4 Tabulation  244
    7.4.1 Tabulation  245
    7.4.2 CrossTabulation  245
7.5 Hypothesis Testing  246
    7.5.1 One-Sample Test: Hypothesized Mean  247
    7.5.2 Two-Sample Test: Independence Case  248
    7.5.3 Two-Sample Test: Paired Case  248
The surveys I considered until now have been "simple" sample surveys. This is not to say that the sampling is trivial or unimportant. It is to say that their design is uncomplicated and easily developed. Simple sample surveys are based on simple random sampling (SRS). Recall that random sampling could be with or without replacement. The former refers to placing a sampled unit back into the population. In essence, the population becomes infinitely large because it is never depleted. Without replacement means that the population size always gets smaller with each sampled unit. So, the probability of selecting any unit changes as units are sampled. Simple sample surveys also include stratified random sampling and cluster sampling. See any textbook treatment of sampling, such as Cochrane (1963), for discussions about these different forms of sampling. A class of surveys I have not considered is "complex" sample surveys. What is a complex sample survey? Chaudhuri and Stenger (2005, p. 250) define it as any survey based on sampling that is not simple random sampling with replacement from an unstratified population. These methods use combinations of unequal probabilities, stratification, and clustering. This is the case for many large surveys involving many diverse groups. Small samples, targeting a specific audience, are usually SRS without replacement samples.
Recall that defining the target audience requires specifying who will be asked to participate in the survey and calculating how many people you will need. For the surveys I considered so far, there is just one group of people. That group could be divided into subgroups (e.g., by gender), but no special allowance is made for them except to say that you need so many people from a subgroup. Almost all market research surveys and opinion polls are like this. Simple random sampling or stratified random sampling is used to collect the needed respondents.

The problem becomes complex by the way these respondents are collected. There are stages to the data collection, whereas with the simple design, there are no stages beyond just the one requiring data collection. As an example contrasting simple and complex surveys, a jewelry study for an upcoming Valentine's Day event may focus on collecting data mostly from men since they are the main jewelry buyers for this occasion. So the target audience could be 75% men and 25% women. Nothing is needed beyond the specification that these proportions sum to 100%. However, a national study of the health benefits of mask wearing during the COVID-19 pandemic may involve sampling people in clusters and then sampling within those clusters. What is a cluster? It is a group of objects that, at one level, can be considered as a whole. An apartment building is a cluster example. One design strategy is needed for the cluster sampling part of the study, and a second strategy is needed for the second part, which is data collection within the cluster. The units in the first part are primary sampling units (PSUs), and those in the second are secondary sampling units (SSUs). The problem could be even more complicated if stratification is used with the secondary units where the samples are collected based on proportional allocation or probability sampling. With a simple survey, these features of data collection typically do not enter the design at all (except to specify that a certain proportion must be male, female, young, educated, and so forth), but they do enter the design in a very important and complicated fashion with complex sampling. It is the inclusion of these extra features that makes a sampling design complex, and this complexity must be accounted for when analyzing the results.

I will not delve into all aspects of complex sample survey designs. That would be, to abuse the word, too complex for the scope of this book. See Cochrane (1963), Kish (1965), Lohr (2009), and Lumley (2010) for detailed discussions of complex sample surveys. In what follows, I will discuss only five aspects of complex sample survey design:
1. estimation effects;
2. sample size calculation;
3. parameter estimation;
4. tabulations; and
5. hypothesis tests.
I will illustrate how to implement each of these using the samplics package. You can install it using pip install samplics.1 Once installed, it is imported using import samplics as smpl. I will also use the VA survey for examples.
7.1 Complex Sample Survey Estimation Effects

Survey weights complicate parameter estimation in complex sample surveys. Recall that these parameters are the total, mean, proportion, and ratio. The weights, as I previously noted, account for unequal selection probabilities, non-response, and post-stratification. Their inclusion, however, makes some estimates nonlinear. For example, recall from Chap. 1 that you can estimate the total for a segment as

$$\hat{Y}_k = \sum_{i=1}^{n} w_i \times Y_i \times I_k(i) \tag{7.1}$$

where $n$ is the sample size, $w_i$ is the weight for object $i$, and $I_k(i)$ is the indicator variable indicating inclusion of the $i$th object in segment $k$. Assume there are $K$ segments. The estimate for the total over all segments is

$$\hat{Y} = \sum_{k=1}^{K} \sum_{i=1}^{n} w_i \times Y_i \times I_k(i) \tag{7.2}$$

This is a linear estimator because it is a linear combination of the weights, $w_i$, and the data, $Y_i$. Now consider the mean. The estimator is

$$\hat{\bar{Y}}_k = \frac{\sum_{i=1}^{n} w_i \times Y_i \times I_k(i)}{\sum_{i=1}^{n} w_i \times I_k(i)} \tag{7.3}$$

and the grand mean is

$$\hat{\bar{Y}} = \frac{\sum_{k=1}^{K} \sum_{i=1}^{n} w_i \times Y_i \times I_k(i)}{\sum_{k=1}^{K} \sum_{i=1}^{n} w_i \times I_k(i)} \tag{7.4}$$

These are nonlinear estimators since they are ratios of two random variables and are not linear combinations of the data. See Williams (2008) for this observation.
1 The version I will use is 0.3.2, a beta version, as of February 26, 2021.
Linear estimators are available, but nonlinear ones are not. Therefore, methods are needed to calculate the estimates, primarily the variances, which are highly nonlinear. In general, incorrect estimation of the variances, and therefore the standard errors, complicates hypothesis testing. In particular, complex sample surveys have larger standard errors than SRS without replacement. The implication is that if standard errors are based on SRS formulas, then they will be too small, the test statistics will be too large, and the p-values will be too small. Consequently, the Null Hypothesis will be rejected more often than it should be. There are several methods for estimating variances for complex sample surveys. These include:
1. Replicate methods;
2. Balanced repeated replication (BRR) methods;
3. Jackknife or jackknife repeated replication methods;
4. Bootstrapping; and
5. Taylor series linearization (TSL).
The TSL is most commonly used. It works by using the Taylor series expansion to approximate “the estimator of the population parameter of interest by a linear function of the observations. See the Appendix for Chap. 5. These approximations rely on the validity of Taylor series or binomial series expansions. An estimator of the variance of the approximation is then used as an estimator of the variance of the estimator itself. Generally, first-order approximations are used (and are adequate), but second- and higher-order approximations are possible.” See Pedlow (2008, p. 944), Demnati and Rao (2007) for some details. The examples below will be based on the TSL.
7.2 Sample Size Calculation

I previously discussed and illustrated a simple way to calculate sample size in Chap. 2. This was for a simple survey of yogurt consumers. The major requirement is that the margin of error be a "plus/minus" figure, for example, ±3% on either side of the estimated quantity of interest. This is still the case. Let us suppose you need the sample size for a simple random sample with a margin of error of ±3% for a proportion, where the target proportion is 50%. I show the setup in Fig. 7.1, and a sketch of the call follows the list below. There are four steps:
1. parameter specification;
2. instantiation;
3. calculation; and
4. answer display.
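A hedged sketch of these four steps is below. The class and argument names follow the samplics documentation for the 0.3.x release cited earlier and may differ in later releases.

```python
# Hedged sketch of the calculation in Fig. 7.1; argument names per the samplics docs.
from samplics.sampling import SampleSize

size = SampleSize(parameter='proportion', method='wald')   # specify and instantiate
size.calculate(target=0.50, half_ci=0.03)                  # target 50%, +/-3% margin
print(size.samp_size)                                      # display the answer
```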
A more complicated calculation involves strata. In this case, I will use the seven military branches as the strata.
Fig. 7.1 This shows how to calculate the sample size for a proportion using the samplics package
The target or expected proportion of vets for each branch was obtained from the Department of Veterans Affairs records and is shown in Fig. 7.2.2 The sample size function is instantiated as before, but this time, a method is specified for the calculations: "Wald." The calculated sample size for each stratum is determined and printed. The alpha level is 0.05 by default.
7.3 Parameter Estimation

I noted in Chap. 1 that there are several quantities, or parameters, for the population that you can estimate from the sample data. These are:
• totals;
• means;
• proportions; and
• ratios.
I illustrated means and proportions several times for simple sample designs. The calculations were straightforward using basic statistics concepts. The calculations of the associated standard errors were also shown when necessary for hypothesis testing. When the sample design is complex, the calculation of the standard errors is more challenging because the probabilistic structure of the sampling hierarchy must be considered. There are several sophisticated techniques available for variance estimation, as I noted above. One is Taylor series linearization (TSL).
2 These numbers are the proportion of vets in the branches in 2018.
Fig. 7.2 This shows how to calculate the sample size for a proportion for a stratified random sample. In this example, the military branches are the strata
I had previously calculated the age of the vets in the VA survey. This was based on their year of birth, which was asked for in the questionnaire. For this example, I categorized the ages into four groups: 20–40, 41–60, 61–80, and 80+. These age groups will be used later. I also recoded the military branches since they are string objects in the DataFrame and objects cannot be used in estimations. I chose to use a list comprehension for this recoding; a LabelEncoder from the sklearn package could have been used just as well. I show the recoding in Fig. 7.3. I first wanted to calculate the mean age regardless of military branch. I also wanted a weighted mean, so I used the weights provided as part of the sample data. I show the setup and the calculation in Fig. 7.4. The mean age, its standard error, the 95% confidence intervals around the mean, and the coefficient of variation (CV) are reported. The CV, recall, is a measure of variation adjusting for the mean: $CV = \sigma_{\mu}/\mu$, where $\sigma_{\mu}$ is the standard error of the mean. Notice from the output that CV = 0.233735/61.104045 = 0.003825 as reported. Also notice that only one stratum is used, which is the whole data set. I repeat this example in Fig. 7.5 but allow for the branches as strata.
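A hedged sketch of the two estimates in Figs. 7.4 and 7.5 follows. The column names are assumptions, and the API follows the samplics documentation for the release cited earlier.

```python
# Hedged sketch of the mean-age estimates (cf. Figs. 7.4-7.5); column names are assumptions.
from samplics.estimation import TaylorEstimator

mean_age = TaylorEstimator('mean')
mean_age.estimate(y=df_va['age'], samp_weight=df_va['weight'], remove_nan=True)
print(mean_age)

# the same estimate allowing for the military branches as strata
mean_age_strat = TaylorEstimator('mean')
mean_age_strat.estimate(y=df_va['age'], samp_weight=df_va['weight'],
                        stratum=df_va['branch_code'], remove_nan=True)
print(mean_age_strat)
```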
Fig. 7.3 This shows how to recode some of the VA data
Fig. 7.4 This shows how to calculate the mean age of the vets in the VA study
Fig. 7.5 This shows how to calculate the mean age of the vets in the VA study allowing for the branches as strata
Notice that the sample means in the two figures are the same (they should be since there is only one sample) but the standard errors differ. The one for the stratified allowance is smaller, indicating that stratification gives you more precise estimates. You might be puzzled how this agrees with my comment above that complex sample survey standard errors are larger. That was the case against SRS without replacement; SRS is not relevant here.
7.4 Tabulation There are two types of tabulations. The first corresponds to the value_counts DataFrame method which returns the frequencies of each category of a categorical variable. A normalizing option allows you to get proportions. The samplics package has a Tabulation class which returns similar output but which also includes standard errors and 95% bounds. Another class, CrossTabulation, returns a cross-tab but not in a rectangular array; it displays a listing that includes the standard errors and 95% bounds.
Fig. 7.6 This shows a simple tabulation of a categorical variable
7.4.1 Tabulation I show how to do a simple tabulation in Fig. 7.6. Notice that the data are in list format. I show another example in Fig. 7.7 for proportions, but this example uses the sampling weights and strata.
7.4.2 CrossTabulation You can create a cross-tab but the results will be in a simple list format. The layout is comparable to the Tabulate layout. I show how you can do this in Fig. 7.8. The sample weights are used as well as the strata. The strata, however, are numerically coded as I described above. Notice that the chi-square statistics are also reported. These are the chi-square statistics for independence I described in Chap. 4. In this example, the p-values are both less than 0.05 indicating that the Null Hypothesis of independence must be rejected.
Fig. 7.7 This shows a simple tabulation of a categorical variable for proportions. Notice the use of sampling weights and the strata
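A hedged sketch of the tabulations in Figs. 7.6-7.8 follows. The class and argument names follow the samplics documentation, and the column names are assumptions.

```python
# Hedged sketch of the tabulations (cf. Figs. 7.6-7.8); column names are assumptions.
from samplics.categorical import Tabulation, CrossTabulation

tab = Tabulation('proportion')
tab.tabulate(vars=df_va['age_group'], samp_weight=df_va['weight'],
             stratum=df_va['branch_code'], remove_nan=True)
print(tab)

crosstab = CrossTabulation('proportion')
crosstab.tabulate(vars=df_va[['age_group', 'gender']], samp_weight=df_va['weight'],
                  stratum=df_va['branch_code'], remove_nan=True)
print(crosstab)
```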
7.5 Hypothesis Testing

I reviewed hypothesis testing in Chap. 4. The samplics package has a t-test function for doing a one-sample t-test against a hypothesized mean and another for two-sample t-tests, both independent and paired. They all take a sampling weight and a stratifying variable. These functions require, of course, a quantitative measure since means are calculated. They also require that there be no missing values. Recall from Chap. 4 that some software packages print a single p-value and others print three. The single p-value is for the Alternative Hypothesis $H_A: \mu \neq \mu_0$. The other cases are $H_A: \mu < \mu_0$ and $H_A: \mu > \mu_0$. The samplics package prints all three. It also prints results for two cases:
1. Equal variances.
2. Unequal variances.
The standard errors are different.
Fig. 7.8 This shows a simple cross-tabulation of two categorical variables
7.5.1 One-Sample Test: Hypothesized Mean

I looked at the vets' age for this example. I hypothesized that the mean age of a vet is 60 years old. I know from prior analysis that there are missing values, so I created a temporary DataFrame which has the missing values deleted. Doing this ensures that the corresponding sampling weight and service branch records are also dropped, even though they themselves may not have missing values. I show the results of the t-test in Fig. 7.9.
Fig. 7.9 This shows a one-sample t-test with a hypothesized population mean
7.5.2 Two-Sample Test: Independence Case

For this example, I compared the mean age of vets by gender. It is natural to assume that the two genders are independent populations. The results are in Fig. 7.10. Since the two populations are independent, a "paired" parameter must be set to False. This same function setup can be used for two matched samples; just use "paired = True."
7.5.3 Two-Sample Test: Paired Case The Two-Sample Test: Independent Case setup can be used for two matched samples; just use “paired = True” and specify the two variables. The two variables are in a list such as “y = [ y1, y2 ].” Since there are two groups, you could also give them meaningful names using the varnames parameter with the names in a list.
Fig. 7.10 This shows a two-sample t-test for the independence case. Notice that the parameter “paired” is set to False. Use “paired = True” for matched or paired samples
Chapter 8
Bayesian Survey Analysis: Introduction
Contents
8.1 Frequentist vs Bayesian Statistical Approaches
8.2 Digression on Bayes' Rule
  8.2.1 Bayes' Rule Derivation
  8.2.2 Bayes' Rule Reexpressions
  8.2.3 The Prior Distribution
  8.2.4 The Likelihood Function
  8.2.5 The Marginal Probability Function
  8.2.6 The Posterior Distribution
  8.2.7 Hyperparameters of the Distributions
8.3 Computational Method: MCMC
  8.3.1 Digression on Markov Chain Monte Carlo Simulation
  8.3.2 Sampling from a Markov Chain Monte Carlo Simulation
8.4 Python Package pyMC3: Overview
8.5 Case Study
  8.5.1 Basic Data Analysis
8.6 Benchmark OLS Regression Estimation
8.7 Using pyMC3
  8.7.1 pyMC3 Bayesian Regression Setup
  8.7.2 Bayesian Estimation Results
8.8 Extensions to Other Analyses
  8.8.1 Sample Mean Analysis
  8.8.2 Sample Proportion Analysis
  8.8.3 Contingency Table Analysis
  8.8.4 Logit Model for Contingency Table
  8.8.5 Poisson Model for Count Data
8.9 Appendix
  8.9.1 Beta Distribution
  8.9.2 Half-Normal Distribution
  8.9.3 Bernoulli Distribution
I previously discussed and illustrated deep analysis methods for survey data when the target variable of a Core Question is measured on a continuous or discrete scale. A prominent method is OLS regression for a continuous target. The target is the dependent or left-hand-side variable, and the independent variables, or features (perhaps from Surround Questions such as demographics), are the right-hand-side variables in a linear model. A logit model is used rather than an OLS model for a discrete target because of statistical issues, the most important being that OLS can predict outside the range of the target. For example, if the target is customer satisfaction measured on a 5-point Likert scale, but the five points are encoded as 0 and 1 (i.e., B3B and T2B, respectively), then OLS could predict a value of −2 for the binary target. What is −2? A logit model is used to avoid this nonsensical result. I illustrated how this is handled in Chap. 5.

A central, but subtle, characteristic of the data, not the models, I used in the previous chapters was their unilevel feature. By "unilevel," I mean that all measurements on survey Core Questions, and certainly for the Surround Questions, are at one level without regard for nesting the primary sampling units (PSUs) (e.g., customers) in a hierarchical, or multilevel, data structure. Let me clarify this with an example. Consider the classic example of a survey of students in a classroom. If students in only one classroom are surveyed about the amount of homework assigned, then the PSUs, the students in that class, are all at one level: the single classroom. The same holds for residents in a single apartment building, or military veterans in a single country, or travelers at a single airport, or patients of a single physician, or customers of a single store. The singularity of the measurement domain is the key. The model, whether OLS or logit, just reflects the data structure.

There are times, perhaps more often than not, in which a single PSU is in a hierarchical structure that could influence its responses. For example, you might be interested in surveying consumers about their purchase intent for a beverage product. The consumers are the primary sampling units. But suppose you randomly sample them at different stores in different neighborhoods located in different cities which are also located in different states. The consumers are nested within their neighborhoods, which are nested within the cities, which are nested within the states. There is a multilevel or hierarchical data structure rather than a unilevel structure.

Does this nesting matter? The answer is yes, because, for my example of a beverage product, purchase intent is influenced by the socioeconomic and preference characteristics of where the consumers live. Those who live in upscale neighborhoods may, and probably do, have different preferences than those in lower-scale neighborhoods. Also, the preferences of consumers in one state will differ from those in another state. For example, Todd et al. (2021) note that consumers of wine prefer wine with labels that emphasize a local county winery rather than wines produced in other states such as California. There is a distinct regional bias. This nested structure should be included in a survey data analysis because it will enable more robust results and recommendations. This same nesting should be considered when dealing with public opinion studies because there are definite regional biases
with, say, political issues. See, for example, Kastellec et al. (2019), who use a multilevel modeling approach to capture the hierarchical data structure. The statistical modeling necessary to efficiently use a hierarchical survey data structure is more complex than what I described earlier for OLS and logistic regression. The reason is the levels themselves. Each level can have an influence on the PSUs via features that determine the parameters that subsequently determine behavior at a lower level in the hierarchy, a level closer to the PSUs. This implies that the parameters at higher levels are random variables to explain. These functions of features at those levels are no different than the random variables at the PSU level. As random variables, their distributions are also functions of parameters referred to as hyperparameters. Overall, the modeling must account for these parameter distributions using information about the different levels. This is done using a Bayesian statistical approach.

In this chapter, I will describe the basics of Bayesian statistical analysis and the use of a Python package, pyMC3, to estimate Bayesian models. I will illustrate Bayesian concepts using unilevel data with OLS and logit regressions. I will also provide an example using contingency table analysis to highlight the broader scope of this approach. I will then extend the OLS and logit model frameworks in the next chapter to cover multilevel modeling using pyMC3. For an excellent but more detailed treatment of the material I will cover in this and the next chapter, see Martin et al. (2022). Also see Christensen et al. (2011), Martin et al. (2018), Gelman and Hill (2007), and Gelman et al. (2021). The book by Gelman and Hill (2007) is applicable for the next chapter.
8.1 Frequentist vs Bayesian Statistical Approaches Most books on survey data analysis, in fact most books on statistical data analysis regardless of the data’s source, emphasize one fundamental approach to analysis: a Frequentist approach. This is based on random variables that follow, or whose values are drawn from, a probability distribution, such as a normal, Student t, or chi-square distribution. Regardless of the distribution, that distribution is a function of parameters, usually two. One determines the distribution’s location and the other its shape. The first is the mean or expected value and the second the variance or its square root, the standard deviation. For example, if the random variable Yi is the number of hours a child watches TV in a survey of young children’s TV viewing habits, then you could hypothesize that Yi ∼ N (μ, σ 2 ) where μ is the mean or expected value and σ 2 is the variance. You could go further, assuming you have the necessary surround data, to hypothesize a regression model such as Yi = β0 + β1 × Carei + i where Yi is, again, a child’s TV viewing time and Care is a dummy variable indicating if the child attends daycare or not. You could extend this by doing a chi-square analysis of the proportion of children who watch TV more than, say, 5 h a day on average segmented by whether or not they attend child daycare. Or you
could simply use Shallow Data Analysis and calculate sample means as estimates of the population means and display the sample means in bar charts, one bar for each level of the Care variable.

These approaches are acceptable to most survey analysts, statisticians, econometricians, and psychometricians. They are based, however, on a critical assumption: the parameters, say β0 and β1 in the regression model or the population mean for the Shallow Data Analysis, are fixed and non-random. You do not know them. The purpose of estimation procedures is to determine their values from sample data. This is a frequency-based approach to statistics because the underlying probability theory is frequency based, founded on an experimental perspective in which an experiment is performed a large number of times and each time the desired event is counted. This frequency count is compared to the possible total count to get a relative frequency. This is the probability of the event. See Haigh (2012) and also Pinker (2021) for a nontechnical discussion.

Even though the parameters are assumed fixed in the population, interval statements are made regarding their likely range of values. These intervals are confidence intervals. Unfortunately, they are difficult for a layperson (and sometimes statisticians) to comprehend and consequently result in much misunderstanding and misinterpretation. A particular interval is calculated based on the estimated parameter and its standard error. For the population mean, the interval is Ȳ ± Zα/2 × σ/√n, where Ȳ is the sample mean, σ is the population standard deviation (assumed known here, but you can substitute the sample standard deviation), and n is the sample size. The term σ/√n is the standard error. The Zα/2 is the value from a standard normal distribution, and the confidence level is 1 − α with, usually, α = 0.05.1 This is one interval based on one sample drawn from a distribution. See any basic statistics textbook, such as Weiss (2005, Chapter 8), for this. This interval either does or does not cover or contain the true population mean. If you repeat this sampling a large number of times and each time calculate this interval, then, if α = 0.05, 95% of them will cover or contain the true mean. So, you can say you're 95% confident your one interval calculated using this formula covers or contains the true mean. I illustrate this concept in Fig. 8.1.

There are actually two frequency-based approaches to probabilities. One is the Classical (or Objective) approach, and the other is the Experimental approach. The former is typically taught in introductory statistics courses. This involves counting the number of times an event of interest can occur and then comparing this count to the count of all possible occurrences of all events of the same ilk. If E, called the event space, is the set of all possible outcomes of an event of interest and S, called the sample space, is the set of all possible events of the same ilk, then the probability of a member of E occurring is

Pr(E) = n(E)/n(S)    (8.1)

1 Normality does not have to be assumed. This is just convenient for this example.
Fig. 8.1 This illustrates the classical confidence interval concept. There were 100 samples drawn from a normal distribution with mean 100 and standard deviation 15. Each sample had size n = 25. Notice that 6 intervals do not cover or contain the true mean of 100. If, say, 1000 samples are drawn, then about 5% of the intervals would not cover or contain the true population mean. Any one interval either does or does not.
where n(·) is a counting function. As an example usually used in a Stat 101 course, consider tossing a fair coin twice. The event of interest is getting two heads (i.e., HH) on the two tosses, so E = {HH}. It's easy to see that n(E) = 1. The number of all possible outcomes is the set S = {HH, HT, TH, TT}. It's easy to see that n(S) = 4. Therefore, Pr(HH) = 1/4. A simple enumeration of the event space and sample space suffices. The counting function can be specified more formally by reasoning about the tosses. You're tossing the fair coin twice so you have two slots to fill: the first is filled by the outcome of the first toss and the second by the second toss. For the event of two heads, there is only one way to place a head in the first and one way to place a head in the second. The total number of ways to fill both slots is the product of the count in the slots, or 1 × 1 = 1. So, n(E) = 1. Similarly, the sample space, covering all possibilities, has the first slot filled in two ways (i.e., either with an H or a T) and the same for the second slot. The total number of ways to fill both slots is the product of the count in the slots, or 2 × 2 = 4. So, n(S) = 4.

Consider a deck of playing cards as another example.2 There are 52 cards in 4 suits of 13 cards each. The card game Blackjack (or simply "21") requires a total of 21 points by adding the points on the cards in order to win. The ten-point cards (10, Jack, Queen, and King) count as 10 points each, and the Ace counts as 1, but when matched with a ten-point card, it counts as 11 points. What is the probability of winning on a single hand of two cards? The event is getting 21. Since there are two cards, there are two slots to fill. Without loss of generality, let the Ace fill the first slot. There are four ways to do this. The next slot can be filled
2 This is based on Haigh (2012, p. 3), although he doesn't explain how he got his numbers.
with any of 16 cards (i.e., 4 tens + 4 Jacks + 4 Queens + 4 Kings). Therefore, n(E) = 4 × 16 = 64. The size of the sample space is more complicated. You need the number of combinations of 52 cards, using only 2 at a time, to fill two slots. This is given by the combinatorial counting function for the number of combinations of n things taken x at a time:

C(n, x) = n! / (x! × (n − x)!)    (8.2)

where n! = n × (n − 1) × (n − 2) × ... × 1 is the n-factorial. For our problem with n = 52 cards and x = 2 slots, there are C(52, 2) = 1326 combinations of two cards. So, the size of the sample space is n(S) = 1326. The probability of getting Blackjack on a single play is 64/1326 = 0.048, not very high.

These two examples of Classical probabilities are at the foundation of classical statistics as taught in a Stat 101 course. See, for example, the textbook by Weiss (2005) for a similar treatment. These texts note that a condition for this approach is that the possibilities for the event of interest are all equally likely. So, the chance of getting a head or a tail on a toss of a fair coin is the same. The chance of drawing an Ace is the same. And so on for any example in a textbook treatment of Classical probabilities.

There are situations, however, where it appears that the chances are not all equal so that there is bias. For example, if you tossed a coin a very large number of times in an experiment following the same experimental protocol (i.e., the same conditions for tossing the coin), you might, and in fact will, find runs of heads and runs of tails. This might make you suspicious of the probability concept, not to overlook a biased coin. This leads to another approach to probability theory, the Frequentist approach, which says that after a large number of trials in an experiment, the relative proportions of the count, or frequency, of the event to the number of trials will settle down, or converge, to the true probability, the one you would get using the methods I described above. For a coin toss experiment, the proportion of heads will converge to 0.50 if the coin is fair. I show this in Fig. 8.2. This notion of convergence is very important. The Classical methods are analytical; the Frequentist is experimental. See Haigh (2012) for a few examples of the Experimental approach.

Let me consider the Classical approach again, since this can be used for another classic textbook example. Consider a contingency table comparable to the ones I used in previous chapters. Let this one be a 2 × 2 table developed from survey results for a political opinion study. The objective is to determine the likelihood of someone voting or not voting in the next presidential election. Suppose n = 600 people are surveyed. A table might look like the one in Table 8.1. The sample space is n = 600 survey respondents. The political party affiliation is a Surround Question, and the voting intention is a Core Question. The probability of someone selected at random from the population and saying they plan to vote in
Fig. 8.2 This is the classic coin toss experiment. Notice how the calculated probability converges to 0.50

Table 8.1 This is a contingency table of a sample of n = 600 indicating their self-reported plan to vote in the next presidential election

             Vote   Won't vote   Total
Democrat      301           49     350
Republican    107          143     250
Total         408          192     600
the next election is P r(V ote) = 408/600 = 0.680.3 Now suppose you want to know the probability of a Democrat voting in the next election. The probability is P r(V ote | Democrat) =
n(Democrat ∩ Vote) / n(Democrat)    (8.3)
= 301/350    (8.4)
= 0.860    (8.5)
where the symbol “∩” represents the logical and (i.e., the intersection of being a Democrat and planning to vote in the next election). The vertical bar on the lefthand side of (8.3) indicates that the probability is conditioned on the row of the table for Democrats, so the probability is a conditional probability: it only holds for
3 This is actually an estimate of the probability.
those survey respondents who are (or conditioned on being) Democrats. Notice that you can rewrite (8.3) as

Pr(Vote | Democrat) = n(Democrat ∩ Vote) / n(Democrat)    (8.6)
= [n(Democrat ∩ Vote)/n] / [n(Democrat)/n]    (8.7)
= Pr(Democrat ∩ Vote) / Pr(Democrat)    (8.8)
= (301/600) / (350/600)    (8.9)
= 0.860.    (8.10)
In symbols, this is written as P r(A | B) =
Pr(A ∩ B) / Pr(B)    (8.11)
for two events A and B where I assume that P r(B) > 0. This is an important, fundamental probability result. Pinker (2021, pp. 132–141) gives some interesting examples of conditional probabilities and common errors in their use. In both the Classical and Frequentist approaches to probabilities, there is a basic assumption that there is a parameter, the probability itself, for the likely outcome of a single trial. For the coin toss, this probability is p = 1/2 and, as a parameter, it’s fixed. For a card draw, a single card has the fixed probability p = 1/52 of being drawn. For the coin toss, the probability does not change each time the coin is tossed; the tosses are independent of each other. For the cards, the probability of a draw may remain fixed or change depending on whether or not the drawn card is put back in the deck. If it’s put back, the p remains unchanged.4 Otherwise, the probability changes because the sample space changes from 52 to 51 (for a one-card draw). The parameter changes. Nonetheless, you know the parameter. For more complicated problems, there are more fixed parameters. In many real-life problems, you don’t know the parameters, but you can estimate them using data from a sample. This is where surveys become very important: they are sources of data for estimating population parameters. A third approach to probabilities, equally valid and acceptable as the Classical and Frequentist approaches, assumes that you still don’t know the parameters but goes further and adds that these parameters are themselves random variables. Their values are drawn from probability distributions. These draws reflect uncertainty due to the lack of, or incompleteness of, or suspicions about the quality and nature of the information (i.e., data) you have for your analysis.5 This incorporation
4 I'm assuming, of course, that the deck is thoroughly reshuffled.
5 There is a distinction between data and information. See Paczkowski (2022) for a discussion.
of information into the estimation of the parameters is a Bayesian approach to analysis. The Classical and Frequentist approaches do not assume anything about the information used regarding the parameters; the Bayesian approach does. The Bayesian approach is based on Bayes’ Rule which I review in the next section.
8.2 Digression on Bayes' Rule Bayes' Rule, named after Thomas Bayes, an English Presbyterian minister who was an amateur statistician and mathematician, is widely used in empirical research because it allows you to incorporate new information in your assessment of the probability of an event.6 Exactly how you form probabilities of events, and, in fact, what probabilities are, is contentious. The British philosopher Bertrand Russell stated in a 1929 lecture that:7

  Probability is the most important concept in modern science, especially as nobody has the slightest notion what it means.
Fundamentally, you can interpret probabilities as the proportion of times (i.e., frequencies) an event will occur as I did above. How these proportions are formed, what information you use to form them, is the issue. This topic, a very interesting and almost philosophical one, is beyond the scope of this book on survey data analysis. See Eagle (2021) for the philosophical discussion of probabilities, chance, and randomness. Nonetheless, it is worth noting that a major perspective on probabilities, the Bayesian perspective, is that you form probabilities based on the information you have available and that you revise them as you gain more information. This use of information makes the probabilities conditional statements: probabilities are conditioned on the information you have. For more discussion on probabilities, see Haigh (2012), Hajek (2019), Scheaffer (1990), Andel (2001), Feller (1950), and Feller (1971). The latter two books are classics and challenging. For a readable account and how they relate to everyday decisions, see Pinker (2021).
8.2.1 Bayes’ Rule Derivation You can derive Bayes’ Rule using the basic conditional probability statement I showed in (8.11): P r(A | B) =
Pr(A ∩ B) / Pr(B).    (8.12)

6 See the Wikipedia article on Thomas Bayes at https://en.wikipedia.org/wiki/Thomas_Bayes, last accessed December 27, 2021.
7 Cited by Hajek (2019).
You could also write this as P r(B | A) =
Pr(A ∩ B) / Pr(A).    (8.13)
Using simple algebra, you can solve (8.13) for P r(A ∩ B) and substitute this into (8.12) to get P r(A | B) =
Pr(A) × Pr(B | A) / Pr(B).    (8.14)
This is Bayes’ Rule. The left-hand side of (8.14) is called the posterior probability, or simply the posterior. It’s what you get after you do the right-hand-side calculations. Better yet, it’s the probability of an event based on the most recent information you have regarding that event and its circumstances. What’s on the right-hand side? Consider the numerator first. This is composed of two terms. The first, P r(A), is the probability of event A occurring without any reference to, or reliance on, event B. It’s the probability you would assert without any knowledge of B. This is called the prior probability, or simply the prior, because it’s stated prior to you knowing B. This concept of a prior has become commonly used. See Pinker (2021) for a readable description of the prior. The second term, P r(B | A), is the conditional probability of B given that you know A. It’s actually not interpreted as a probability, but as the likelihood of B occurring knowing that A occurred. So, it tells you how likely you are to see B when A has occurred or is True. As noted by Pinker (2021, p. 152), the word “likelihood” is not used as a synonym for the word “probability.” Finally, the denominator, P r(B), is the marginal probability of B occurring determined over all possible occurrences of A, which means regardless if A occurs or not. Pinker (2021, p. 152) refers to this as the “commonness” of the event. For a discrete case, this marginal can be written as P r(B) = P r(B ∩ A) + P r(B ∩ ¬A)
(8.15)
where ¬A means "not A." For the voting example, A = Vote and ¬A = Won't Vote. This marginal probability expression can be further analyzed as

Pr(B) = Pr(B ∩ A) + Pr(B ∩ ¬A) = Pr(B | A) × Pr(A) + Pr(B | ¬A) × Pr(¬A)    (8.16)
based on the definition of conditional probabilities. It should be clear that the marginal probability, the “commonness measure,” is the sum of “prior × likelihood” terms over all values of A. In the continuous case, the summation is replaced by an integration over all possible values. In either situation, it’s this marginal probability that made Bayesian analysis challenging, to say the least.
With more terms, this marginal becomes more complex, which is especially the case involving multiple integration. In fact, the problem could become analytically intractable very quickly.
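As a quick numerical illustration (not from the text), Bayes' Rule can be checked against the voting counts in Table 8.1, treating party affiliation as the event and the intention to vote as the information:

# Bayes' Rule with the Table 8.1 counts: what is Pr(Democrat | Vote)?
prior_dem = 350 / 600            # prior: Pr(Democrat)
like_vote_dem = 301 / 350        # likelihood: Pr(Vote | Democrat)
like_vote_rep = 107 / 250        # Pr(Vote | Republican)

# Marginal ("commonness") of voting: sum of prior x likelihood over both parties
marginal_vote = like_vote_dem * prior_dem + like_vote_rep * (1 - prior_dem)
print(round(marginal_vote, 3))   # 0.68, i.e., 408/600

# Posterior: Pr(Democrat | Vote)
posterior = prior_dem * like_vote_dem / marginal_vote
print(round(posterior, 3))       # about 0.738, i.e., 301/408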
8.2.2 Bayes' Rule Reexpressions You can reexpress (8.14) as

Pr(Event | Information) = Pr(Event) × Pr(Information | Event) / Pr(Information).    (8.17)
I use this formulation because I want to emphasize the use of information. Information could be in the form of hard evidence, e.g., data, or news from some news source, preferably a credible one. Unfortunately, not all sources are credible or true or correct. See Santos-d’Amorim and Miranda (2021) for discussions about disinformation, misinformation, and malinformation. Referring to (8.17), the probability of seeing an event (e.g., an increase in sales) given the information you have (e.g., the price elasticity of demand, the customers’ preferences, market competition) is the probability of the sales increase weighted by the likelihood you will see the information about the market behavior given that sales did increase and adjusted by the information regardless of whether or not sales increased. The higher the prior probability, the higher the posterior probability; the higher the likelihood, the higher the posterior; and the more common the event, the lower the posterior. See Pinker (2021) for a readable discussion of these relationships and how they can be used to explain or account for how you formulate probabilities, i.e., posterior probabilities, of things you see every day. Equation (8.17) can be written several ways depending on your problem and the continuity of your data, but the essence of (8.17) is the same. For example, (8.17) refers to an “event” which implies a discrete occurrence of something, such as a draw from a deck of cards. If the “event” you’re interested in is a parameter of a distribution, such as the mean of a normal distribution, then the parameter is substituted for the “event” in (8.17). For example, if the parameter is θ , such as the population mean, then the probability of seeing θ given data from, say, a survey, is P r(θ | Data) =
Pr(θ) × Pr(Data | θ) / Pr(Data).    (8.18)
The “event” could also be a vector of parameters. If the data are continuous rather than discrete, then (8.17) is simply rewritten with density functions, but, again, the form is the same. The fundamental Bayes’ Rule equation is the one I show in (8.17).
8.2.3 The Prior Distribution The prior probability in (8.17) is important. It reflects what you know before you have any information, any insight into the event happening. Unfortunately, people often forget the prior or are just unaware of it and its importance. Pinker (2021, Chapter 5) provides many real-life examples in which people overlook the prior. There are two types of priors: informative and uninformative. An informative prior reflects the amount of information you have to enable you to state the precision of the prior. You can view this distribution as having a small variance, although what is “small” is not clear. Another way to view an informative prior is that it is “not dominated by the likelihood” and so it “has an impact on the posterior.” See SAS (2018, p. 133). An informative prior could be based on previous studies, outside data, expert opinion, or experience. A prior distribution with a large variance is imprecise8 and is, therefore, an uninformative prior. This prior is flat so it has “minimal impact on the posterior.” See SAS (2018, p. 132). The variance does not necessarily define the degree of uninformativeness. The uniform distribution is sometimes used as a prior distribution, but this distribution says that no value of the random variable is more likely than any other. No additional insight, no additional information, is embodied in the prior so it is uninformative. I display the two possible priors in Fig. 8.3. Some data analysts hold the position that to claim a prior is uninformative is a misnomer since any “prior distribution contains some specification that is akin to
Fig. 8.3 This illustrates two possible priors, one informative (the light-colored distribution) and the other uninformative (the dark-colored distribution). Notice the difference in the variance, or scale, of both distributions
8 If σ is the standard deviation, then the precision is τ = 1/σ².
Table 8.2 This is a partial list of prior distributions available in pyMC3. See the pyMC3 documentation for a more extensive list

Distribution   Application
Normal         Continuous random variable
Half-normal    Continuous random variable, but restricted to positive values
Uniform        Continuous random variable
Binomial       Discrete random variable
Beta           Continuous random variable, but useful for binomial problems
some amount of information."9 I agree with this. The issue is how much information the prior reflects. As noted by Gelman (2002), the only real issues are the information you use to specify the prior and how it affects the posterior. See, in addition to Gelman (2002), Gelman (2006) for discussions. I list some common priors in Table 8.2. The full list, of course, is much larger than what I show, but these will suffice here. The beta distribution is a very flexible form you might come across. See the Appendix to this chapter for a brief summary of the beta distribution.
8.2.4 The Likelihood Function The likelihood function shows the likelihood of the information occurring assuming that the parameters are true or correct: that you know them. In computational work, it’s usually written as a function of the data assuming a specification of the parameters.
8.2.5 The Marginal Probability Function The marginal term, Pr(B), is often viewed as a normalizing factor to ensure that the posterior is actually a probability, that is, it's a value in the interval [0, 1]. It can be omitted so that you can write

Pr(A | B) ∝ Pr(A) × Pr(B | A).    (8.19)
The only distributions that matter, that have to be specified, are the prior and the likelihood.
9 See https://stats.stackexchange.com/questions/20520/what-is-an-uninformative-prior-can-we-ever-have-one-with-truly-no-information?noredirect=1&lq=1. Last accessed January 4, 2022.
8.2.6 The Posterior Distribution When you weight the prior by the likelihood of information, you get a revised prior probability, the posterior, for event A: you’ve learned something and you can now use that knowledge to reformulate or update your probability of seeing the event. Before you conduct a survey of consumer preferences for a new product, you can “guess” the probability that it will sell (your prior). Guessing may be the best you can do if the product is new-to-the-world. If it’s a new version of an existing product, either yours or a competitor’s, then you can form your prior using that information. Once you have information about market preferences from a consumer survey, you can reformulate that probability (the posterior) based on the information you now have. Of course, once you gain even more insight, perhaps from a second survey or (credible) trade-press articles about market conditions, you could revise your posterior probability, yet again. Basically, the first posterior becomes your new prior which is revised to be the second, newer posterior probability. This revising can continue as you gain newer information, as you learn more and more. So, Bayes’ Rule reflects a learning process.
8.2.7 Hyperparameters of the Distributions Specifying the prior distribution for key parameters is a very important first step in a Bayesian analysis. The prior, as a probability distribution, has its own parameters which differ from and are separate from the model's parameters. The prior's parameters are hyperparameters. As an example, for the children's TV viewing regression model, Yi = β0 + β1 × Carei + εi, if εi ∼ N(0, σ²), then Yi ∼ N(β0 + β1 × Carei, σ²). You can specify a prior for β0 and β1. For example, β0 ∼ N(0, 10²) and β1 ∼ N(0, 10²). The means and variances of these priors are the hyperparameters. What is the basis for each of them? There is no guidance for this except, perhaps, trial and error as well as intuition. For example, the means of zero suggest they would be consistent with the Null Hypothesis for each parameter. Typically, the Null Hypothesis states that the feature (Care in this example) has no effect on the target. So, a zero mean for the prior might be reasonable. Regarding the variances, the standard deviation of 10 suggests an uninformative prior: there is a large spread, one larger, at least, than a prior with a standard deviation of, say, 1 or 2. The less you feel you know for the prior, the more you should err on the side of an uninformative one. What about a prior for σ²? This is also a parameter that must be estimated. You could assume a normal distribution for this parameter, but then you run into a problem. Any value on the real line can be drawn from a normal distribution, including negative values. The variance, however, is strictly positive: σ² > 0. A commonly used prior is the half-normal which I discuss below.
It's important to notice how a prior is specified. A prior is a probability distribution, and as such, it's specified using distribution notation and concepts. There's the distribution itself, such as normal, the mean of the distribution, and the variance. In general terms, a prior for the parameter θ could be specified as θ ∼ N(0, σ²). Where is the information, the knowledge, the experience I referred to above? It's in the distributional form, the mean, and the variance. You will see examples of this below.
8.3 Computational Method: MCMC Until recently, the application of Bayes’ Rule was very limited because the mathematics to implement it was challenging beyond simple problems. In fact, it was intractable as I noted above. Just skimming through any earlier books on Bayesian analysis, such as Christensen et al. (2011) and Gill (2008), will show why. The advent and advancement of computational methods, however, have made the use of Bayes’ Rule more practical. This is true of standard statistical calculations and, more importantly, for the estimation of complex statistical models such as OLS and logistic regression models. This also holds for the models I will consider in the next chapter: multilevel regression models. The computational advance is in Monte Carlo simulations coupled with Markov Chains: collectively referred to as Markov Chain Monte Carlo (MCMC). Markov Chains are named after the Russian mathematician, Andrey Markov.10 Monte Carlo is a simulation technique for randomly drawing independent samples from a distribution to approximate that distribution. It was developed in the late 1940s by Stanislaw Ulam at the Los Alamos National Laboratory and later expanded by John von Neumann for nuclear weapons and power research.11
8.3.1 Digression on Markov Chain Monte Carlo Simulation The MCMC method is composed of two parts: a Markov Chain and a Monte Carlo simulation. A Markov Chain is a random movement from one state to another. For example, consider a two-state employment situation: employed and unemployed. Someone moves from being employed one moment, to resigning the next and becoming unemployed (even if temporarily), to being hired elsewhere the next moment. Of course, there is also the movement from being employed one moment
10 See https://en.wikipedia.org/wiki/Markov_chain. Last accessed January 3, 2022.
11 See https://en.wikipedia.org/wiki/Monte_Carlo_method#History. Last accessed January 4, 2022.
Fig. 8.4 This illustrates a simple Markov Chain with two states: employed and unemployed. There are four possible movements or transitions as I indicate in the figure. The transitions have associated probabilities that can be arranged in a transition matrix such as the one in (8.20)
to still being employed the next moment (i.e., not resigning) or being unemployed one moment and still unemployed the next. There is a time component to a Markov Chain. There is a sequence of random variables where the next in the sequence only depends on the previous one as the origin of a move. The movement from one state to the next is linked to another, later movement to a new state, hence the notion of a chain. The movement from one state to the next is one link in a chain; the movement from that state to yet another state is another link in the chain. The length of the chain, the number of links, is however long the sequence of movements runs. I illustrate one possible chain with two states in Fig. 8.4.

A probability is attached to each movement or transition: a probability of resigning, a probability of not resigning, a probability of being hired elsewhere, and a probability of not being hired elsewhere (the employee continues to be unemployed). These movements are called stochastic movements because of these probabilities. The probabilities associated with the transitions are represented in a matrix called the probability transition matrix. I illustrate one possibility for Fig. 8.4 in (8.20). The columns represent Employed and Unemployed, in that order; the rows are similarly defined. The columns are the current state and the rows the previous state. So, the probability of being employed one period and employed the next is 0.75; the probability of being unemployed one period and employed the next is 0.45. Notice that each row sums to 1.0 because the rows represent a starting position for a transition.

T = [ 0.75  0.25 ]
    [ 0.45  0.55 ]    (8.20)

The distinguishing characteristic of a chain is that the movement from one link or state in the chain to the next is independent of past movements except the prior state. The chain is said to be memoryless. A chain with this memoryless property is a Markov process.
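A minimal simulation of this two-state chain, using the transition matrix in (8.20), might look like the following; the state coding and the random seed are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(42)

# Rows are the previous state, columns the current state: 0 = Employed, 1 = Unemployed
T = np.array([[0.75, 0.25],
              [0.45, 0.55]])
states = ["E", "U"]

chain = [0]                               # start the chain in the Employed state
for _ in range(9):
    prev = chain[-1]
    # the next state depends only on the previous one: the memoryless property
    chain.append(rng.choice(2, p=T[prev]))

print("".join(states[s] for s in chain))  # e.g., a sequence like 'EEEUUEEEUE'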
For the chain I show in Fig. 8.4 and its transition matrix in (8.20), you can imagine, say, being employed which puts you in the first row of the transition matrix. Then you randomly draw from the conditional distribution of that row. Suppose the draw says you are employed again. There is a 75% chance this will happen. Then you randomly select again from the first row. Now suppose the draw says you’re unemployed. There is a 25% chance this will happen. You move to the second row of the transition matrix (8.20) and randomly draw from that row’s conditional distribution. The fact you were employed two periods ago is irrelevant; just the last state of being employed matters because that determines which row’s distribution you select from. Now you are unemployed so you select from the second row. You continue to make draws in this fashion. The sequence of states you move through as you randomly draw from a probability distribution defines a Markov Chain. The length of the chain for as far as I described the draws is 3 and the chain consists of the sequence {E, E, U }. A good example of a Markov Chain is a random walk in which the movement from one state to another depends only on the previous state and no others.12 One way to write a random walk equation to illustrate this memoryless property is Yt = Yt−1 + at
(8.21)
where at is a random draw from a probability distribution. For example, you could draw a random integer from 1 to 4 from a uniform distribution. The transition matrix is clearly ⎡
T = [ 0.25  0.25  0.25  0.25 ]
    [ 0.25  0.25  0.25  0.25 ]
    [ 0.25  0.25  0.25  0.25 ]
    [ 0.25  0.25  0.25  0.25 ]    (8.22)
There are four columns for the four possible moves and similarly four rows. Each movement has a probability of 0.25, and each row’s probabilities sum to 1.0. If you randomly draw a 1, you move one step to the right; if 2, you move one step to the left; if 3, you move one step up; if 4, you move one step down. I show the code to implement this in Fig. 8.5 and a graph of the result in Fig. 8.6. Notice in the code and the graph that the movement from one state to another depends only on the prior state and the random integer drawn. This is the memoryless Markov process. See Malkiel (1999) for a classic application of the random walk process for Wall Street. This is definitely worth reading. Also see Mlodinow (2008). The Monte Carlo part of the simulation name reflects your random draws from a probability distribution. Monte Carlo methods are widely used in computational
12 See https://en.wikipedia.org/wiki/Random_walk for a good discussion of random walks as a Markov Chain. Also see https://en.wikipedia.org/wiki/Markov_chain. Both articles last accessed January 3, 2022.
Fig. 8.5 This is the code for generating the random walk I described in the text. Notice that the movements are determined only by the previous state and the random draw. This is a Markov process. See the graph of this random walk in Fig. 8.6
methods to solve problems that would be difficult, if not impossible, to analytically solve. There is an extensive literature on Monte Carlo methods. See Shonkwiler and Mendivil (2009) for a good, readable introduction.
Fig. 8.6 This is a graph of the random walk generated by the code in Fig. 8.5
8.3.2 Sampling from a Markov Chain Monte Carlo Simulation The process of sampling from a posterior distribution is not very complicated, although, of course, it can become complicated depending on the problem. The sampling process, fundamentally, involves randomly drawing from a posterior distribution following a Markov Chain. A single draw only depends on where in the distribution the sampler is currently located, not where it had been previous to the current draw. This is the “Markov Chain” part of the procedure’s name. The next move from that current position is randomly drawn. This is the “Monte Carlo” part. A good description of the process is available in Martin et al. (2018, Chapter 8).
8.4 Python Package pyMC3: Overview The Python package, pyMC3, is a computational tool for estimating statistical models using Bayes’ Rule. It uses an intuitive syntax with advanced methods to implement MCMC. Martin et al. (2022) provide excellent examples of its use for a wide variety of problems. pyMC3 must be installed on your computer along with a graphics package, ArviZ. The latter package allows you to graph all the essential output from pyMC3. You install pyMC3 using either pip or conda. For pip, use pip install pymc3, and for conda use conda install -c conda-forge pymc3.
pyMC3 works best with ArviZ if you use version 3.8.¹³ You install ArviZ using either pip or conda. For pip, use pip install arviz, and for conda use conda install -c conda-forge arviz. Once these packages are installed and imported into your Python code, you can check the version numbers for both using

import pymc3 as pm
import arviz as az

print( f'Running pyMC3 v{pm.__version__}' )
print( f'Running az v{az.__version__}' )

where "pm" and "az" are the aliases for pyMC3 and ArviZ, respectively.
8.5 Case Study The producer of a boutique beverage product sold through a grocery store chain located in New England wants to know and understand the price elasticity for its product. It's contemplating a price move to thwart similar boutique producers as well as several major beverage producers from taking market share. It wants to gain a competitive advantage and decides that analytics is the best way to do this.

The grocery stores are located in suburban and urban areas. Stores come in different sizes (i.e., selling surfaces) ranging from small storefronts (e.g., Mom & Pops) to "Big Box" stores (e.g., warehouse clubs). Even within one chain, store sizes vary, perhaps due to geography, and this has implications for marketing mix strategies. Retailers, for example, are moving to "customized pricing practices . . . in which pricing depends on store size and clientele" (Haans and Gijsbrechts 2011). Evidence shows that they "price promote more intensely in their large stores," but the evidence is shaky. Large stores tend to be in suburban areas and small ones in urban areas. The available real estate for either location is the issue. An implication of store size is that beverage products are more elastic in large stores, but this is not always clear. Also, they are more elastic in urban areas because of intense competition and more inelastic in suburban areas because of the value of time for shopping. Consumers would have to drive to the next shopping mall, which is time-consuming. If the value of time outweighs a price saving, then the product would be more inelastic. I summarize some factors that might impact price elasticities in Table 8.3. These are mostly convenience factors.

Fictional data on a retail chain in New England states were generated. There is one consumer product and six stores: three suburban (large) stores and three urban (small) stores. Large and small are defined by store square footage. The sample design is a stratified random sampling design from the two areas: suburban and urban. There are more than three stores in each area, but sampling costs
13 As of January 17, 2022.
Table 8.3 This table provides some characteristics of large and small grocery stores

          Large stores                    Small stores
Benefits  • Increased parking             • Personal treatment
          • Additional services           • More competition
          • Wider variety                 • Neighborhood focus
          • One-stop shopping             • Wider variety of stores
Costs     • Longer distance to travel     • Smaller variety
          • More/longer aisles            • Fewer products
          • Longer checkout time          • Frequent store entry/exit
          • Higher in-store search        • Higher store search
prohibit sampling from all stores—hence, the random stratified sample design. The researchers randomly sampled customers from the three stores in each stratum. The grocery chain's management refused to divulge any shopping habits (e.g., weekly amounts purchased) from their internal database to protect privacy as well as to avoid being accused of providing them with a competitive advantage. However, they agreed to allow the consultants to approach shoppers in the aisle where the product is on display and ask them questions after grocery store chain-approved identification was provided. This method of shopping research differs from traditional market research in that consumers are interviewed at the moment of their buying decision. The method is called an ethnographic study, and the measurements are called "point-of-experience" key performance measures (KPIs).14 This approach allows researchers to gain better insight into shoppers' buying decisions about the products they compared (the attributes) and why they chose the one they eventually bought. But it also allows them to recruit shoppers for an in-home diary-keeping phase of the study. A diary, much like for Nielsen ratings,15 allows researchers to track behavior over an extended period of time. Diaries are sometimes used with in-home use tests (referred to as IHUTs (in-home user tests) and HUTs (home user tests)). Consumers are given products to use at home, and they record their use patterns and experience in a diary.16

For this study, the researchers wanted to understand the weekly shopping behavior, primarily the quantity purchased and price paid. Six hundred consumers were recruited from the ethnographic phase: 100 from each store. Those who agreed to participate kept a diary of their purchases. The Surround Questions were asked when they were recruited at the store location. The diary was maintained for 1 month to avoid fatigue which could cause dropouts. It was assumed that the average consumer would make at least four shopping trips in the month, which the diaries showed was generally correct. Prices varied weekly due to sales/promotions which may impact the amount purchased. The grocery chain heavily promotes beverages in general, offering weekly advertised promotions and unannounced in-store specials.

The consumers also kept a record of approximately how long they had to wait in line (i.e., a queue) before being checked out (i.e., served). The waiting time in the queue obviously depends on the time-of-day and day-of-week they shopped. These were also recorded. The waiting time is a feature of the store since it characterizes the number of checkout stations available and staffed by checkout personnel. In addition, the size of the store in square footage, which is part of the store data, determines the number of checkout stations and thus the waiting time. These are features of the stores and have nothing to do with the consumers per se. I will return to the waiting time in Chap. 9. See DeVany (1976) for a theoretical economic discussion of queueing theory and price effects.

Each consumer's purchases, prices paid, and waiting time were averaged to a monthly number. The final data set contained the average price paid, the household income, the average purchase size (i.e., the number of bottles purchased), and the average waiting time. In addition to the diary data, the researchers knew the store location based on obvious observation. It's hypothesized that the location would influence consumers' shopping behavior. This provides the context for that shopping. I will use this data in Chap. 9.

14 See, for example, the description of SmartRevenue, Inc. at https://www.linkedin.com/company/smartrevenue/about/, last accessed December 7, 2021. SmartRevenue is now defunct.
15 See https://global.nielsen.com/global/en/. Last accessed December 7, 2021.
16 See https://www.sisinternational.com/ as an example of a market research company using this method. Last accessed December 7, 2021.
8.5.1 Basic Data Analysis The first step in analyzing the data is to plot data distributions. I show a histogram of the quantity purchased by the 600 consumers in Fig. 8.7. This suggests a slight skewness, so I ran a skewness test. I show the result in Fig. 8.8. The Null Hypothesis is that there is no skewness, and the Alternative Hypothesis is that there is skewness. The p-value was 0.0031, which is less than 0.05, so the Null Hypothesis is rejected. I then took the natural log of the quantity to eliminate the skewness. I show the histogram for the natural log of quantity in Fig. 8.9. Notice that there is still skewness so that in this case the problem was not corrected. The same analysis could be repeated for the price, income, and waiting time. In each instance, the natural log transformation could be used. I do not show them here, but I did take the natural log of each of these features. The reason is that I can easily obtain an elasticity when natural logs are used for the dependent and independent variables in a regression model. Since an objective is to estimate the price elasticity, using this log transformation has an advantage. See Paczkowski (2018) for an extensive discussion of using the natural log transformation to estimate an elasticity.
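The skewness test in Fig. 8.8 can be reproduced with scipy, for example; the DataFrame and column name (df, quantity) are assumptions for illustration.

import numpy as np
from scipy import stats

# Skewness test for the average quantity purchased; the Null Hypothesis is no skewness
stat, pvalue = stats.skewtest(df["quantity"])
print(f"statistic = {stat:.3f}, p-value = {pvalue:.4f}")

# Natural log transformation used to try to remove the skewness
df["logQuantity"] = np.log(df["quantity"])
print(stats.skewtest(df["logQuantity"]))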
Fig. 8.7 This is the histogram of the average quantity purchased by each consumer
8.6 Benchmark OLS Regression Estimation I show a regression model setup in Fig. 8.10 and the regression results in Fig. 8.11. The model is

ln(Qi) = β0 + β1 × ln(Pi) + β2 × ln(Ii) + εi    (8.23)
where Qi is the average weekly quantity of the beverage purchased by the ith household, Pi is the average price paid for the beverage, and Ii is the household disposable income. Waiting time is reserved for Chap. 9. The model is a pooled regression model that I will discuss again in Chap. 9. It is referred to as “pooled” because there is no allowance made for the hierarchical structure of the data. Each consumer belongs to just one level (i.e., one store) and is treated equally. So, this is a
Fig. 8.8 This is the skewness test for the average quantity purchased by each consumer. The Null Hypothesis is clearly rejected
unilevel example. The reason for showing the regression here is to have a benchmark for comparison with the pyMC3 estimation. I will repeat this model in Chap. 9. The analysis of the estimates is the same as the analysis I discussed in Chap. 5.
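Figure 8.10 is not reproduced here, but a sketch of a comparable statsmodels setup, with assumed column names, is:

import numpy as np
import statsmodels.formula.api as smf

# Log transformations for the log-log specification in (8.23)
df["logQ"] = np.log(df["quantity"])
df["logP"] = np.log(df["price"])
df["logI"] = np.log(df["income"])

# Pooled OLS benchmark: every consumer is treated as belonging to one level
ols_fit = smf.ols("logQ ~ logP + logI", data=df).fit()
print(ols_fit.summary())

With this log-log form, the estimated coefficient on logP is directly interpretable as the price elasticity.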
8.7 Using pyMC3 To use pyMC3, you need to specify a prior distribution for the parameters of a model, the likelihood function, and the data. Of course, you also have to specify the model. These are all specified in paragraphs. The paragraphs are contained in a with clause identified by an alias.
8.7.1 pyMC3 Bayesian Regression Setup

The setup for a basic regression model estimation in a Bayesian framework is different than that in a Frequentist setup, such as the one I showed in Chap. 5 and then again in Fig. 8.10 for this chapter's case study. In the Frequentist setup, the three parameters of (8.23) are assumed to be fixed numeric values in the population. The disturbance term, εi, is a random variable so the target variable is also a random variable. If the disturbance term follows the Classical Assumptions, then
Fig. 8.9 This is the histogram of the average quantity purchased by each consumer on a natural log scale
E(ln(Qi)) = β0 + β1 × ln(Pi) + β2 × ln(Ii) and ln(Qi) ∼ N(E(ln(Qi)), σ²). What is σ²? It's the variance for the disturbance term. In general, you have an OLS model in the Frequentist framework:

Y ∼ N(μ, σ²)    (8.24)
for μ = E(Y) and σ² = V(ε). In the Bayesian framework, the coefficients are not fixed, but are, instead, random variables. Each one is assumed to be distributed following a prior distribution with hyperparameters. The price slope, for example, could be specified as

β1 ∼ N(0, σ1²)    (8.25)
Fig. 8.10 This is the setup to estimate the OLS model for the beverage purchase survey. This is the same setup that I described in Chap. 5
with two hyperparameters, μ = 0 and σ² = σ1², similar to the intercept and other slope in the model. The normal is a commonly used distribution in Bayesian analysis. What about the disturbance term's variance? It too has a prior distribution sometimes written as

σ ∼ HN(σ0)    (8.26)

where only one hyperparameter, σ0 > 0, is needed. This distribution is called the half-normal distribution, represented by HN.17 It's just a normal distribution with mean 0 that is cut in half at the mean with the positive half retained and the negative half discarded. This distribution is sometimes characterized as folding the distribution in half so that only the top half, the half above the zero mean, is used. The reason for using the half-normal is to ensure positive values for the disturbance term's standard deviation. Otherwise, negative values could result which would be unacceptable. I specified three paragraphs in Fig. 8.12 for this model. I set the priors in Paragraph 1. The term beta0 is the intercept prior; beta is the vector of slope priors, one for the logPrice and the other for logIncome; sigma is the prior for the standard error of the regression (the positive square root of σ²). The priors are specified by using the appropriate distribution with arguments for the name of the parameter, the mean, and the standard deviation (if needed, but not the variance). I specified them as draws from a normal distribution.
17 See the Wikipedia article "Half-normal distribution" at https://en.wikipedia.org/wiki/Half-normal_distribution. Last accessed January 9, 2022.
Fig. 8.11 These are the estimation results for the OLS model in Fig. 8.10 for the beverage purchase survey
Other distributions are available, but least squares theory shows that the coefficient estimators are normally distributed if the disturbance term is normally distributed. The means are set to zero since another value is not available. The standard errors for the betas are made large. The general argument is sigma = X where X is the prior value. You could, instead, specify the precision which is just the inverse of the squared standard deviation. You do this using tau = 1/X². You cannot use both sigma and tau; it is one or the other. The values I used could be interpreted as uninformative. You can always
experiment with other values. The sigma is also a draw from a normal distribution, but this distribution is truncated to a "half" normal (the upper half) since σ cannot be negative. The distribution is a half-normal. The standard error of the regression equals the square root of the mean square error (MSE) in Fig. 8.11. If a vector is required rather than a scalar, each with the same mean and standard deviation, then the length of the vector must also be included. A "shape" argument is used for this. The default is "shape = 1." I used "shape = 2" in this example for the "beta" prior vector holding the two slope priors, [β1, β2]. You should notice that the line for each prior specification contains a character string on the right-hand side in the first argument position. For example, the prior for "beta0" is beta0 = pm.Normal( 'beta0', mu = 0, sigma = 10 ). The string 'beta0' in the Normal command is just a label that will appear in the output. I recommend that you make this label descriptive. Others recommend that you just repeat the name of the variable on the left-hand side to avoid confusion. I think this invites confusion. You can gain more insight into the Paragraph 1 priors by rewriting them. Let me focus on the prior for β1 for the logPrice variable. The pyMC3 statement beta = pm.Normal( 'beta', mu = 0, sigma = 10, shape = 2 ) is interpreted as specifying a prior for β1 and β2 based on the shape = 2 argument. For just β1, this is a normally distributed random variable with mean 0 and standard deviation of 10, or

β1 ∼ φ( (μ − μ1) / σ1 )    (8.27)
   ∼ φ( (μ − 0) / 10 )    (8.28)
where φ is the standard normal pdf function, μ1 is the specified mean for β1, and σ1 is the specified standard deviation. The same holds for β2. I specified the expected value of the target in Paragraph 2. This is important because, based on least squares theory, the expected value of the target is what you are estimating. If the model is Yi = β0 + β1 × Xi + εi with εi ∼ N(0, σ²), then E(Y) = β0 + β1 × Xi. This is the expected value I specify in Paragraph 2. Once the values for the priors are selected from Paragraph 1, then the "mu" value in Paragraph 2 is known. This statement is deterministic. There is no random component. The randomness is only in Paragraph 1. I specified the likelihood in Paragraph 3. This is a function of the observed data, the expected value from Paragraph 2, and the variance from Paragraph 1. The parameter priors are already in the expected value. As a likelihood function, it shows the likelihood of seeing the data (i.e., the information) given the parameters. Comparable to what I wrote in (8.28), the likelihood is

observed ∼ φ( (Data − μ) / σ )    (8.29)
where data is the observed data from the observed = df.logQuantity argument, μ is the mean for the regression from Paragraph 2, and σ is the prior from Paragraph 1. The with pm.Model statement combines (8.28) and (8.29) using Bayes' Rule:

Pr(β1 | Data) ∝ Pr(β1) × Pr(Data | β1)    (8.30)
             ∝ φ( (μ − μ1) / σ1 ) × φ( (Data − μ) / σ ).    (8.31)
This formulation may help you understand the setup in Fig. 8.12 and the three paragraphs. The with pm.Model merely organizes the components of Bayes' Rule.18 A random seed, set as 42, is used to allow you to reproduce the results. Since random draws are part of the MCMC method, you will get different results each time you run the code. The random seed function from the Numpy package allows reproducibility. Any seed could be used. The with statement is named, or given an alias, "reg_model" in this example. Any alias could be used. The results of calculations using the priors and likelihood along with the observed data are stored in this alias. The with clause is a container for the model; it contains the model that is used when sampling is done.

Fig. 8.12 This is the setup and results for a regression in Fig. 8.11 replicated with pyMC3. Notice that the estimated MAP parameters are almost identical. The sigma term is the standard error of the regression and matches the MSE in the ANOVA table for Fig. 8.11 after squaring this standard error

18 See Rob Hicks' course notes, which are the basis for this discussion, at https://rlhick.people.wm.edu/stories/bayesian_7.html, last accessed January 16, 2022.
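Pulling the three paragraphs together, a minimal sketch consistent with the setup just described might look like this. The prior values and the use of column arrays are illustrative assumptions patterned on the discussion; the book's Fig. 8.12 is the authoritative version.

import numpy as np
import pymc3 as pm

np.random.seed(42)    # reproducibility of the random draws

with pm.Model() as reg_model:
    # Paragraph 1: priors
    beta0 = pm.Normal('beta0', mu=0, sigma=10)            # intercept prior
    beta = pm.Normal('beta', mu=0, sigma=10, shape=2)      # slope priors: logPrice, logIncome
    sigma = pm.HalfNormal('sigma', sigma=10)               # standard error of the regression

    # Paragraph 2: deterministic expected value of the target
    mu = beta0 + beta[0] * df['logPrice'].values + beta[1] * df['logIncome'].values

    # Paragraph 3: likelihood of the observed data given mu and sigma
    observed = pm.Normal('observed', mu=mu, sigma=sigma, observed=df['logQuantity'].values)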
8.7.2 Bayesian Estimation Results

There are summary measures and one set of graphs to examine from the Bayesian estimation.
8.7.2.1 The MAP Estimate
One measure is the maximum a posteriori (MAP) estimate, the mode of the posterior distribution. It is a point estimate of a parameter, but unfortunately, it can be, and sometimes is, biased. This happens when the distribution is skewed. The bias follows from the basic statistical relationship among the mean, median, and mode of a distribution and what happens to this relationship due to skewness of the distribution. It is shown (or at least mentioned) in an introductory statistics course that for a normal distribution, which is symmetric about the mean, the relationship is mean = median = mode. When the distribution is skewed, then the three will differ depending on the extent of the skewness. I illustrate a skewed distribution and the three measures in Fig. 8.13. The MAP, as an estimate of the local maximum of the posterior distribution, can be useful for some applications. For example, are estimated regression coefficients close to what you expect? It is important to remember that the MAP estimate is a point estimate of the local maximum of the posterior. As a point estimate, it is limited. The MCMC method traces the whole distribution, so you can get other statistics about the distribution, not just the mode. The MAP is not part of the MCMC sampling method. It is obtained by numerically maximizing the posterior rather than by sampling from it. Using pyMC3, the MAP estimates are obtained using the find_MAP function that takes the model name as an argument. I illustrate this in Fig. 8.14. The find_MAP function actually does a transformation of some of the variables because the variables, as specified in your model, may have restricted bounds which would complicate estimations. The transformation is transparent to you; you just see the results. An example is the standard deviation which is specified to be strictly positive. To enable efficient estimations, the function transforms the bounded variables to be any value on the real line. A log transformation is common. The transformation does nothing to the results. The find_MAP function returns both forms of the variable,
Fig. 8.13 This illustrates a right-skewed distribution. As explained in a basic statistics course, the mean, mode, and median differ due to the skewness. This has implications for the MAP estimate from Bayesian estimation procedure
but typically the untransformed value is what you want; the transformed version is indicated by the naming convention [var name]_[transformation]__ (NOTE: this is a double underscore). The values are returned in a dictionary with the parameters as the keys and the estimates as the values in arrays. A dictionary comprehension can be used to delete the transformed version, leaving the original variable only. The dictionary comprehension might be map = {key : item for key, item in map.items() if "__" not in key} where "map" is the name for the find_MAP returned dictionary. The MAP values in Fig. 8.12 are almost identical to the OLS results.
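A sketch of the find_MAP call and the dictionary comprehension just described; the model alias reg_model is carried over from the earlier setup, and the dictionary name is arbitrary.

import pymc3 as pm

# MAP (maximum a posteriori) point estimates for the model parameters
map_estimates = pm.find_MAP(model=reg_model)

# Drop the transformed versions, whose keys carry the double-underscore suffix
map_estimates = {key: item for key, item in map_estimates.items() if '__' not in key}
print(map_estimates)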
Fig. 8.14 This is an example of finding the MAP estimates for the parameters of a regression model. The model itself is explained in the text
You could obtain better, more informative results if you do multiple draws, or samples, from the posterior distribution. I display the statistics for n = 500 random draws from the posterior distributions in Fig. 8.15. Since these are random draws, a random seed of 42 (any integer could be used, such as 1234) is set so that the same draws happen each time this code block is run. If the random seed is not set, then your computer’s current clock time is used to determine the random draws, and, since the clock time constantly changes, different results will occur with each run of the code block.
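A sketch of the sampling step and the summary table; the values (500 draws, seed 42, 95% HDI) follow the discussion, while return_inferencedata and the exact keyword names are assumptions that may vary by pyMC3/ArviZ version.

import arviz as az
import pymc3 as pm

with reg_model:
    # 500 draws per chain; 4 chains by default
    idata = pm.sample(draws=500, random_seed=42, return_inferencedata=True)

# Posterior means, standard errors, and 95% HDI bounds
print(az.summary(idata, hdi_prob=0.95))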
8.7.2.2 The Visualization Output
A minimal output from the sampling is displayed in Fig. 8.15:
1. the estimated mean of the posterior for the parameters, each one over the n = 500 draws;
2. the corresponding standard errors of the means;
3. the 95% HDI bounds (explained below); and
4. the trace diagrams.

Fig. 8.15 This is the setup and summary statistics for the pooled regression in Fig. 8.12. Compare these to the results in Fig. 8.11

The means and standard errors are self-explanatory. The annotation in the shaded area at the bottom of the code cell in Fig. 8.16 notes that four chains were sampled. The number of chains and the sample size determine the total number of draws from the posterior. If there are 4 chains and the sample size is 1000, then there will be 4000 draws: 1000 for each chain. You control the number of chains with a "chains" argument: say, chains = 2 for two chains. The default is four chains. So, draws = 500 produces a sample of 500 draws per chain from the posterior distribution.19 With 4 chains, 2000 draws are taken. The default number of draws or samples is draws = 1000 per chain. You can change the draws argument. The sampling process terminates when the specified number of draws is taken. The ArviZ package creates visual diagnostics that summarize these draws from the posterior distributions. These are the trace diagrams in two columns. The draws are shown in the right-hand panels. A KDE histogram plot of each chain for each variable is shown in the left-hand panels. These histograms are created from the runs in the right-hand panels. You can use a legend = True argument to identify the chains in the histograms. The trace graphs are useful to examine because they reveal the pattern of the sampling and any potential problems. The problems are:
19 Note: the "draws" keyword is not required because it is the first argument to the function.
Fig. 8.16 These are the posterior distribution summary charts for the pooled regression model in Fig. 8.12
• burn-in period to discard;
• autocorrelation; and
• non-convergence.
The MCMC sampling procedure should converge to a long-run equilibrium posterior distribution. I mentioned convergence above when I reviewed the classical coin tossing problem for the Frequentist approach to probabilities. I noted that after many tosses (thousands?), the series should converge to a stable number (0.50 for a fair coin). Any difference from this convergence number indicates bias in the coin. Convergence for the MCMC sampling process means that after many samples have been drawn (perhaps thousands), the ultimate equilibrium or stationary distribution should be known. The trace plots tell you how the process evolved to that final distribution. It is possible, given the prior, data, and model complexity, that there is no convergence. It is also possible that the sampling procedure decays toward a long-run stationary distribution, in which case you should know how long it took to reach equilibrium. I show four possible trace plots in Fig. 8.17. In all but the last one, 2000 samples were drawn, and the parameter of interest (θ in this case) was shown for each plot. Basically, the trace plots show the evolutionary path to a long-run equilibrium value for a parameter. For each trace plot, you should look for the convergence to an equilibrium mean and also for a constant variance around that mean. In Fig. 8.17a, you can see that the sampling converged very quickly to settle on θ ≈ 1.5.
Fig. 8.17 These are four possible patterns for a trace plot. Only the first one is ideal. Notice the randomness throughout the entire plot. (a) Good trace plot. (b) Trace plot showing burn-in period. (c) Trace plot showing autocorrelation. (d) Trace plot without convergence
The variance appears to be constant. This is the ideal pattern. Compare this to Fig. 8.17b in which there is a large initial decay before convergence is attained around 250 sampling runs. The first 200 samples are referred to as a burn-in period before convergence is reached. The burn-in samples are typically discarded because retaining them would bias calculations of distribution statistics such as the mean and variance. A main characteristic of a Markov Chain is that the draw for the next move in the chain depends only on the current location and a random draw from a probability distribution, not on the chain's earlier history. This is the random walk. Ideally, the retained draws then show little autocorrelation. Sometimes, however, you will see a trace plot like Fig. 8.17c which indicates autocorrelation. Notice the subtle sine wave pattern. pyMC3 provides methods to check for autocorrelation.20 Finally, notice in Fig. 8.17d that there is divergence, not convergence, although there may be a suggestion of the beginning of convergence when about 500 samples have been drawn. This suggests that more than 500 may be needed and that at least the first 500 have to be discarded; the burn-in is about 500. Nonetheless, this may indicate that the model is not a very good one.
20 See the pyMC3 and ArviZ documentation.
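A sketch of the ArviZ diagnostic plots discussed above, trace plots (with a legend identifying the chains) and autocorrelation plots; idata is the sampling output from the earlier sketch.

import arviz as az

# Trace diagrams: KDE histograms (left panels) and the sampled chains (right panels)
az.plot_trace(idata, legend=True)

# Autocorrelation of the draws within each chain, one panel per parameter
az.plot_autocorr(idata)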
Fig. 8.18 These are the posterior plots for the parameters of the regression model. Notice the region of the logIncome density that I highlighted. There appears to be a bimodality in this distribution
It is instructive to examine the posterior distribution for each parameter which you get separately using the command az.plot_posterior( smpl ) where smpl is the name of the sampling output (a usage sketch follows the list below). I display the posteriors in Fig. 8.18. There are four parameters for our problem, so there are four posterior distributions. The sampling method actually produces one for the log of the disturbance term and the disturbance term unlogged. The logged version is redundant, so I deleted it using a dictionary comprehension. Each panel shows the distribution of a parameter, a distribution that you would not have with the Frequentist approach to OLS regression. Each panel is annotated with
• the mean and a vertical reference line at that mean;
• the 95% HDI indicated by the thick black line at the bottom;
• the lower and upper HDI bounds; and
• the range for the density bounding the reference line.
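The usage sketch referred to above; ref_val places the reference line and hdi_prob sets the credible interval width (argument names follow current ArviZ releases and may differ in older versions).

import arviz as az

# Posterior plot for each parameter: mean annotation and 95% HDI
az.plot_posterior(idata, point_estimate='mean', hdi_prob=0.95)

# Add a reference line, e.g., at a Null Hypothesis value of 0 for the slopes
az.plot_posterior(idata, var_names=['beta'], ref_val=0, hdi_prob=0.95)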
Notice that three of the four distributions are unimodal; the one for logIncome appears to be bimodal. I extracted this one and displayed it separately in Fig. 8.19. There is a definite indication of bimodality. This suggests that there are two groups of beverage consumers based on household income so that this distribution is a mixture of underlying distributions. Exactly where the split between one group and another occurs is not clear without a further examination of the survey data, which is beyond the scope of this example. Nonetheless, the two groups could suggest a marketing strategy based on household income. This is an observation that would not have been possible with the Frequentist approach to OLS because the parameters are assumed to be fixed, so there is only one population parameter for income. You now have the whole distribution for income (as well as each of the other parameters), so you could look for patterns such as this bimodality. A reference line can be placed on the posterior plot anywhere you want. The mean seems the most logical and is the one I drew; the median and mode could also be used. If you look at Fig. 8.19, you will see the line at the mean of the density. The reference line divides the density into two parts: an area to the left of the line and one to the right. The percent of density area to either side is indicated on the
Fig. 8.19 This is an enlargement of the posterior plot for the logIncome parameter. The bimodality is more evident
Fig. 8.20 This is an example of a reference line at 0
graph. So, in Fig. 8.19, 50.5% is to the left, which leaves 49.5% for the right. I show an example of the logIncome posterior density curve in Fig. 8.20 with the reference line at 0. At this point, only 1.6% of the density is to the left of the line, so 98.5% (rounding error) is to the right. How are the reference line and percentages used? If the line is drawn at the mean, as in Fig. 8.19, then you can assess the skewness of the posterior distribution by how much of the area is to each side of the reference line. In the case of Fig. 8.19, the left side is slightly larger indicating a slight right skewness. Compare this to Fig. 8.21 with the reference line at the median. The two sides are the same size as they should be: 50% each. A key concept for the MCMC procedure is the credible interval. This is comparable to the classical confidence interval, but it is not a confidence interval.
Fig. 8.21 This is a second example of the split of the posterior density into the left and right sides based on the reference line, this time with the line at the median. Notice that each side has 50.0% of the area
It is a probability range for the posterior distribution itself unlike a confidence interval which is not for a distribution per se. For a confidence interval, the bounds are based on a fixed parameter, fixed because this is the basic assumption for the Frequentist approach to estimation and a confidence interval is a Frequentist concept. A confidence interval is random (actually, it is a random interval),21 but not the parameter. The credible interval is a Bayesian concept that treats the parameter as a random variable with a probability distribution. The bounds are fixed for that distribution. The advantage of a credible interval is that you can make a statement that a parameter, such as the population mean, lies within the interval with a certain probability. You cannot make this statement with a confidence interval because an interval either does or does not cover or contain the parameter. Consequently, the credible interval is consistent with what people intuitively expect the interval to represent. There are many credible intervals, but the most commonly used is the highest posterior density interval (HDI). This particular interval is constructed so that every value inside it has higher posterior density than any value outside it; it is the narrowest interval containing the specified share of the distribution's density (or mass). If α = 0.05, the 100(1 − α)% HDI means there is a 95% probability that the true parameter is in that interval.22 The HDI is the thick black line at the bottom of the posterior graph.
21 See Hogg and Craig (1970, Chapter 6).
22 For a good explanation of the HDI, see https://stats.stackexchange.com/questions/148439/what-is-a-highest-density-region-hdr. Last accessed January 7, 2022. Also see Hyndman (1996).
Fig. 8.22 This chart shows a vertical line at the Null Hypothesis value of 0 for the slope parameter of the logPrice variable (β1 ). You can see that 100% of the density under the posterior distribution curve is to the (far) left of the vertical line indicating that the Null Hypothesis is not credible
It is important to emphasize that HDI values are not confidence interval bounds; a confidence interval is a random interval that covers the true parameter in 95% of repeated samples. The HDI bounds are for an actual distribution created by the random sampling. Nonetheless, you can compare the bounds in Fig. 8.15 to the bounds in Fig. 8.11. The default for the HDI is 94%, but I specified 95% since this is consistent with confidence limit usage. Any value outside the HDI interval is interpreted as "not credible." This allows you to make a more intuitive and defensible statement about an estimate. As an example, a common Null Hypothesis for an OLS slope parameter is that it is zero: H0: β1 = 0. In Fig. 8.22, I drew a reference line at the Null value. It is clear that the Null is not credible: 100% of the density under the posterior is to the left of the Null value of 0. This is a stronger statement than what could be made with a confidence interval. In this case, you have the entire (posterior) distribution for the parameter, not just a point estimate.
8.8 Extensions to Other Analyses

The Bayesian framework I just described can be extended to other forms of analysis. It is not restricted to regressions. An extension just requires the specification of appropriate priors and perhaps an appropriate deterministic function in a paragraph. This paragraph allows you to write a formula for a calculation that is independent of the stochastic part of the MCMC operations. I will illustrate this below.
Fig. 8.23 This is the setup for testing the mean of a sample
8.8.1 Sample Mean Analysis

You can study the sample average and check if a Null Hypothesis is credible. I show the setup, trace diagrams, and posterior distribution for the sample of n = 600 consumers for the beverage case study in Figs. 8.23, 8.24, and 8.25. For this example, the target is the average weekly quantity purchased calculated from the diary data. The Null Hypothesis is H0: μ0 = 135. The posterior display shows that the Null is credible since it is within the limits of the HDI. Notice that there is no Paragraph 2 per se since a deterministic formula is not needed. I just left a placeholder for a paragraph.
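A minimal sketch of this mean-testing setup, patterned on the description of Fig. 8.23; the uninformative prior values and the column name Quantity are illustrative assumptions.

import arviz as az
import pymc3 as pm

with pm.Model() as mean_model:
    # Paragraph 1: priors for the population mean and standard deviation
    mu = pm.Normal('mu', mu=0, sigma=100)        # wide, uninformative prior
    sigma = pm.HalfNormal('sigma', sigma=50)

    # Paragraph 2: placeholder; no deterministic formula is needed here

    # Paragraph 3: likelihood for the observed average weekly quantities
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=df['Quantity'].values)

    idata_mean = pm.sample(draws=500, random_seed=42, return_inferencedata=True)

# Reference line at the Null Hypothesis value H0: mu0 = 135
az.plot_posterior(idata_mean, var_names=['mu'], ref_val=135, hdi_prob=0.95)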
8.8.2 Sample Proportion Analysis

You can study the sample proportion and check if a Null Hypothesis is credible. I show the setup, trace diagrams, and posterior distribution for the sample of n = 600 voters in Figs. 8.26, 8.27, 8.28, and 8.29. The target is the proportion of voters who intend to vote in the next election. The DataFrame for this example has a variable named "Vote" that has character string values: "Vote" and "Won't Vote." A list comprehension is used to dummy encode these as 1 = "Vote" and 0 otherwise. The Null Hypothesis is H0: p = 0.50 where p is the population proportion. The posterior display shows that the Null is not credible.
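A minimal sketch of the proportion test, patterned on the description of Fig. 8.26 (binomial likelihood on the sum of the dummy-coded variable); the Beta(1, 1) (uniform) prior and the DataFrame name df_voters are assumptions.

import arviz as az
import numpy as np
import pymc3 as pm

# Dummy encode the intention: 1 = "Vote", 0 otherwise
vote = np.array([1 if v == 'Vote' else 0 for v in df_voters['Vote']])

with pm.Model() as prop_model:
    # Paragraph 1: uniform prior on the population proportion
    p = pm.Beta('p', alpha=1, beta=1)

    # Paragraph 3: binomial likelihood with the observed number of intended voters
    y = pm.Binomial('y', n=len(vote), p=p, observed=vote.sum())

    idata_prop = pm.sample(draws=500, random_seed=42, return_inferencedata=True)

# Reference line at the Null Hypothesis value H0: p = 0.50
az.plot_posterior(idata_prop, var_names=['p'], ref_val=0.50, hdi_prob=0.95)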
Fig. 8.24 These are the trace diagrams for testing the mean of a sample
Fig. 8.25 This is the posterior distribution for testing the mean of a sample. Notice that the vertical line is at the Null Hypothesis value of 135. The line indicates that the Null value is credible
8.8.3 Contingency Table Analysis

I introduced a contingency table in Table 8.1 for a fictitious survey of registered voters. The Surround Question is the political party of the respondents (Democrat or Republican), and the Core Question is their intention to vote in the next presidential election ("Vote" or "Won't Vote"). The research question is the difference between the Democrats and Republicans in their voting intention. A two-sample Frequentist-based Z-test of the difference in proportions (Democrats–Republicans) is used. The Null Hypothesis is that there is no difference in the population proportions and the
Fig. 8.26 This is the setup for testing the proportion for a sample. Notice that the binomial distribution is used with the observed value for the likelihood as the sum of the random variable, which is dummy coded
Fig. 8.27 These are the trace diagrams for testing the proportion of a sample
Alternative Hypothesis is that there is a difference. This is a two-tailed test. I show the results in Fig. 8.29. The analysis in Fig. 8.29, which could certainly be expanded and enhanced with graphs and tables, is typical of traditional survey data analysis. The key assumption of this form of analysis, like for the regression case, is that the population proportion parameters are fixed, non-stochastic numerics. The sample proportions are random variables, but not the population counterparts. This is not the best assumption to make if you take a Bayesian approach which allows you to estimate the posterior distribution for the proportions and the difference between them.
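A sketch of such a two-sample Z-test using statsmodels; the cell counts are the Table 8.1 frequencies quoted in the odds-ratio calculation later in this chapter (301 and 49 for Democrats, 107 and 143 for Republicans), so treat the exact layout as an assumption.

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Numbers intending to vote and the party sample sizes from Table 8.1
count = np.array([301, 107])                # Democrats, Republicans who intend to vote
nobs = np.array([301 + 49, 107 + 143])      # party sample sizes

# H0: no difference in the population proportions (two-tailed test)
zstat, pvalue = proportions_ztest(count, nobs)
print(f'Z = {zstat:.4f}, p-value = {pvalue:.4f}')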
Fig. 8.28 This is the posterior distribution for testing the proportion of a sample. Notice that the vertical line is at the Null Hypothesis value of 0.50. The line indicates that the Null value is not credible
Fig. 8.29 This is the Z-test for the voting study data I summarized in Table 8.1. The p-value indicates that the Null Hypothesis should be rejected: there is a statistical difference in the voting intention between members of the two political parties. This is a Frequentist approach
Fig. 8.30 This is the setup for the MCMC estimation. The results are in Fig. 8.31
I show how to set up a Bayesian estimation of the proportions in Fig. 8.30 and the results in Fig. 8.31. The posterior distributions are in Fig. 8.32 with the posterior for the differences shown separately in Fig. 8.33. Notice in Fig. 8.30 that Paragraph 2 has the difference in proportions specified as a simple statement: pm_dem − pm_rep. Also notice in Paragraph 3 that a Bernoulli distribution is used for the likelihoods for the party proportions. A Bernoulli distribution shows the probability of a discrete, binary random variable which has the values 0 and 1. The probability is p if the value is 1 and 1 − p if the value is 0. This is certainly the case for the voting intention: it is binary with 1 = will vote and 0 otherwise.
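A minimal sketch consistent with the Fig. 8.30 setup as described: Bernoulli likelihoods for each party and a deterministic difference in Paragraph 2. The uniform priors, the Party column name, and its label values are assumptions; pm_dem and pm_rep follow the text.

import arviz as az
import numpy as np
import pymc3 as pm

# Dummy-coded voting intentions (1 = will vote), split by party
dem_votes = np.array([1 if v == 'Vote' else 0
                      for v in df_voters.loc[df_voters['Party'] == 'Democrat', 'Vote']])
rep_votes = np.array([1 if v == 'Vote' else 0
                      for v in df_voters.loc[df_voters['Party'] == 'Republican', 'Vote']])

with pm.Model() as table_model:
    # Paragraph 1: uniform priors on the two party proportions
    pm_dem = pm.Uniform('pm_dem', lower=0, upper=1)
    pm_rep = pm.Uniform('pm_rep', lower=0, upper=1)

    # Paragraph 2: the difference in proportions as a simple deterministic statement
    diff = pm.Deterministic('diff', pm_dem - pm_rep)

    # Paragraph 3: Bernoulli likelihoods for each party
    obs_dem = pm.Bernoulli('obs_dem', p=pm_dem, observed=dem_votes)
    obs_rep = pm.Bernoulli('obs_rep', p=pm_rep, observed=rep_votes)

    idata_table = pm.sample(draws=500, random_seed=42, return_inferencedata=True)

# Reference line at 0: no difference by party
az.plot_posterior(idata_table, var_names=['diff'], ref_val=0, hdi_prob=0.95)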
Fig. 8.31 These are the estimation results for the voting problem. The setup is in Fig. 8.30
Fig. 8.32 These are the posterior distributions for the voting problem. See Fig. 8.33 for an enhancement of the differences in posterior distribution
8.8.4 Logit Model for Contingency Table

As another application example, consider the voting case study once more. The target is the voting intention for the next presidential election, and the feature is the party affiliation. A logit model, as I described in Chap. 5, can be used for this problem. The contingency table in Table 8.1 is just an arrangement of the data, in many instances just for visual display. A model, however, allows for more understanding of relationships and is thus more insightful. The table does have its advantages. One is the simple calculation of the odds of an event. In this example, the odds are for voting. The odds ratio is the odds of voting if a respondent is
Fig. 8.33 This is the posterior distribution for the differences between the voting intentions of the political parties. The reference line is at 0 for no difference by party. You can see that this is not credible
a Democrat versus the odds if a respondent is a Republican. Using the data in Table 8.1, you can calculate the odds ratio in one of the two equivalent ways:
1. by calculating the two conditional probabilities of voting (one for Democrats and the other for Republicans) and then dividing the Democrat conditional by the Republican conditional; or
2. by cross-multiplying the frequencies in Table 8.1: (301 × 143)/(49 × 107).
By either method, you will get an odds ratio of 8.2096. This means that Democrats are over 8 times as likely to vote in the next presidential election as Republicans. You can get the same result by estimating a logit model and exponentiating the estimated parameter as I explained in Chap. 5. I show the estimated logit model in Fig. 8.34 for the data I used to construct Table 8.1. The odds ratio is shown in the display's footer as 8.2096. The problem with this approach is the same as for any Frequentist approach: you do not incorporate any prior information, and you get only point estimates. I show a Bayesian version of this problem in Figs. 8.35, 8.36, and 8.37. The calculation of the odds ratio in Fig. 8.38 is the exponentiation of the party coefficient and then averaging these values.23 This is the same figure as the posterior graph.
23 Be careful how you average. I exponentiated the estimate first for each value in the chains and then averaged these values. You could average the unexponentiated estimates and then exponentiate the average. The latter will produce a smaller value. You need to exponentiate first and then average because each exponentiation is for a separate model.
Fig. 8.34 This is the Frequentist approach to analyzing the contingency table. The odds ratio is 8.2096
The odds ratio is for Democrats vs Republicans, and the value shows that Democrats are 8.46 times more likely to vote. This agrees with the Frequentist conclusion, but you now have the whole distribution for the values, not just a point estimate.
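A minimal sketch consistent with the Bayesian logit just described (Figs. 8.35-8.38); the prior scales and dummy coding direction are assumptions, and the cross-multiplication check in the comment uses the Table 8.1 frequencies.

import arviz as az
import numpy as np
import pymc3 as pm

# Dummy coding: party = 1 for Democrat, 0 for Republican; vote = 1 for "Vote"
party = np.array([1 if p == 'Democrat' else 0 for p in df_voters['Party']])
vote = np.array([1 if v == 'Vote' else 0 for v in df_voters['Vote']])

with pm.Model() as logit_model:
    beta0 = pm.Normal('beta0', mu=0, sigma=10)    # intercept (log-odds for Republicans)
    beta1 = pm.Normal('beta1', mu=0, sigma=10)    # party effect on the log-odds scale

    p = pm.Deterministic('p', pm.math.invlogit(beta0 + beta1 * party))
    y = pm.Bernoulli('y', p=p, observed=vote)

    idata_logit = pm.sample(draws=500, random_seed=42, return_inferencedata=True)

# Odds ratio: exponentiate each draw of the party coefficient first, then average
draws = idata_logit.posterior['beta1'].values.flatten()
odds_ratio = np.exp(draws).mean()
print(f'posterior mean odds ratio = {odds_ratio:.2f}')   # check: (301*143)/(49*107) = 8.21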
8.8.5 Poisson Model for Count Data

You can also estimate a Poisson model for count data. For example, the voter intention survey could have asked respondents, as part of the Surround Questions, how many times they voted in the past 4 years for any type of election. This could be a measure of voter activity. The responses would be 0, 1, 2, 3, or 4 times. I stopped at 4 assuming just one election per year. These values are obviously counts. A Poisson model is appropriate for this.
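A minimal sketch of such a Poisson model for this hypothetical vote-count question; the column name TimesVoted and the prior scale are illustrative assumptions.

import arviz as az
import pymc3 as pm

with pm.Model() as count_model:
    # Prior for the Poisson rate; a half-normal keeps the rate positive
    lam = pm.HalfNormal('lam', sigma=5)

    # Poisson likelihood for the observed counts (0 through 4)
    y = pm.Poisson('y', mu=lam, observed=df_voters['TimesVoted'].values)

    idata_count = pm.sample(draws=500, random_seed=42, return_inferencedata=True)

print(az.summary(idata_count, hdi_prob=0.95))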
Fig. 8.35 This is the setup for estimating the Bayesian version of the voting intentions data. Notice the dummy coding of the data
Fig. 8.36 These are the trace diagrams for the Bayesian logit estimation. Notice the two groups in the probability chart. This reflects the two parties: Republicans (the left group) and Democrats (the right group)
Fig. 8.37 This is the posterior distribution for the party odds ratio for the Bayesian logit voting model. Notice that I used a transform = np.exp argument. This transforms the raw parameter to the odds ratio using the Numpy exp function
Fig. 8.38 This is the party odds ratio distribution for the Bayesian logit voting model. Notice that the mean odds ratio is 8.46. The graph is the same as Fig. 8.37
8.9 Appendix

8.9.1 Beta Distribution

The beta distribution is a very flexible functional form that includes some well-known distributions as special cases.24 The beta pdf is defined as

f(x; α, β) = x^(α−1) × (1 − x)^(β−1) / B(α, β)
with two parameters, α and β. These determine the shape of the distribution. The function, B(α, β), called the beta function, is defined as

B(α, β) = (Γ(α) × Γ(β)) / Γ(α + β)
where Γ(n) is the gamma function which, for a positive integer n, equals (n − 1)!. An important special case is Beta(1, 1) ∼ U(0, 1). For large n, Beta(n × α, n × β) is approximately normally distributed. See Fig. 8.39.
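A small sketch using scipy.stats to evaluate the beta pdf for a couple of (α, β) settings, illustrating the Beta(1, 1) = uniform special case; the chosen parameter values are arbitrary.

import numpy as np
from scipy import stats

x = np.linspace(0.01, 0.99, 5)

# Beta(1, 1) is the uniform distribution on (0, 1): the pdf equals 1 everywhere
print(stats.beta.pdf(x, 1, 1))

# A symmetric, bell-shaped case
print(stats.beta.pdf(x, 5, 5))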
8.9.2 Half-Normal Distribution

The pdf for the normal distribution is

f(x | μ, σ) = (1/√(2 × π × σ²)) × exp(−(x − μ)²/(2 × σ²))

where μ and σ² are two population parameters. The half-normal has the pdf

f(x | σ) = √(2/(π × σ²)) × exp(−x²/(2 × σ²))
for x ∈ [0, ∞). This is referred to as the support. The only parameter for the half-normal is the shape parameter, σ². You could specify the precision instead of the variance where the precision is τ = 1/σ². It is one or the other, not both. See Fig. 8.40.
24 See
https://en.wikipedia.org/wiki/Beta_distribution. Last accessed January 22, 2022.
Fig. 8.39 This is the beta distribution for several settings of the two parameters. Notice that when α = β = 1, the distribution is the uniform distribution
8.9.3 Bernoulli Distribution

The Bernoulli distribution is appropriate when the random variable is binary: 0 and 1. The probabilities are Pr(X = 1) = p and Pr(X = 0) = 1 − p = q. The probability mass function is

p(x; p) = p if x = 1, and 1 − p otherwise.

This is a special case of the binomial distribution with n = 1. The binomial mass function is

p(x; n, p) = C(n, x) × p^x × (1 − p)^(n−x)

where C(n, x) = n!/(x! × (n − x)!) is the binomial coefficient.
Fig. 8.40 This compares the standardized normal and half-normal distributions
Chapter 9
Bayesian Survey Analysis: Multilevel Extension
Contents
9.1 Multilevel Modeling: An introduction
    9.1.1 Omitted Variable Bias
    9.1.2 Simple Handling of Data Structure
    9.1.3 Nested Market Structures
9.2 Multilevel Modeling: Some Observations
    9.2.1 Aggregation and Disaggregation Issues
    9.2.2 Two Fallacies
    9.2.3 Terminology
    9.2.4 Ubiquity of Hierarchical Structures
9.3 Data Visualization of Multilevel Data
    9.3.1 Basic Data Visualization and Regression Analysis
9.4 Case Study Modeling
    9.4.1 Pooled Regression Model
    9.4.2 Unpooled (Dummy Variable) Regression Model
    9.4.3 Multilevel Regression Model
9.5 Multilevel Modeling Using pyMC3: Introduction
    9.5.1 Multilevel Model Notation
    9.5.2 Multilevel Model Formulation
    9.5.3 Example Multilevel Estimation Set-up
    9.5.4 Example Multilevel Estimation Analyses
9.6 Multilevel Modeling with Level Explanatory Variables
9.7 Extensions of Multilevel Models
    9.7.1 Logistic Regression Model
    9.7.2 Poisson Model
    9.7.3 Panel Data
Appendix
The unilevel approach I covered in Chap. 8 is sufficient for many survey-based problems, and, in fact, for many problems whether survey-based or not. It is fundamentally one way of viewing the probabilistic structure of the target variable. There are times, however, as I noted at the beginning of that chapter, when the data structure requires a different approach. That structure is hierarchical with primary
sampling units (PSUs) nested under a larger category of objects so that there are multiple levels to the data. The problem is multilevel as opposed to unilevel as in Chap. 8. I want to extend the unilevel Bayesian framework in this chapter to cover multilevel modeling. The larger category gives context for the PSUs. As a result of the nesting, the parameters at the PSU level are themselves random variables because they are determined by the features at that higher level. As random variables, they are also draws from a probability distribution, the prior distribution in a Bayesian framework. I will develop the ideas for the multilevel problem in this chapter by first discussing the data structures in more detail and then relating the estimation methods to the unilevel Bayesian methods I presented in Chap. 8.
9.1 Multilevel Modeling: An introduction

I described several methods for analyzing survey data in previous chapters. These include tables such as cross-tabs, simple graphics, and regression models. The last includes linear and logistic regression. The regression methods are the most important because they involve explaining a variable, the dependent or target variable, as a function of a set of independent or feature variables. To be more precise, the goal is to explain the variation in the target. These are powerful methods for identifying cause-and-effect or other relationships among survey variables and are quite often used in survey analysis. In addition, they quantify the magnitude of the effects and allow you to make predictive statements. As an example, consider a pricing survey for estimating price elasticities which are important for determining, i.e., predicting, the effect on sales of different price points. Also, they allow you to determine the effect on revenue since a revenue elasticity is directly related to a price elasticity.1 As another example, consider the odds calculation from a logistic regression model. This will tell you the odds that one group will behave one way compared to another group. You could determine, for instance, that female consumers are twice as likely, the odds are twice as high, to purchase a product as male consumers. Although these are powerful methods and should be in your toolkit, they have a major shortcoming. They do not fully take advantage of the underlying structure of the market or institutional arrangement which, based on the survey design, is reflected in the data. As an example of structure, consider consumers buying a particular brand of product. A survey would be conducted to determine the amount purchased and at what price points as in the beverage study I used in the previous chapter. The consumers are asked about prices paid. A simple random sample is collected of shoppers in grocery stores in several locations. A properly constructed questionnaire would contain Surround Questions about the store location such as
1 If η_P^Q is a price elasticity, then η_P^TR = 1 + η_P^Q is the total revenue (TR) elasticity. See Paczkowski (2018) for the derivation.
the city and state and, preferably, the store name. These Surround Questions could be used in the cross-tab and data visualization analysis (i.e., Shallow Analysis) to characterize or profile the survey respondents. If such data are collected, then this is how they would be used. These surround data provide information about market structure. For instance, shopping behavior and consumption preferences vary by state. The same holds if shopping is done strictly online vs in-store. Many people use online shopping services such as InstaCart. McKinsey reported a shift to online.2 Also see a Pew Research report that supports these trends.3 The store type also has an impact. A USDA Economic Research Service report showed that at-home food expenditures varied greatly by store type (mass merchandise, etc.).4 These Surround Question data, however, are often not used in a regression analysis except by a simplified means. If they are ignored in a regression model, then important behavioral differences that transcend the basic behavioral factors are ignored and omitted from the model. For example, regional characteristics may determine purchasing behavior beyond a price effect. If the regions are ignored by excluding a regional variable, then the regression model is subject to the omitted variable bias (OVB): estimated regression parameters are biased. Omitting an important variable is not uncommon. You omit a variable because:
1. you are ignorant of another explanatory variable;
2. you do not have the data for another variable; or
3. you simply made a mistake.
But this leads to an important question: "What is the bias due to the omitted variable?"
9.1.1 Omitted Variable Bias

You can determine the effect of omitting a variable by studying the bias introduced by its omission. To understand this, consider two models. The first, Model I, is the True Model or the Data Generating Model, the one that actually generates the dependent variable, Yi. This model is
2 See “The great consumer shift: Ten charts that show how US shopping behavior is changing” (August 4, 2020). Available at https://www.mckinsey.com/business-functions/marketing-andsales/our-insights/the-great-consumer-shift-ten-charts-that-show-how-us-shopping-behavior-ischanging. Last accessed December 15, 2021. 3 See “Online shopping and Americans’ purchasing preferences” (December 19, 2016). Available at https://www.pewresearch.org/internet/2016/12/19/online-shopping-and-purchasingpreferences/. Last accessed December 15, 2021. 4 See “Where You Shop Matters: Store Formats Drive Variation in Retail Food Prices” (November 1, 2005). Available at https://www.ers.usda.gov/amber-waves/2005/november/where-you-shopmatters-store-formats-drive-variation-in-retail-food-prices/). Last accessed December 15, 2021.
Yi = β0 + β1 × Xi1 + β2 × Xi2 + εi.    (9.1)
The second model, Model II, is the incorrect model, but the one you nonetheless estimate. Using Model II is a specification error. This is written as

Yi = β0* + β1* × Xi1 + εi*.    (9.2)
You estimate Model II because of any of the reasons I cited above, even though Model I is the actual generating process for Yi. Since you estimate Model II, the estimator for β1* is

β̂1* = Σ(Xi1 − X̄1) × (Yi − Ȳ) / Σ(Xi1 − X̄1)².    (9.3)
This looks like the correct estimator, but it is a function only of X1 since X2 is omitted. Remember, however, that Yi was actually generated by Model I! You can substitute the Model I formula for Yi into (9.3) since, again, this is what generated Yi. The resulting expression is easy to simplify using simple algebra. The simplification will show that

β̂1* = β1 + β2 × COR(X1, X2)    (9.4)
so β̂1* is a function of three terms:
1. β1;
2. β2; and
3. the correlation between X1 and X2.
Therefore, β̂1* is biased unless β2 = 0 and/or COR(X1, X2) = 0, in which case it is unbiased. The bias depends on β2 and the correlation. See Kmenta (1971) for a good discussion. There are four specification error cases which I summarize in Table 9.1. Consider the case of including or excluding an irrelevant variable that contributes no explanatory power to the model. Remember, you want to explain the variation in the target, but an irrelevant variable contributes nothing to the explanation. If it is irrelevant, then including it is not a problem because it will be shown that it is insignificant and so will have no effect; you will most likely drop it. If it is omitted, then you also do not have a problem as should be obvious. Next, consider a relevant variable. If you include such a variable, then that is not a problem because this is what you should do anyway. There is only one problem case as you can see from this table: omitting a relevant variable. This could be region in my pricing example. I summarize this problem cell in Table 9.2 in terms of the three features I listed above.
Table 9.1 This table summarizes the four cases for omitting an important variable and including an irrelevant one.

              Omitted        Included
Relevant      Problem        No Problem
Irrelevant    No Problem     No Problem
Table 9.2 This summarizes the bias possibilities for the problem cell in Table 9.1. "Positive" is positive bias; "Negative" is negative bias.

           COR(X1, X2) > 0    COR(X1, X2) < 0
β2 > 0     Positive           Negative
β2 < 0     Negative           Positive
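A small simulation sketch of the omitted variable bias summarized in (9.4) and the tables above; the data-generating values are arbitrary illustrative choices.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5_000

# Model I (the data-generating model): Y = 1 + 2*X1 + 3*X2 + e, with X1 and X2 positively correlated
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

# Model II (the misspecified model): omit the relevant variable X2
biased = sm.OLS(y, sm.add_constant(x1)).fit()
print(biased.params)     # the X1 slope is pushed above 2: positive beta2, positive correlation

# Including the relevant variable removes the bias
unbiased = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(unbiased.params)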
9.1.2 Simple Handling of Data Structure

If a price elasticity estimate is the core survey objective, the estimated elasticity will be biased if an important Surround Question, perhaps marketing region or store type, is omitted from the model. If they are included but only through a simplified approach, then the results will still be biased because the characteristics of the market or institution itself are not included. A simplified approach is to use a dummy variable to capture the effect of a market or institution. For example, you could use a dummy variable to capture different regions or store types. The problem is that the dummies do not capture or reflect the underlying characteristics of the regions or store types. Some stores, for example, cater to health-conscious consumers; others to thrift-conscious consumers; and yet others to convenience shoppers. There are also neighborhood stores that carry products, with appropriate price points, for their local markets, products that would not sell well in other markets merely because of the demographic composition of those markets. Dummy variables are powerful additions to your modeling toolkit, but they are limited in that, almost by definition of how they are created, they do not capture the driving factors for what they represent. A dummy variable for a market does not completely capture the market's characteristics.
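A sketch of this simplified dummy-variable approach using the statsmodels formula interface; the region column name is an assumption.

import statsmodels.formula.api as smf

# C(region) expands the region labels into dummy (indicator) variables
dummy_model = smf.ols('logQuantity ~ logPrice + logIncome + C(region)', data=df).fit()
print(dummy_model.summary())

# The region dummies shift the intercept by region but say nothing about why regions differ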
9.1.3 Nested Market Structures

Another approach is needed to reflect market structures. What structure? It is a nested one in which the primary sampling units (e.g., consumers) are in groups that are subsets of, or are embedded or nested in, other higher-level groups. So, one group is nested or is part of another group. Each group has its own characteristics that help determine the behavior of the lower-level group.
Nested structures are everywhere, and in some instances are reflected in survey designs. An example is a cluster survey design in which homogeneous groups or clusters are formed for a study or are naturally formed in the market. If they are formed for a study, then they enable a more efficient way to collect data. The PSUs are assumed to be homogeneous within a cluster but heterogeneous across clusters. A random sample of clusters is selected and then the PSUs are sampled. You have a one-stage cluster sampling design if all the units within a cluster are surveyed. You have a two-stage cluster sampling design if the sampling units within each group are randomly sampled; that is, you first sample the clusters and then the units within each sampled cluster. See Levy and Lemeshow (2008, Chapter 8) for a thorough overview of cluster sampling. Also see Cochrane (1963) and Yamane (1967, Chapter 8). Yamane (1967, p. 187) notes that a sampling survey is "easier in terms of preparation, cost, and administration if the sampling units that are to be selected" are in clusters. But Levy and Lemeshow (2008, p. 228) note that there are some disadvantages, primarily high standard errors. For our purpose, it is sufficient to note that cluster sampling is an example of hierarchical structures. Levy and Lemeshow (2008, p. 225) provide a few examples of clustering such as a person nested in a household which is nested in a city block; and a student nested in a classroom which is nested in a school. The city block and school are clusters. The person and student are the PSUs. As another example, a healthcare company with a large network of hospitals located in multiple states wants to measure patient satisfaction. It could sample hospitals (i.e., clusters of patients) and then sample patients within each hospital to measure patient satisfaction. Incidentally, this could be extended to a three-stage cluster sampling design by sampling hospitals as clusters, then medical units as sub-clusters, and finally the patients within the hospitals and units. The patients are nested within the medical units which are nested within the hospitals. In the nested structure, the nests have characteristics that vary and need to be accounted for. Multilevel modeling does this.
9.2 Multilevel Modeling: Some Observations

Let me first discuss some ways to handle multilevel data. In a typical, non-Bayesian framework, nested data are either aggregated or disaggregated, depending on the problem and analytical sophistication. Variables can, of course, be aggregated or disaggregated to a different level if the nested structure is recognized but not used, perhaps because it is considered to be unimportant or because it is viewed as too complicated to handle. Aggregation or disaggregation are sometimes done to hide or avoid data complexities. I will discuss both in the following subsections.
9.2.1 Aggregation and Disaggregation Issues

Aggregation means redefining data from a low level to a high one. You can do this for cross-sectional and time series data. For example, you can sum sales at an individual store level (i.e., a low level) and then use this sum for a marketing region total sales (i.e., a high level). Your analysis would be at the region or high level. You would do the same with time series if you have a tracking study. You could average, for example, daily data to get monthly data because daily data are nested within monthly periods. Disaggregation means redefining data from a high level to a low one. For example, you could divide marketing region sales by the number of stores in a region to get average store sales. Your analysis would then be at the store or low level. I illustrate this in Fig. 9.1. Aggregation and disaggregation are common in all forms of data analysis, whether with survey data, experimental data, or observational data. Data at one level are "moved" to another level. All the data are then at one level so that standard statistical/econometric methods (e.g., OLS, ANOVA) can be applied. There are problems, however, with aggregation and disaggregation. Aggregation may make the problem "easier" in that you will have less data to manage. But the price to pay is that information, what is needed for decision making, becomes hidden or obscure. Statistically, there is a loss of power for statistical tests and procedures. More importantly, you will overlook weights needed for the estimation when you aggregate data. Typical aggregation procedures, such as summing and averaging, implicitly assume that all the units have equal importance and so are equally weighted. Basically, for the OLS regression model, Yij = β0 + β1 × Xij + εij, where i indexes the observation and j the group the observation belongs to, it is assumed that the two parameters, β0 and β1, are constant for each observation regardless of groups. Therefore, they are still constant after aggregation. So, this model after aggregation, say, by averaging, becomes

Ȳj = β0 + β1 × X̄j + ε̄j.    (9.5)
Notice the j subscripts. This assumption is more for convenience than anything else. The convenience is that straight least squares methods can be used with the
Fig. 9.1 This illustrates the aggregation and disaggregation possibilities for cross-sectional survey data. You could do the same with time series data.
aggregated data. However, these parameters will most likely not be constant for the observations across the groups that were aggregated. The model should be

Ȳj = β0j + β1j × X̄j + ε̄j    (9.6)
so that the parameters vary by group but are constant within a group. As a result, the expected value of each estimated coefficient is a weighted average of the data where the weights sum to 1.0. Ignoring these weights leads to estimation issues. A further discussion of this is beyond the scope of this chapter although for this chapter the implication of assuming constancy is that the parameters are themselves not functions of other features. Something determined these parameters and caused them to differ by groups. But what? See Theil (1971), Pesaran et al. (1989), and Moulton (1990) for detailed theoretical discussions of this aggregation problem from a non-multilevel modeling perspective. Also see Roux (2002) for comments about this constancy assumption. Disaggregation also has its problems. Data are "blown up": each higher-level value is copied to every lower-level unit, which is problematic. Statistical tests assume that the data are independent draws from a distribution, but they are not since they have a common base, thus violating this key assumption. Also, sample size is affected since measures are at a higher level than what the sampling was designed for. See Paczkowski (2018) for aggregation and disaggregation problems.
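As a small pandas sketch of the aggregation step described in this subsection; the DataFrame name df_stores and the region and sales column names are assumptions.

import pandas as pd

# Aggregate store-level (low level) sales up to marketing regions (high level)
region_sales = df_stores.groupby('region')['sales'].sum()

# Averaging instead of summing still weights every store equally, which is the
# implicit equal-importance assumption discussed above
region_avg = df_stores.groupby('region')['sales'].mean()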
9.2.2 Two Fallacies

There are two subtle issues, two fallacies, associated with converting nested data to a single level:
1. the Ecological Fallacy; and
2. the Atomistic Fallacy.
The Ecological Fallacy occurs when aggregated data are used to draw conclusions about disaggregated units. For example, you may model sales and estimate price elasticities at the marketing region level and then use these elasticities to predict price response or buying behavior at the store level. What holds at the region level may not hold at the store level. Each store has its own defining characteristics, such as its clientele and their associated socioeconomic (SES) attributes. See Roux (2002) and Wakefield (2009) for good summaries of this fallacy. Also see Robinson (1950) for an early discussion of this issue. The Atomistic Fallacy occurs when disaggregated data are used to draw conclusions about aggregate units. For example, you may model individual consumers from Big Data and then apply the results to a whole market. See Roux (2002).
Fig. 9.2 This illustrates the relationship between the Level 1 and Level 2 units in a multilevel data structure.
9.2.3 Terminology

In a nested, multilevel, or hierarchical data structure, data are measured at a lower level but within the context of a higher level.5 The low level is Level 1, or the micro level. The high level is Level 2, or the macro level. The macro level is interpreted as giving context or meaning to the micro level in the sense that it influences the micro level. You can imagine Level 1 units as lying below Level 2 units. Level 1 units are the main study units. In terms of sampling design, Level 1 units are the primary sampling units. In a cluster sampling of apartment buildings, the buildings are Level 2 and the households within the buildings are Level 1. I illustrate this relationship in Fig. 9.2. There is a connection between the level structure of data and the two fallacies I mentioned above. When you disaggregate data from Level 2 to Level 1 and then use the disaggregated data to make a statement about Level 2, you commit the Atomistic Fallacy. When you aggregate data from Level 1 to Level 2 and then use the results to make statements about Level 1, you commit the Ecological Fallacy. I illustrate this in Fig. 9.3.
9.2.4 Ubiquity of Hierarchical Structures

Examples of hierarchical structures are more common than thought. In marketing and pricing surveys, for instance, you have
• Segments;
• Stores;
• Marketing regions;
5 The terms "nested", "multilevel", and "hierarchical" are used interchangeably. I prefer "multilevel."
Fig. 9.3 This illustrates the connection between the levels and the two fallacies.
• States;
• Neighborhoods;
• Organization membership; and
• Brand loyalty
to list a few. Many more could be listed. Some data are naturally hierarchical or nested. For example:
• Family;
• Neighborhood;
• Store;
• City; and
• Segment.
You need to account for the nesting and interactions between the nests. See Oakley et al. (2006) and Ray and Ray (2008).
9.3 Data Visualization of Multilevel Data

It is well known that you must graph your data to fully understand any messages inside the data: relations, trends, patterns, anomalies. These include scatter plots for continuous measures and bar charts for discrete ones, as I discuss in Chaps. 3 and 4. Also see Paczkowski (2022) for an extensive discussion of data visualization.
9.3.1 Basic Data Visualization and Regression Analysis

Typical scatterplots used with hierarchical data will not suffice for visualization because the hierarchical groups could be obscured or hidden. In addition, a regression line through the data may indicate the wrong fit, which will, of course, lead to wrong conclusions and recommendations. I illustrate this using Fig. 9.4 for six groups of data.

To create Fig. 9.4, I first wrote a function that generates data sets for the model Yi = β0 + β1 × Xi + εi. The settings for the model (i.e., the intercept, β0, and the low value for a uniform random variable) are defined in a dictionary. These settings vary by the six groups. The function calls relevant values from the dictionary as needed. The X variable is randomly drawn from a uniform distribution, and the disturbance term is randomly drawn from a normal distribution scaled by a standard deviation. The slope for the model is fixed at β1 = −2.0. The function is called six times to create six separate DataFrames, each of size n = 50, which are vertically concatenated to create one DataFrame of n = 300 rows. A group indicator is included to identify which group the observations belong to. I show the estimated model using these data in Fig. 9.4.

You can see from the estimated model that the slope of the regression line is 4.3195 despite the data being created with β1 = −2.0. There is no accounting for the six groups, which leads to this result. This is a pooled regression model, and it is misspecified by omitting the relevant variable for group membership. The slope is clearly biased, as I discussed above. I plotted the generated data, which I show in Fig. 9.5, with a superimposed estimated pooled regression line based on the model in Fig. 9.4. Notice the positively sloped line for the pooled regression. The graph also shows a separate regression line for each of the six groups. Notice that these all have the same negative slope, as they should since the model used to generate the data had a slope of −2.0. The implication is that each group has a negative relationship between X and Y, but the pooled group has a positive relationship.

One possible way to handle this issue is to include a series of dummy variables in the model, one dummy for each group, omitting, of course, one dummy variable for a base group. This omission avoids the Dummy Variable Trap. Including a dummy variable without any interactions has the effect of shifting the estimated regression line by changing the intercept, not the slope. A series of parallel regression lines will be produced, the intercept of each line being equal to the model's estimated intercept plus the relevant dummy's estimated parameter. The line for the base group will have an intercept equal to the regression's estimated intercept. I show how this regression is specified in Fig. 9.6. By including a series of dummy variables, I have effectively taken the entire pooled data set and split it into smaller pieces by unpooling it. Consequently, this form of regression analysis is an unpooled regression analysis. The pooled data are unpooled (a verb). The estimated coefficient for X, the slope for each regression line, is β̂1 = −2.2862, which is close to the β1 = −2.0 I used to generate the data.
Fig. 9.4 This illustrates a pooled regression model. The top portion of the figure shows the Python code to generate the data. The bottom shows the regression results. Notice the positive, and highly significant, slope estimate for the X variable, even though −2.0 was used to generate each separate data set.
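The exact code is in Fig. 9.4 and is not reproduced here; the following is a rough sketch of the same idea under assumed settings. The per-group intercepts and uniform ranges in the settings dictionary are illustrative, not the values used for the figure:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Hypothetical per-group settings: intercept and the low end of the uniform draw for X.
settings = {g: {'b0': 10 + 5 * g, 'low': 2 * g} for g in range(6)}

def make_group(g, n=50, b1=-2.0, sigma=1.0):
    """Generate one group's data for Y = b0 + b1*X + e."""
    s = settings[g]
    x = rng.uniform(s['low'], s['low'] + 3, n)
    e = rng.normal(0, sigma, n)
    y = s['b0'] + b1 * x + e
    return pd.DataFrame({'Y': y, 'X': x, 'group': g})

# Six groups of n = 50, concatenated into one DataFrame of n = 300 rows
df = pd.concat([make_group(g) for g in range(6)], ignore_index=True)

# Pooled regression: ignores group membership, so the slope estimate
# can come out positive even though each group was generated with slope -2.
pooled = smf.ols('Y ~ X', data=df).fit()
print(pooled.summary())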
Fig. 9.5 This illustrates the pooled data. Notice the positively sloped regression line for the pooled data and the negatively sloped lines for each of the six groups.
The base intercept is the overall regression intercept: 10.2120. The intercept for the first line is the overall intercept plus the first dummy's estimated coefficient: 10.2120 + (−2.0159) = 8.1961. The intercept for each group's regression line is similarly calculated. The lines shift down to the left for higher and higher groups. It is instructive to note that the estimated dummy coefficients are the shift, or change, in the intercept due to the groups; they measure the effects due to groups. See Paczkowski (2018) for an extensive discussion of dummy encoding and modeling using dummy variables. Also see Gujarati (2003) on the Dummy Variable Trap.

The problem of different groups, that is, different hierarchical levels in the data, could be handled by dummy variables as I just showed. In fact, this is an effective way to handle different groups. But there is a problem: the number of dummies can proliferate as the number of groups in the hierarchy increases. Matters become even more complicated if each group has a separate slope so that there is an interaction between a group dummy and a feature variable. My example has a constant slope for each group. Interactions are handled by multiplying each dummy by the X variable. I illustrate this in Fig. 9.7. Notice that there is a separate dummy estimate for each group as before plus a single slope estimate, but there are also five additional slope-intercept interaction terms. The intercept for the first group is calculated as above, although the values are different, as should be expected: the intercept is 10.3513 + (−2.0858) = 8.2655. The slope of the line for this first group is the sum of the overall slope plus the interaction coefficient for this first group: (−2.4717) + 0.0858 = −2.3859.
Fig. 9.6 This illustrates the pooled regression with dummy variables. There are six groups in the generated data from Fig. 9.4. One less dummy is used to avoid the Dummy Variable Trap.
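Continuing the hypothetical data set from the sketch above, the dummy-variable specification described in the text might look like the following; the use of patsy's C() treatment coding (which omits one level as the base) is a sketch, not the book's actual code:

# Unpooled (dummy variable) regression on the same hypothetical df:
# C(group) adds five dummies (one level is the omitted base), shifting the
# intercept per group while a single common slope is estimated for X.
unpooled = smf.ols('Y ~ X + C(group)', data=df).fit()
print(unpooled.params)

# Intercept for a non-base group = overall intercept + that group's dummy coefficient,
# e.g., group 1's intercept:
b0_group1 = unpooled.params['Intercept'] + unpooled.params['C(group)[T.1]']

# Interacted version: each group also gets its own slope via the X:C(group) terms
interacted = smf.ols('Y ~ X * C(group)', data=df).fit()
print(interacted.params)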
Fig. 9.7 This illustrates the pooled regression with dummy variables interacted with the X variable. Notice the added estimated coefficients.
The slopes for the other groups are similarly calculated. The proliferation of coefficients is clear, and the burden of interpreting the results grows as well. Another way to handle the grouping issue is to physically unpool the data and estimate a separate model for each group. For my example, just unpool the n = 300 observations into six groups of n = 50 each. This is a simple-minded approach. It
does not take into account the advantage of the efficiency of pooling (i.e., more data points) to get better estimates of standard errors, or allow for a common slope effect if this is what you hypothesize. The standard error issue is handled by pooling, which gives you more degrees-of-freedom for the MSE calculation. See Neter et al. (1989, p. 355) for a brief discussion of this issue.

Aside from the proliferation of coefficients, there is the issue of what caused the estimated dummy variable values to be what they are. That is, what drives or determines each group, and what makes each group differ from the others? If you study patient satisfaction within a chain of hospitals, what features of the hospitals influence or drive the satisfaction ratings? You could cite the number of nurses, courtesy and caring of staff, promptness of care, and so on, but these are all Level 2 features. Dummy variables that represent individual hospitals are not a function of any of these features. The levels in a hierarchical data structure have their own defining factors, but these are not and cannot be included using dummy variables. Another mechanism is needed, and this is where multilevel modeling can be used.
9.4 Case Study Modeling

I will now examine several models for the Case Study from Chap. 8. The dependent and main independent variables are on the natural log scale for each model. The models are:
1. pooled;
2. unpooled or dummy variable; and
3. multilevel.
It may seem that I am needlessly repeating the modeling procedure that I discussed in the previous chapter for the pooled and dummy variable treatments. This is necessary to have benchmarks for the multilevel method, which I focus on in this chapter. View this as a convenience for this chapter.
9.4.1 Pooled Regression Model

The first model is based on pooling the Case Study data. The pooled regression model is

Q_i = e^{\beta_0} \times P_i^{\beta_1} \times I_i^{\beta_2} \times e^{\epsilon_i}   (9.7)

or

\ln(Q_i) = \beta_0 + \beta_1 \times \ln(P_i) + \beta_2 \times \ln(I_i) + \epsilon_i   (9.8)
where Pi is the price paid by the ith consumer and Ii is that consumer’s income. This is the model I used in Chap. 8. The price elasticity is simply β1 . Notice that I did not include the store location. The reason is that this model does not take into account the hierarchical market structure. All the consumers are viewed as being in a random sample without regard to their location. This is, of course, a model misspecification that I will deal with below. I display the results for this model in Fig. 9.8 and its associated ANOVA table in Fig. 9.9. The price elasticity is −2.3879 (as in Fig. 8.11 of Chap. 8) so the beverage is highly price elastic. This makes sense since there are many beverage products available, including plain water. This suggests that a uniform price be offered.
9.4.2 Unpooled (Dummy Variable) Regression Model
There is a structure to the market that should be taken into account. You could impose structure with dummy variables. For example, you could segment consumers into homogeneous groups, say J segments, where the segments could be defined a priori or derived, perhaps using a clustering algorithm. Regardless of how you form the segments, you would pool all the consumers into one model as I did in the previous section, but now include J − 1 dummies to identify the groups. The use of the dummies effectively unpools the data. As I noted before, you can more efficiently model the groups with one model with dummies identifying the groups than by estimating a separate model for each group. For the Case Study, a location dummy can be added to the basic model:

Q_i = e^{\beta_0 + \gamma_1 \times Location_i} \times P_i^{\beta_1 + \gamma_2 \times Location_i} \times I_i^{\beta_2} \times e^{\epsilon_i}   (9.9)

or

\ln(Q_i) = \beta_0 + \gamma_1 \times Location_i + \beta_1 \times \ln(P_i) + \gamma_2 \times Location_i \times \ln(P_i) + \beta_2 \times \ln(I_i) + \epsilon_i   (9.10)

where

Location_i = \begin{cases} 1 & \text{if Suburban} \\ 0 & \text{if Urban} \end{cases}   (9.11)

The γ1 coefficient shifts the intercept and defines the groups:

\beta_0: \text{Urban}   (9.12)
\beta_0 + \gamma_1: \text{Suburban}.   (9.13)

The γ2 coefficient gives a different slope for each group. Price elasticities are then:
Fig. 9.8 This is a summary of the pooled regression which ignores any market structure.
Fig. 9.9 This is ANOVA table of the pooled regression in Fig. 9.8.
\beta_1: \text{Urban}   (9.14)
\beta_1 + \gamma_2: \text{Suburban}   (9.15)
I show the regression results in Fig. 9.10. Notice that the urban elasticity is −1.2 and the suburban elasticity is −0.6 (= −1.2086 + 0.5946 with rounding). These different price elasticities are intuitively obvious once the level of competition between the two areas is accounted for. There is more competition in urban areas than in suburban areas. This suggests a discriminatory pricing structure with prices higher in the suburban area and lower in the urban area.
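A sketch of how the specification in (9.9)-(9.10) might be estimated with statsmodels; the DataFrame (case_df) and its column names are assumptions, not the book's actual names, since the Case Study data are not reproduced here:

import statsmodels.formula.api as smf

# Assumed DataFrame 'case_df' with columns logQ, logP, logI, and location ('Suburban'/'Urban')
case_df['suburban'] = (case_df['location'] == 'Suburban').astype(int)  # Urban is the base

model = smf.ols('logQ ~ suburban * logP + logI', data=case_df).fit()
print(model.summary())

# Price elasticities by location, following (9.14) and (9.15)
beta1 = model.params['logP']            # urban price elasticity
gamma2 = model.params['suburban:logP']  # elasticity shift for suburban stores
print('Urban elasticity:   ', round(beta1, 4))
print('Suburban elasticity:', round(beta1 + gamma2, 4))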
9.4.3 Multilevel Regression Model

The pooled model does not account for any variations in the target variable due to the groups the stores belong to, either suburban or urban. Basically, all the observations were placed into one big pot and a model was estimated for the combined data. I implicitly assumed that the intercept and slopes are the same regardless of the location. The model with the dummy variables assumed just the opposite: that there is a different intercept for suburban and urban consumers and that the slope parameters also differ by location. So, in the first model, the parameters do not change by store location, and in the second they do change. The first is a pooled regression model and the second is an unpooled regression model. The first asserts that there is no differentiation in purchases between the two locations; the second asserts that there is differentiation. The unpooled model has a drawback, even though it seems that it captures or reflects the market structure and provides reasonable and intuitive price elasticities for suburban and urban consumers.
Fig. 9.10 This is a summary of the pooled regression with dummy variables reflecting the market structure. The structure is Suburban and Urban. The location dummy variable was interacted with the price variable to capture differential price elasticities.
It fails to allow for any driving factors that determine or explain why the parameters should differ by location. This is not very helpful or insightful from a marketing strategy recommendation perspective, which is what the survey was supposed to support. A logical question you could, and should, ask is why there is a difference. The reason may be exploitable to give the business a competitive advantage. The same could be said for a public policy study where, perhaps, the amount of time spent studying political issues is the target and income and political party affiliation are the features explaining the target. A pooled model would assert that the effect is the same regardless of affiliation. An unpooled model would identify the impact of affiliation, but it would not explain that impact. From a political strategy point-of-view, the strategists would have as much actionable insight as if they did not have the study results.

The use of a pooled regression model is a form of Deep Data Analysis. The use of the unpooled regression model with dummy variables makes it even deeper. This is, however, true when there is no hierarchical structure to the data. When there is a hierarchical data structure, then the pooled model reverts to the Shallow Data
Analysis rank. The same holds for the unpooled regression model. Some insight is gained by either, although the unpooled is better than the pooled model; nonetheless, more insight and actionable information could be gained by another approach. The other approach is a multilevel regression model, which is a Bayesian approach based on Bayes' Rule that allows flexibility in modeling the levels in a hierarchical data structure. The parameters of a model are themselves random draws from probability distributions, each with a mean and standard deviation. The mean of a parameter could itself be made a function of other variables so that the mean has its own regression model. Those other variables could be factors that define or clarify the higher level in the hierarchical structure. For example, the price parameter for the model at the consumer level (the lowest level in a hierarchical structure, or Level 1) could be a function of location, but location could be modeled as a function of convenience factors such as parking (readily available in suburban areas, scarce in urban areas) proxied, perhaps, by store size: the larger the store, the more convenience factors that are or can be made available. The advantage of this approach is that other information can be used in the analysis, information that would otherwise be ignored.
9.5 Multilevel Modeling Using pyMC3: Introduction

There are three types of multilevel models:
1. Varying intercept;
2. Varying slope; and
3. Varying intercept and slope.
These are the same ones I handled by the dummy variable approach. The basic OLS model has fixed intercept and slope regardless of the data structure. The models I summarized in Figs. 9.4 and 9.8 are based on this assumption. But as you saw for Fig. 9.4, this may not be the best way to handle the structure. The dummy variable version allowed for the varying intercept and did better. I could have interacted the dummy variable and the X variable to allow both the intercept and slope to vary. In each formulation, the parameter of interest is still fixed in the population even though it will vary based on the setting of the dummy variable. It is fixed for a setting of the dummy but varies by the dummy setting. The multilevel framework also allows for these three possibilities, but does so by accounting for the randomness of the parameter. The parameter of interest (i.e., intercept, slope, or both) is viewed as a random variable because that parameter is itself based on a model of the features at a higher level, say Level 2. The three types of multilevel models I listed above have parameters that are random variables.
9.5.1 Multilevel Model Notation

The notation for a multilevel model can be confusing. Traditional notation, the one used in almost all statistics and econometrics books, uses a double subscript to indicate an observation and a level for that observation. The feature variable X would be written as Xij to indicate that the Level 1 observation, i, depends on the Level 2 unit, j. For our beverage example, i is the consumer and j is the store location, so logPriceij is the log of the price paid by consumer i (i.e., Level 1) who shops at store j (i.e., Level 2). The double subscript is acceptable and logical notation. Gelman and Hill (2007, Chapter 1, p. 2), however, introduced another notation they believe makes it clear that the observation unit, i (e.g., a consumer), at Level 1 is nested in a higher level, Level 2. This is Xj[i]. With this notation, observation i, the Level 1 observation, is nested inside j, the Level 2 observation. You can read this as "i is contained in j". The target observation is still written as Yi because this is independent of the level; an observation on the target is just an observation. This also holds for an observation on the feature variable. I will adopt this notation.
9.5.2 Multilevel Model Formulation

Using this notation, the three multilevel models are

Varying intercept: Y_i = \beta_{0,j[i]} + \beta_1 \times X_i + \epsilon_i
Varying slope: Y_i = \beta_0 + \beta_{1,j[i]} \times X_i + \epsilon_i
Varying intercept and slope: Y_i = \beta_{0,j[i]} + \beta_{1,j[i]} \times X_i + \epsilon_i.

Recall, however, that the parameters are random variables that can be modeled as a function of features at Level 2. For the beverage Case Study, Level 2 is the store location: Suburban or Urban. Shopping at these stores could be a function of the value of time to get to the store, store amenities, waiting time to check out as a function of the number of checkout stations, or even the socioeconomic composition of the immediate or surrounding area of the store. Regardless of the feature, the parameters could be modeled as

Varying intercept: \beta_{0,j} = \gamma_0^{\beta_0} + \gamma_1^{\beta_0} \times Z_j + u_j^{\beta_0}
Varying slope: \beta_{1,j} = \gamma_0^{\beta_1} + \gamma_1^{\beta_1} \times Z_j + u_j^{\beta_1}
Varying intercept and slope: \beta_{0,j} = \gamma_0^{\beta_0} + \gamma_1^{\beta_0} \times Z_j + u_j^{\beta_0} and \beta_{1,j} = \gamma_0^{\beta_1} + \gamma_1^{\beta_1} \times Z_j + u_j^{\beta_1},

where Z is the feature at level j of Level 2. Notice that the nesting subscript notation is not used because the Level 2 parameters are modeled. The γs are the hyperparameters. The u_j term is a disturbance term for Level 2 and, as such, it is
assumed to follow the Classical Assumptions for a disturbance term, in particular normality. Consequently, you can write

\beta_{0,j} \sim N(\gamma_0 + \gamma_1 \times Z_j,\ \sigma_{u,j}^2)   (9.16)
\beta_{1,j} \sim N(\gamma_0 + \gamma_1 \times Z_j,\ \sigma_{u,j}^2).   (9.17)

These are the priors for the parameters. If you do not have any information for the means, then you would simply have

\beta_{0,j} \sim N(0,\ \sigma_{u,j}^2)   (9.18)
\beta_{1,j} \sim N(0,\ \sigma_{u,j}^2).   (9.19)

There is actually another prior: the σ term for the Level 1 model. This can be specified as a half-normal distribution, as I used in Chap. 8. As an example for the Case Study, the varying-intercept model is

\log Quantity_i = \beta_{0,j[i]} + \beta_1 \times \log Price_i + \beta_2 \times \log Income_i + \epsilon_i   (9.20)
\epsilon_i \sim N(0, \sigma^2)   (9.21)
\beta_{0,j} \sim N(0,\ \sigma_{u,j}^2)   (9.22)
\beta_1 \sim N(0,\ \sigma_u^2)   (9.23)
\sigma \sim HN   (9.24)
9.5.3 Example Multilevel Estimation Set-up

I show the set-up for this model's estimation in Fig. 9.11 and the results in Fig. 9.12. There are several observations to make about this set-up.
1. There is an index (idx) for each location. This is a variable created with the Pandas Categorical function. This function reads the character strings identifying the store location, extracts the unique strings, sorts them in alphanumeric order, and then assigns integer values to each string. There are only two locations, so the integers are 0 and 1, where 0 = Suburban and 1 = Urban. If there were three locations, then the integers would be 0, 1, 2. This index is used to identify the levels of Level 2.
2. A group variable has a count of the number of levels of the index; 2 in this case. This is used to create vectors.
3. Paragraph 1 is longer because there are more priors, but the layout is the same as the one I used in Chap. 8. The location, price, and income blocks set the hyperparameters for the model. These parameters (gamma, beta_1, and beta_2) are used in Paragraph 2.
Fig. 9.11 This is the set-up for estimating the multilevel model. I show the results in Fig. 9.12.
4. Paragraph 2 has a slight change: the model parameters (gamma, beta_1, and beta_2) are indexed by the idx variable using the [idx] notation. This allows the parameters to vary by level of the Level 2 variable, which is location in this example. This is not dummy notation. With dummy notation, since there are two groups, only one parameter would be estimated. With this set-up, a parameter for each group is estimated.
5. Paragraph 3 sets the likelihood, as before.
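The actual set-up is in Fig. 9.11; the following is a rough sketch of what the three paragraphs might look like in pyMC3, with the DataFrame name (case_df) and its column names assumed rather than taken from the book:

import pandas as pd
import pymc3 as pm
import arviz as az

idx = pd.Categorical(case_df['location']).codes   # 0 = Suburban, 1 = Urban
n_groups = len(set(idx))                          # 2 locations

with pm.Model() as multilevel_model:
    # Paragraph 1: priors, one parameter per location (vectors of length n_groups)
    gamma  = pm.Normal('gamma',  mu=0, sigma=10, shape=n_groups)  # intercepts
    beta_1 = pm.Normal('beta_1', mu=0, sigma=10, shape=n_groups)  # log-price slopes
    beta_2 = pm.Normal('beta_2', mu=0, sigma=10, shape=n_groups)  # log-income slopes
    sigma  = pm.HalfNormal('sigma', sigma=10)

    # Paragraph 2: expected value, with parameters indexed by location via [idx]
    mu = (gamma[idx]
          + beta_1[idx] * case_df['logPrice'].values
          + beta_2[idx] * case_df['logIncome'].values)

    # Paragraph 3: likelihood
    y = pm.Normal('logQuantity', mu=mu, sigma=sigma,
                  observed=case_df['logQuantity'].values)

    trace = pm.sample(2000, tune=1000, return_inferencedata=True)

az.summary(trace)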
Fig. 9.12 These are the estimation results for the multilevel model I show in Fig. 9.11.
9.5.4 Example Multilevel Estimation Analyses

The same analyses I described in Chap. 8, primarily the posterior distribution and the HDI, are done here. Looking at the results, you can see that there is a separate estimate for each store, a separate estimate for the store/price interaction, and a separate estimate for income.
9.6 Multilevel Modeling with Level Explanatory Variables

I noted several times that a strength of the multilevel modeling approach is that you can include explanatory factors for the priors. These Level 2 factors determine the priors, which in turn determine the Level 1 random variables. The question, of course, is how to include them. The inclusion is actually simple: just define the priors and add another regression model, but for Level 2. I show a possible set-up for the stores example where the waiting time to be served is used as a proxy for store convenience. The consumers who took part in the diary-keeping phase of the study tracked the (approximate) length of time, in minutes, it took to begin the actual check-out process. In other words, they recorded how long they had to wait in line (i.e., a queue) before being served. The service time is a function of the number of servers (i.e., check-out stations) in a grocery store: the more check-out stations, the less time in queue. The more time in queue, the less convenient the store and the higher the total cost (i.e., dollar amount paid plus value of time) of buying the beverage (and all other products). You should expect the urban stores to have longer queues and more time in queue merely because there is less physical space for checkout counters, but they should also have lower (dollar) prices due to competition and as an offset to the longer checkout times. I show the distributions of waiting time in queue vs. the store location in Fig. 9.13 and the relationship between waiting time and price by location in Fig. 9.14. I show the set-up for this explanatory variable for Level 2 in Fig. 9.15. The analysis of the results is the same as for the previous ones I presented and discussed.
9.7 Extensions of Multilevel Models

You are not restricted to a linear model. You could also have a model for discrete data, such as a logistic regression model and a Poisson regression model.
Fig. 9.13 This is the distribution of waiting time in a store’s checkout queue by the store location.
Fig. 9.14 This shows the relationship between waiting time in a store's checkout queue and the (unlogged) price. The store location is included as the legend indicates.
Fig. 9.15 This is the regression set-up for the inclusion of an explanatory variable for Level 2.
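A minimal sketch of what the Fig. 9.15 set-up might look like: the prior mean for the location intercepts is itself a Level 2 regression on waiting time. All names (case_df, wait_time, logPrice, logIncome, logQuantity) are assumptions, not the book's actual identifiers:

import pandas as pd
import pymc3 as pm

idx = pd.Categorical(case_df['location']).codes                 # 0 = Suburban, 1 = Urban
# One Level 2 value per location; groupby order matches the alphanumeric order of the codes
wait = case_df.groupby('location')['wait_time'].mean().values

with pm.Model() as level2_model:
    # Hyperpriors for the Level 2 regression of the location intercepts on waiting time
    g0 = pm.Normal('g0', mu=0, sigma=10)
    g1 = pm.Normal('g1', mu=0, sigma=10)
    sigma_u = pm.HalfNormal('sigma_u', sigma=10)

    # Level 2 model: the location-specific intercepts depend on the waiting-time feature
    gamma = pm.Normal('gamma', mu=g0 + g1 * wait, sigma=sigma_u, shape=len(wait))

    # Level 1 model, as before
    beta_1 = pm.Normal('beta_1', mu=0, sigma=10, shape=len(wait))
    beta_2 = pm.Normal('beta_2', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)

    mu = (gamma[idx]
          + beta_1[idx] * case_df['logPrice'].values
          + beta_2 * case_df['logIncome'].values)
    y = pm.Normal('logQuantity', mu=mu, sigma=sigma,
                  observed=case_df['logQuantity'].values)

    trace = pm.sample(2000, tune=1000, return_inferencedata=True)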
9.7.1 Logistic Regression Model

Suppose you have the data on voting intention that I used in Chap. 8. You still want to know the effect of party affiliation, but now you also want to know how region impacts intentions. There are very definite regional influences on intentions. You might consider the voter intentions and their associated party affiliation as Level 1 variables because they are directly at the primary sampling unit level. But these PSUs are nested within geographic regions that have social and economic characteristics that affect voting intentions. These regional factors are at Level 2 in a nested data structure. The PSUs are nested within regions, which influence the voters. See Jost (2021) and Tarrance (2018) for some interesting regional analyses for the U.K. and the U.S., respectively.
As I stated in Chap. 8, a linear model such as Yi = β0 + β1 × Xi is inappropriate when Yi is binary because there is a chance of predicting outside the range of Yi, which is just 0 and 1. The logistic cumulative distribution function overcomes these issues and gives rise to a tractable model. Other distribution functions can be used, but this is the most common in practice. A logistic model specifies the probability of the outcome (e.g., intending to vote) as a function of explanatory variables. If the voting intentions variable is Yi for the ith voter, then you can write

Logistic = Pr(Y_i = 1) = \frac{e^{\beta_0 + \beta_1 \times X_i}}{1 + e^{\beta_0 + \beta_1 \times X_i}}.   (9.25)
The logistic model is nonlinear. Estimation is simplified by a transformation involving the odds, which are defined as the ratio of the probability of voting over the probability of not voting. The probability is estimated using the log of these odds, called the log odds or logit. The logit is a linear function of the independent variables, which is much easier to work with. For this model,

odds = \frac{e^X}{1 + e^X} \times \frac{1 + e^X}{1} = e^X,

where the second ratio is the inverse of the probability of not voting. The logistic statement and the logit statement are inverses of each other, so

Logistic = Pr(Y_i = 1) = logit^{-1}(\beta_0 + \beta_1 \times X_i).   (9.26)
The logistic model can be easily extended to a multilevel situation for varying intercepts as

Logistic = Pr(Y_i = 1) = logit^{-1}(\beta_{0,j[i]} + \beta_1 \times X_i)   (9.27)

for i = 1, ..., n_j and j = 1, ..., J, the number of levels for Level 2. A varying-intercepts, varying-slopes model is

Logistic = Pr(Y_i = 1) = logit^{-1}(\beta_{0,j[i]} + \beta_{1,j[i]} \times X_i)   (9.28)

with

\beta_{0,j} = \gamma_0^{\beta_0} + \gamma_1^{\beta_0} Z_j + \epsilon_j^{\beta_0}   (9.29)
\beta_{1,j} = \gamma_0^{\beta_1} + \gamma_1^{\beta_1} Z_j + \epsilon_j^{\beta_1}   (9.30)
where the Zj term is an independent variable for Level 2 that affects the intercepts and slopes and indirectly affects voting intentions because of regional influences.
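As a rough sketch (not the book's code), a varying-intercept version of (9.27) could be set up in pyMC3 as follows, assuming arrays y (0/1 intentions), x (a Level 1 feature), region_idx (an integer region code per voter), and n_regions:

import pymc3 as pm

with pm.Model() as logit_model:
    beta0 = pm.Normal('beta0', mu=0, sigma=10, shape=n_regions)  # one intercept per region
    beta1 = pm.Normal('beta1', mu=0, sigma=10)                   # common slope

    # logit^{-1} is the sigmoid, giving Pr(Y_i = 1) as in (9.27)
    p = pm.math.sigmoid(beta0[region_idx] + beta1 * x)
    obs = pm.Bernoulli('obs', p=p, observed=y)

    trace = pm.sample(2000, tune=1000, return_inferencedata=True)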
9.7.2 Poisson Model

You may have just counts of events. For example, you may have survey data on the number of times someone voted in the last 4 years (a count such as 0, 1, 2, 3, 4 times), or the number of shopping occasions someone had in the past month (another count such as 0, 1, 2, 3, 4 times), or the number of times someone has seen a physician in the past year (yet another count such as 0, 1, 2 times). The counts could be a function of individual characteristics such as education, income, and age, which are all Level 1 variables. In each case, the survey respondents could be influenced by ethnic peer pressures and lifestyles, which would be Level 2 factors. For example, there is evidence that about 1 in 4 Hispanics and 1 in 4 African-Americans believe doctor misconduct is a serious problem. Such beliefs about doctors could impact how often they see a doctor. Similarly, characteristics of a medical practice such as parking, in-practice labs, whether or not insurance is accepted (i.e., a concierge medical practice), courtesy and respect of support staff (e.g., nurses, professional assistants), and privacy safeguards, all Level 2 features, could impact the number of times a doctor is seen. See Funk et al. (2019) for some evidence about how medical professionals are viewed by ethnic groups and in general. These count examples would all follow a Poisson Process. The basic Poisson Process is

Y_i \sim Poisson(\theta_i)   (9.31)
\theta_i = \exp(\beta_0 + \beta_1 \times X_i)   (9.32)
where Xi is a personal attribute. In Poisson models, the variance equals the mean, so there is no independent variance parameter, σi2. The implication is that you could have a variance larger than what is predicted by the model. The Poisson model is said to exhibit overdispersion because there is no variance parameter "to capture the variation in the data." See Gelman and Hill (2007, p. 325). The Poisson model can be extended to handle multilevel data as

Y_i \sim Poisson(\mu_i e^{\beta \times X_i + \epsilon_i})   (9.33)
\epsilon_i \sim N(0, \sigma_\epsilon^2)   (9.34)
The \sigma_\epsilon^2 captures the overdispersion; \sigma_\epsilon^2 = 0 is the classical Poisson. See Gelman and Hill (2007) and Snijders and Bosker (2012).
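A sketch of how (9.33)-(9.34) might be coded in pyMC3, assuming arrays x and y and a count n_obs of observations; the observation-level normal term plays the role of the overdispersion error:

import pymc3 as pm

with pm.Model() as poisson_model:
    beta0 = pm.Normal('beta0', mu=0, sigma=10)
    beta1 = pm.Normal('beta1', mu=0, sigma=10)
    sigma_eps = pm.HalfNormal('sigma_eps', sigma=5)
    eps = pm.Normal('eps', mu=0, sigma=sigma_eps, shape=n_obs)  # overdispersion term

    theta = pm.math.exp(beta0 + beta1 * x + eps)  # Poisson rate per observation
    obs = pm.Poisson('obs', mu=theta, observed=y)

    trace = pm.sample(2000, tune=1000, return_inferencedata=True)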
9.7.3 Panel Data

There is one final extension: panel data, also known as longitudinal data, time series-cross-sectional data, or repeated measures data. These are the types of data maintained for a tracking study.
Recall the store example: data were total expenditures of customers by store for 1 month. This is the case where you have multiple repeated measures of the same variable for the same unit, where the units are a sample. For the store example, there would be a three-level model:
1. time measures within an individual;
2. individuals nested within stores; and
3. stores.
Panel models are far more complicated because of potential autocorrelation and the presence of the additional level of nesting. See Snijders and Bosker (2012) for a good discussion.
Appendix
Multilevel Models: A High Level View

A more efficient model structure is needed that reflects the data structure. There is a two-level system of equations reflecting the two levels I introduced in the text.1

Level 1 Equation:

Y_{ij} = \beta_{0j} + \beta_{1j} \times X_{ij} + \epsilon_{ij}

where:
• \epsilon_{ij} \sim N(0, \sigma^2);
• Y_{ij} is the outcome variable for individual i in group j;
• X_{ij} is the individual-level variable for individual i in group j;
• \beta_{0j} is the group-specific intercept; and
• \beta_{1j} is the group-specific effect or slope of the individual-level variable.
Notice that the parameters vary by the groups. You can aggregate this Stage I Equation by averaging the units in each group:

\frac{1}{n_j} \times \sum_{i}^{n_j} Y_{ij} = \frac{1}{n_j} \times \sum_{i}^{n_j} \beta_{0j} + \frac{1}{n_j} \times \sum_{i}^{n_j} (\beta_{1j} \times X_{ij}) + \frac{1}{n_j} \times \sum_{i}^{n_j} \epsilon_{ij}

\bar{Y}_j = \beta_{0j} + \beta_{1j} \times \bar{X}_j + \bar{\epsilon}_j.
1 The following draws from Roux (2002). Also see Gelman and Hill (2007).
This is (9.6). However, in this case, there is a second set of equations for the two parameters, each equation having a set of hyperparameters.

Level 2 Equations:

\beta_{0j} = \gamma_{00} + \gamma_{01} G_j + U_{0j}
\beta_{1j} = \gamma_{10} + \gamma_{11} G_j + U_{1j}

where:
• G_j is a group-level variable;
• \gamma_{00} is the common intercept across groups;
• \gamma_{01} is the effect of the group-level predictor on the group-specific intercept;
• \gamma_{10} is the common slope associated with the individual-level variable across groups; and
• \gamma_{11} is the effect of the group-level predictor on the group-specific slopes.

These γ parameters are called hyperparameters. The error terms in the Stage II Equations are called macro errors:

U_{0j} \sim N(0, \tau_{00}^2)
U_{1j} \sim N(0, \tau_{11}^2)
There could be a covariance between the intercepts and slopes, which is represented by \tau_{01}. The group-specific parameters have two parts:
1. A "fixed" part that is common across groups: \gamma_{00}, \gamma_{01} for the intercept and \gamma_{10} and \gamma_{11} for the slope; and
2. A "random" part that varies by group: U_{0j} for the intercept and U_{1j} for the slope.
The underlying assumption is that the group-specific intercepts and slopes are random samples from a normally distributed population of intercepts and slopes. A reduced form of the model is:

Y_{ij} = \underbrace{\gamma_{00} + \gamma_{01} G_j + \gamma_{10} X_{ij} + \gamma_{11} G_j \times X_{ij}}_{\text{Fixed Component}} + \underbrace{U_{1j} \times X_{ij} + U_{0j} + \epsilon_{ij}}_{\text{Random Component}}.
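To make the reduced form concrete, here is a minimal NumPy simulation of this data-generating process; all parameter values and dimensions are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
J, n_j = 20, 30                                  # 20 groups, 30 individuals each

G = rng.normal(size=J)                           # group-level variable G_j
gamma00, gamma01, gamma10, gamma11 = 1.0, 0.5, -2.0, 0.3
tau0, tau1, sigma = 0.8, 0.4, 1.0

U0 = rng.normal(0, tau0, J)                      # random intercept deviations
U1 = rng.normal(0, tau1, J)                      # random slope deviations

j = np.repeat(np.arange(J), n_j)                 # group index for each individual
X = rng.uniform(0, 5, J * n_j)                   # individual-level variable X_ij
e = rng.normal(0, sigma, J * n_j)                # Level 1 error

# Fixed component plus random component
Y = (gamma00 + gamma01 * G[j] + gamma10 * X + gamma11 * G[j] * X) \
    + (U0[j] + U1[j] * X + e)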
See Roux (2002). Many variations of this model are possible:
• Null Model: no explanatory variables;
• Intercept varying, slope constant;
• Intercept constant, slope varying; and
• Intercept varying, slope varying.
The Null Model is particularly important because it acts as a baseline model with no individual effects. This is a more complicated, and richer, model that can be considered.
• The random component for the error is a composite of terms, not just one term as in a Stat 101 OLS model.
  – A dummy variable approach to modeling the hierarchical structure would not include this composite error term.
  – The dummy variable approach is incorrect: there is a model misspecification.
  – The correct specification has to reflect random variations at the Stage I level as well as at the Stage II level and, of course, any correlations between the two.
• The composite error term contains an interaction between an error and the Stage I predictor variable, which violates the OLS Classical Assumptions.
  – A dummy variable OLS specification would not do this.
References
Agresti, A. 2002. Categorical Data Analysis. 2nd ed. New York: Wiley. Andel, J. 2001. Mathematics of Chance. In Wiley Series in Probabilities and Statistics. New York: Wiley. Bachman, J.G., and P.M. O’Malley. 1984. Yea-Saying, Nay-Saying, and Going to Extremes: BlackWhite Differences in Response Styles. The Public Opinion Quarterly 48 (2): 491–509. Bauer, H., Y. Goh, J. Park, S. Schink, and C. Thomas. 2012. The Supercomputer in Your Pocket. resreport, McKinsey & Co. New York: McKinsey on Semiconductors. Beck, R.A. 2008. Statistical Learning from a Regression Perspective. Springer Series in Statistics. New York: Springer. Bethlehem, J.G. 2002. Survey Nonresponse. In Chapter Weighting Nonresponse Adjustments Based on Auxiliary Information, 275–287. New York: Wiley. Bonett, D.G. 2016. Sample size planning for behavioral science research. http://people.ucsc.edu/~ dgbonett/sample.html. Last accessed August 23, 2017 Box, G., W. Hunter, and J. Hunter. 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. New York: Wiley. Bradburn, N., S. Sudman, and B. Wansink. 2004. Asking Questions: The Definitive Guide to Questionnaire Design for Market Research, Political Polls, and Social and Health Questionnaires, Revised ed. New York: Wiley. Brill, J.E. 2008. Likert scale. In Encyclopedia of Survey Research Methods, ed. Paul J. Lavrakas, 428–429. Beverley Hills: SAGE Publications, Inc. Cameron, A.C. and P.K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University. Chaudhuri, A. and H. Stenger. 2005. Survey Sampling: Theory and Methods. 2nd ed. London: Chapman & Hall/CRC. Christensen, R., W. Johnson, A. Branscum, and T.E. Hanson. 2011. Bayesian Ideas and Data Analysis: An Introduction for Scientists and Statisticians. New York: CRC Press. Clausen, S. 1998. Applied Correspondence Analysis: An Introduction. Beverly Hills: Sage Publications, Inc. Cleveland, W.S. 1994. The Elements of Graphing Data. 2nd ed. New York: Hobart Press. Cochrane, W.G. 1963. Sampling Techniques. 2nd ed. New York: Wiley. Cohen, L., L. Manion, and K. Morrison. 2007. Research Methods in Education. London: Routledge. Cox, D. 2020. Statistical significance. Annual Review of Statistics and Its Application 1: 1–10. Daniel, W.W. 1977. Statistical significance versus practical significance. Science Education 61 (3): 423–427.
Deming, W.E. 1943. Statistical Adjustment of Data. New York: Dover Publications, Inc. Deming, W.E. and F.F. Stephan. 1940. On a least squares adjustment of a sampled frequency table when the expected marginaltotals are known. The Annals of Mathematical Statistics 11 (4): 427–444. Demnati, A. and J.N.K. Rao. 2007. Linearization variance estimators for survey data: Some recent work. In ICES-III. DeVany, A. 1976. Uncertainty, waiting time, and capacity utilization: A stochastic theory of product quality. Journal of Political Economy 84 (3): 523–542. Dorofeev, S. and P. Grant. 2006. Statistics for Real-Life Sample Surveys. Cambridge: Cambridge University Press. Dudewicz, E.J. and S.N. Mishra. 1988. Modern Mathematical Statistics. New York: Wiley. Eagle, A. 2021. Chance versus randomness. In Stanford Encyclopedia of Philosophy. Ellis, S. and H. Steyn. 2003. Practical significance (effect sizes) versus or in combination with statistical significance (p-values). Management Dynamics 12 (4): 51–53. Enders, C.K. 2010. Applied Missing Data Analysis. New York: The Guilford Press. Feller, W. 1950. An Introduction to Probability Theory and Its Applications. Vol. I. New York: Wiley. Feller, W. 1971. An Introduction to Probability Theory and Its Applications. Vol. II. New York: Wiley. Few, S. 2007. Save the pies for dessert. resreport, Perceptual Edge. In Visual Business Intelligence Newsletter. Few, S. 2008. Practical rules for using color in charts. resreport, Perceptual Edge. In Visual Business Intelligence Newsletter. Fischer, R. 2004. Standardization to account for cross-cultural response bias: A classification of score adjustment procedures and review of research in JCCP. Journal of Cross-cultural Psychology 35 (3): 263–282. Fuller, W.A. 2009. Sampling Statistics. New York: Wiley. Funk, C., M. Hefferon, B. Kennedy, C. Johnson. 2019. Trust and Mistrust in Americans’ Views of Scientific Experts, resreport Chapter 4: Americans Generally View Medical Professionals Favorably, but about Half Consider Misconduct a Big Problem. Washington: Pew Research Center. https://www.pewresearch.org/science/2019/08/02/trust-and-mistrust-inamericans-views-of-scientific-experts/. Gelman, A. 2002. Prior distribution. In Encyclopedia of Environmetrics, ed. Abdel H. El-Shaarawi, and Walter W. Piegorsch. Vol. 3, 1634–1637. New York: Wiley. Gelman, A. 2006. Prior distributions for variance parameters inhierarchical models. Bayesian Analysis 1 (3): 515–533. Gelman, A. and J.B. Carlin. 2002. Survey Nonresponse. In Chapter Poststratification and Weighting and Weighting Adjustments, 289–302. New York: Wiley. Gelman, A. and J. Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press. Gelman, A., J. Hill, and A. Vehtari. 2021. Regression and Other Stories. Cambridge: Cambridge University Press. Gill, J. 2008. Bayesian Methods: A Social and Behavioral Sciences Approach, 2nd ed. Statistics in the Social and Behaviorial Sciences. New York: Chapman & Hall/CRC. Grafstrom, A. and L. Schelin. 2014. How to select representative samples. Scandinavian Journal of Statistics 41 (2): 277–290. Greenacre, M.J. 1984. Theory and Applications of Correspondence Analysis. New York: Academic Press. Greenacre, M.J. 2007. Correspondence Analysis in Practice. 2nd ed. New York: Chapman and Hall/CRC. Greene, W.H. 2003 Econometric Analysis, 5th ed. Englewood: Prentice Hall. Groves, R.M., D.A. Dillman, J.L. Eltinge, and R.J.A. Little, eds. 2002. Survey Nonresponse. New York: Wiley. Guenther, W.C. 
1964. Analysis of Variance. Englewood Cliffs: Prentice-Hall, Inc.
Gujarati, D. 2003. Basic Econometrics, 4th ed. New York: McGraw-Hill/Irwin. Haans, H. and E. Gijsbrechts. 2011. “one-deal-fits-all?” on category sales promotion effectiveness in smaller versus larger supermarkets. Journal of Retailing 87 (4): 427–443. Haigh, J. 2012. Probability: A Very Short Introduction. Oxford: Oxford University Press. Hajek, A. 2019. Interpretations of probability. In Stanford Encyclopedia of Philosophy. Hansen, M.H., W.N. Hurwitz, and W.G. Madow. 1953a. Sample Survey Methods and Theory. Methods and Applications. Vol. I. New York: Wiley. Hansen, M.H., W.N. Hurwitz, and W.G. Madow. 1953b. Sample Survey Methods and Theory. Theory. Vol. II. New York: Wiley. Hegarty, M. 2011. The cognitive science of visual-spatial displays: Implications for design. Topics in Cognitive Science 3: 446–474. Hicks, L.E. 1970. Some properties of ipsative, normative, and forced-choice normative measures. Psychological Bulletin 74 (3): 167–184. Higgs, N. 1991. Practical and innovative uses of correspondence analysis. The Statistician 40 (2): 183–194. Hill, R.C., W.E. Griffiths, and G.C. Lim. 2008. Principles of Econometrics, 4th ed. New York: Wiley. Hodge, D.R. and D.F. Gillespie. 2007. Phrase completion scales: A better measurement approach than likert scales? Journal of Social Service Research 33 (4): 1–12. Hogg, R.V. and A.T. Craig. 1970. Introduction to Mathematical Statistics, 3rd ed. New York: Macmillan Publishing Co., Inc. Horvitz, D.G. and D.J. Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47 (260): 663–685. Hunt, J. 2019. Advanced Guide to Python 3 Programming. Berlin: Springer. Hunt, J. 2020. A Beginners Guide to Python3 Programming. Berlin: Springer. Hyndman, R.J. 1996. Computing and graphing highest density regions. The American Statistician 50 (2): 120–126. Jamieson, S. 2004. Likert scales: how to (ab)use them. Medical Education 38: 1217–1218. Jobson, J. 1992. Applied Multivariate Data Analysis. In Categorical and Multivariate Methods. Vol. II. Berlin: Springer. Jost, P. 2021. Where do the less affluent vote? the effect of neighbourhood social context on individual voting intentions in england. Political Studies 69: 1–24. Kastellec, J.P., J.R. Lax, and J. Phillips. 2019. Estimating state public opinion with multilevelregression and poststratification using r. https://scholar.princeton.edu/sites/default/files/ jkastellec/files/mrp_primer.pdf. Kish, L. 1965. Survey Sampling. New York: Wiley. Kmenta, J. 1971. The Elements of Econometrics. New York: The MacMillan Company. Knaub, J.R. 2008. Encyclopedia of Survey Research Methods. In Chapter Finite Population Correction (FPC) Factor, 284–286. Beverley Hills: SAGE Publications, Inc. Kosslyn, S.M. 2006. Graph Design for the Eye and Mind. Oxford: Oxford University Press. Lavallee, P. and F. Beaumont. 2015. Why we should put some weight on weights. Survey Insights: Methods from the Field. http://surveyinsights.org/?p=6255. Leung, S.-O. 2011. A comparison of psychometric properties and normality in 4-, 5-, 6-, and 11point likert scales. Journal of Social Service Research 37 (4): 412–421. Levy, P.S. and S. Lemeshow. 2008. Sampling of Populations: Methods and Applications, 4th ed. New York: Wiley. Liu, M. 2015. Response Style and Rating Scales:The Effects of Data Collection Mode, Scale Format, and Acculturation. phdthesis, Michigan: The University of Michigan. Lohr, S. L. 2009. Sampling: Design and Analysis, 2nd ed. Boston: Cengage Learning. Lumley, T. 
2010. Complex Surveys: A Guide to Analysis Using R. New York: Wiley. Malkiel, B.G. 1999. A Random Walk Down Wall Street, Revised ed. New York: W.W. Norton & Company. Marascuilo, L. 1964. Large-sample multiple comparisons with a control. Biometrics 20: 482–491.
Marascuilo, L. and M. McSweeney. 1967. Nonparametric posthoc comparisons for trend. Psychological Bulletin 67 (6): 401. Martin, N., B. Depaire, and A. Caris. 2018. A synthesized method for conducting a business process simulation study. In 2018 Winter Simulation Conference (WSC), 276–290. Martin, O.A., R. Kumar, and J. Lao. 2022. Bayesian Modeling and Computation in Python. In Textx in Statistical Science. New York: CRC Press. McGrath, J.J. 2007. The other end of the spear: The tooth-to-tail ratio (t3r) in modern military operations. Techreport, Combat Studies Institute Press. The Long War Series Occasional Paper 23. Kansas: Combat Studies Institute Press Fort Leavenworth. McKinney, W. 2018. Python for Data Analysis: Data Wrangling with Pandas, Numpy, and ipython, 2nd ed. Newton: O’Reilly. Mlodinow, L. 2008. The Drunkard’s Walk: How Randomness Rules Our Lives. Vintage Books. Montgomery, D.C., E.A. Peck, and G. Vining. 2012. Introduction to Linear Regression Analysis, 5th ed. New York: Wiley. Moore, D.S. and W.I. Notz. 2014. Statistics: Concepts and Controversies, 8th ed. New Year: W.H. Freeman & Company. Moulton, B.R. 1990. An illustration of a pitfall in estimating the effects of aggregate variables on micro units. The Review of Economics and Statistics 72 (2): 334–338. Neter, J., W. Wasserman, and M.H. Kutner. 1989. Applied Linear Regression Models, 2nd ed. New York York; Richard D. Irwin, Inc. Oakley, J., D. Iacobucci, and A. Duhachek. 2006. Multilevel, hierarchical linear models and marketing: This is not your advisor’s ols model. Review of Marketing Research 2: 203–227. Paczkowski, W.R. 2016. Market Data Analysis Using JMP. New York: SAS Press. Paczkowski, W.R. 2018. Pricing Analytics: Models and Advanced Quantitative Techniques for Product Pricing. Milton Park: Routledge. Paczkowski, W.R. 2020. Deep Data Analytics for New Product Development. Milton Park: Routledge. Paczkowski, W.R. 2022. Business Analytics: Data Science for Business Problems. Berlin: Springer. Pagolu, M.K. and G. Chakraborty. 2011. Eliminating response style segments in survey data via double standardization before clustering. In SAS Global Forum 2011. Data Mining and Text Analytics; Paper 165-2011. Pedlow, S. 2008. Variance estimation. In Encyclopedia of Survey Research Methods, ed. Paul J. Lavrakas, 943–944. Beverley Hills: SAGE Publications, Inc. Peebles, D. and N. Ali. 2015. Expert interpretation of bar and line graphs: The role of graphicacy in reducing the effect of graph format. Frontiers in Psychology 6: Article 1673. Pesaran, M.H., R.G. Pierse, and M.S. Kumar. 1989. Econometric analysis of aggregation in the context of linear prediction models. Econometrica 57 (4): 861–888. Pinker, S. 2021. Rationality: What it is, Why it Seems Scarce, Why it Matters. New York: Viking Press. Potter, F. and Y. Zheng. 2015. Methods and issues in trimming extreme weights in sample surveys. In Proceedings of the Survey Research Methods Section, 2707–2719. New York: American Statistical Association. Ramsey, C.A., and A.D. Hewitt. 2005. A methodology for assessing sample representativeness. Environmental Forensics 6: 71–75. Ray, J.-C., and D. Ray. 2008. Multilevel modeling for marketing: a primer. Recherche et Applications en Marketing 23 (1): 55–77. Rea, L.M., and R.A. Parker. 2005. Designing and Conducting Survey Research: A Comprehensive Guide, 3rd ed. New York: Wiley. Robbins, N.B. 2010. Trellis display. WIREs Computational Statistics 2: 600–605. Robinson, W. 1950. Ecological correlations and the behavior of individuals. 
American Sociological Review 15 (3): 351–357. Rosen, B.L., and A.L. DeMaria. 2012. Statistical significance vs. practical significance: An exploration through health education. American Journal of Health Education 43 (4): 235–241.
Roux, A.V.D. 2002. A glossary for multilevel analysis. Journal of Epidemiology and Community Health 56: 588–594. Safir, A. 2008. Check all that apply. In Encyclopedia of Survey Research Methods, ed. Paul J. Lavrakas, 95. New York: SAGE Publications, Inc. Santos-d’Amorim, K., and M. Miranda. 2021. Misinformation, disinformation, and malinformation: clarifying the definitions and examples in disinfodemic times. Encontros Bibli Revista Eletronica de Biblioteconomia e Ciencia da Informacao 26: 1–23. Sarkar, D. 2016. Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from Your Data. New York: APress. SAS. 2018. SAS/STAT 15.1 User’s Guide. In Chapter 7: Introduction to Bayesian Analysis Procedures, 129–166. North Carolina: SAS Institute Inc. Scheaffer, R.L. 1990. Introduction to Probability and Its Applications. In Advanced Series in Statistics and Decision Sciences. California: Duxbury Press. Sedgewick, R., K. Wayne, and R. Dondero. 2016. Inroduction to Python Programming: An Interdisciplinary Approach. London: Pearson. Shonkwiler, R.W. and F. Mendivil. 2009. Explorations in Monte Carlo Methods. In Undergraduate Texts in Mathematics. Berlin: Springer. Snijders, T.A. and R.J. Bosker. 2012. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, 2nd ed. Beverley Hills: Sage. Sourial, N., C. Wolfson, B. Zhu, J. Quail, J. Fletcher, S. Karunananthan, K. Bandeen-Roche, F. Beland, and H. Bergman. 2010. Correspondence analysis is a useful tool to uncover the relationships among categorical variables. Journal of Clinical Epidemiology 63 (6): 638–646. Tarrance, V.L. 2018. A new regional paradigm for following u.s. elections. Newsletter. Gallup: Polling Matters. Theil, H. 1971. Principles of Econometrics. New York: Wiley. Thompson, S.K. 1992. Sampling. New York: Wiley. Todd, M.J., K.M. Kelley, and H. Hopfer. 2021. Usa mid-atlantic consumer preferences for front labelattributes for local wine. Beverages 7 (22): 1–16. Tufte, E.R. 1983. The Visual Display of Quantitative Information. Cheshire: Graphics Press. VanderPlas, J. 2017. Python Data Science Handbook: Essential Tools for Working with Data. Newton: O’Reilly Media. Voss, D.S., A. Gelman, and G. King. 1995. Preelection survey methodology: Details from eight polling organizations, 1988 and 1992. The Public Opinion Quarterly 59 (1): 98–132. Vyncke, P. 2002. Lifestyle segmentationfrom attitudes, interests and opinions, to values, aesthetic styles, life visions and media preferences. European Journal of Communication 17 (4): 445– 463. Wakefield, J. 2009. Multi-level modelling, the ecologic fallacy,and hybrid study designs. International Journal of Epidemiology 38 (2): 330–336. Weiss, N.A. 2005. Introductory Statistics, 7th ed. Boston: Pearson Education, Inc. White, I.K., and C.N. Laird. 2020. Steadfast Democrats: How Social Forces Shape Black Political Behavior. Princeton Studies in Political Behavior. Princeton: Princeton University Press. Williams, R.L. 2008. Taylor series linearization (tsl). In Encyclopedia of Survey Research Methods, ed. Paul J. Lavrakas, 877–877. Beverley Hills: SAGE Publications, Inc. Wu, H., and S.-O. Leung. 2017. Can likert scales be treated as interval scales? a simulation study. Journal of Social Service Research 43 (4): 527–532. Yamane, T. 1967. Elementary Sampling Theory. Englewood Cliffs: Prentice-Hall, Inc. Ziliak, S.T., and D.N. McCloskey. 2008. 
The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (Economics, Cognition, and Society), 1st ed. Ann Arbor: University of Michigan Press.
Index
A Aggregation, 92–93, 308–310 AIOs, see Attitudes, interests, and opinions (AIOs) Alternative Hypothesis, 116, 117, 119, 130, 131, 134, 144, 151, 184, 246, 272, 292 Among-group sum of squares, 137 Anaconda, xi, 32, 33, 37 Analysis of variance (ANOVA), 132–139, 142, 147, 148, 204–206, 279, 309, 319, 321 Analysis plan, 14, 15, 139 ANOVA, see Analysis of variance (ANOVA) ANOVA table, 132–134, 138, 139, 147, 204–206, 279, 319, 321 Atomistic Fallacy, 310, 311 Attitudes, interests, and opinions (AIOs), v, 6, 13, 27, 29, 30, Attribute importances, 25, 211, 215–217 Autocorrelation, 284–285, 333 Awareness, vi, 5, 6 B Balanced repeated replication (BRR), 240 Balancing, 73 Bar chart, 4, 14, 15, 69, 70, 83, 86, 95, 97, 99–104, 160–163, 174, 198, 199, 219, 254, 312 Bayesian, xi, 251–335 Bayesian statistical approach, 253–259 Bayes’ Rule, ix, 259–265, 269, 279, 323 Bayes, T., 259 Bernoulli distribution, 294, 301–302 Best linear unbiased estimators (BLUE), 181 Best Practice, 36–43, 45, 70, 93, 95, 96, 99, 101
Beta distribution, 263, 300–301 Big Data, 3, 54, 310 Bimodal, 105, 286 Bimodality, 286, 287 Binary questions, 6 Binomial distribution, 292, 301 Biplot, 225 BLUE, see Best linear unbiased estimators (BLUE) Bootstrapping, 240 Boxplot, 98, 102, 105, 107, 159, 160, 232, 233 BRR, see Balanced repeated replication (BRR) Business Data Analytics, 31, 136, 186
C Calibration weights, 75 CATA, see Check all that Apply (CATA) Categorical tree, 220 Categorical variables, 55, 56, 66, 67, 86, 90, 134–136, 159, 161, 197, 214, 244–247 Character strings, 19, 44, 46, 182, 278, 290, 325 Check all that Apply (CATA), 52–54, 61, 62, 147–151, 153–157 Chi-square test, 142, 144, 149, 151, 152 Classical assumptions, 180–181, 274, 325 Cochrane’s Q test, 149–151, 154 Code cells, 32, 37, 38, 40, 45, 283 Coefficient of Variation (CV), 10, 11, 242 Color map, 110 Color palette, 110, 161 Comma Separated Value (CSV), 41, 43–46, 49 Complex sample surveys, ix, 10, 237–240, 244
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 W. R. Paczkowski, Modern Survey Analysis, https://doi.org/10.1007/978-3-030-76267-4
Conda, 33, 37, 47, 147, 163, 224, 225, 269, 270
Condition number, 208
Confidence interval, 10, 142, 184, 242, 255, 287–289
Confidence level, 10, 254
Conjecture, 115–117, 122, 135, 139, 144, 232
Conjoint analysis, 65, 210–217
Contingency table, 73–74, 253, 256, 257, 291–295
Core Questions, 13–16, 18, 25–31, 70, 83, 84, 94, 113, 118, 120, 154, 162, 177–179, 181, 185, 187, 193, 200, 209–211, 220, 225, 252, 256, 291
Correspondence Analysis, 209, 224–228
Cramer's phi statistic, 147, 223
Cross-sectional data, 187
Crosstab, see Tabulations
Cross-tabulation, see Tabulations
CSV, see Comma Separated Value (CSV)
Cult of significance, 115
Cumulative distribution, 190, 331
Cumulative distribution function, 190, 331
CV, see Coefficient of Variation (CV)

D
Data dictionary, 36, 42, 43, 46, 49
Data structure
  hierarchical, 41, 252, 253, 273, 308, 311–313, 318, 322, 323
  multilevel, 252, 311–318
  nested, 308, 310, 330
Data visualization, 4, 32, 37, 43, 56, 94–110, 158, 165, 232, 305, 312–318
Decision tree analysis, 220
Deep Data Analysis, ix, 4, 14, 322
Degrees-of-freedom, 122–125, 127, 134, 137, 138, 144, 152, 170, 173, 183, 184, 205, 318
Demographic questions, 6, 13, 15, 193
Design matrix, 210–214
Design weights, 75, 79
Disaggregation, 308–310
Distributions, see Bernoulli distribution; Beta distribution; Binomial distribution; Cumulative distribution; Half-normal distribution; Normal distribution; Posterior distribution; Prior distribution; Student's t-distribution
Document-term matrix (DTM), 234
DTM, see Document-term matrix (DTM)
Dummify, 67, 189
Dummy values, 19
Dummy variables, 19, 20, 67, 197, 211, 214, 253, 307, 313, 315–316, 318–323, 335
Dummy variable trap, 197, 211, 313, 315, 316
E
Ecological Fallacy, 310, 311
Econometric theory, 214
Effects coding, 136, 139, 211, 214, 215
Elasticities, 7, 25, 181, 185, 207, 210, 214, 215, 261, 270, 272, 304, 307, 310, 319, 321, 322
Equal allocation, 12
Error sum of squares (SSE), 180, 196, 205
Ethnographic study, 271
Event space, 23, 24, 254, 255
Expected value, 8, 114, 138, 166–167, 178, 191, 192, 204, 205, 253, 278, 310

F
Fact finding, 5
Familiarity, vi, 5, 6, 28
Five Number Summary, 68, 87
Fortran, 32
Frequentist, 253–259, 274, 275, 284, 286, 288, 291, 293, 296

G
Gauss-Markov Theorem, 181
Generalized linear model (GLM), 178
Geographic map, 162–165
GLM, see Generalized linear model (GLM)

H
Half-normal distribution, 276, 300–302, 325
HDI, see Highest posterior density interval (HDI)
Header record, 44
Heat map, 102, 109–111, 142, 143
Highest posterior density interval (HDI), 288–290, 328
Histogram, 79, 98, 102, 105–110, 232, 233, 272, 273, 275, 283
Homoskedasticity, 178
Honestly Significant Difference (HSD), 141–143, 151
Horvitz–Thompson estimator, 24
HSD, see Honestly Significant Difference (HSD)
Hyperparameters, 253, 264–265, 275, 276, 324, 325, 334
Hypothesis testing, 90, 114–120, 134, 238, 240, 241, 246–249

I
Identity function, 178, 203
Indicator variable, 20–21, 53, 54, 58, 62, 67, 239
Inertias, 225
Infographics, 14, 160, 232, 234, 261
Information, vii, ix, 2–4, 8, 10–12, 14, 25, 30, 31, 40, 41, 68, 72, 75, 83, 94, 95, 98, 100, 113, 115, 117, 153, 159, 161, 162, 165, 183, 186, 208, 225, 234, 253, 258–265, 278, 296, 305, 309, 323, 325
Informative prior, 262, 264
Instantiation, 124, 182, 240
Intentions, vii, 3, 5, 7, 98, 256, 291, 293–298, 330, 331
ipf, see Iterative proportional fitting (ipf)
Iterative proportional fitting (ipf), 73

J
Jackknife, 240
Jackknife repeated replication, 240
Jupyter, viii, 32–34, 36, 37, 40
Jupyter notebook, viii, ix, 32, 36, 39–41, 93, 95, 96, 99

K
KDA, see Key driver analysis (KDA)
Key driver analysis (KDA), 177, 217
Key performance measures (KPI), 271
KPI, see Key performance measures (KPI)

L
Level of confidence, 8
Likelihood, 56, 66, 98, 191, 192, 196, 198, 208, 212, 217–219, 223, 256, 260–264, 274, 278, 280, 292, 294, 326
Likert Scale questions, 6, 63, 94, 221, 228
Linear regression, 14, 132, 177, 178
Link functions, 178–179, 187, 192, 203–204
List comprehension, 20, 32, 58–60, 175, 194, 195, 217, 219, 224, 242, 290
Logistic regression, ix, 14, 177, 187–200, 202, 220, 265, 304, 328, 330–331
Logit, 65, 178, 179, 187–200, 220, 252, 253, 295–231, 299
Logit link, 178, 187–199
Log-Likelihood, 197, 208
Log-odds, 192, 198, 331
Longitudinal data, 332

M
MAP, see Maximum a posteriori (MAP) estimate
Marascuillo Procedure, 151, 152, 155–157
Marginal homogeneity, 146, 222, 223
Margin of error, 8–12, 16, 26, 240
Markdown cells, 32
Markov Chain Monte Carlo (MCMC), 265–269, 280, 284, 287, 289, 294
Markov Chains, 265–267, 269, 285
Markov Process, 266, 267
Matplotlib, 37, 96–97, 224
Maximum a posteriori (MAP) estimate, 280–282
MCMC, see Markov Chain Monte Carlo (MCMC)
McNemar Test, 146, 147, 149, 222, 224
Memoryless, 266, 267
Memoryless Markov Process, 267
Metacharacters, 230, 231
Metadata, 42, 43, 46–48
Missing data, 14, 51, 131, 193, 229
Missing value code, 28, 44, 49, 51
Missing values, 28, 44, 47–52, 74, 131, 144, 145, 147–150, 193–195, 222, 223, 229, 230, 246, 247
Monte Carlo, 265–269
Monte Carlo simulations, 265–270
Moore's Law, 3
Mosaic chart, 102, 105–109
MultiIndex, 69
Multicollinearity, 136, 208
Multilevel modeling, 253, 304–312, 318, 323–328
Multilevel regression model, 265, 321–323
Multiple responses, 49, 53–54, 195, 220, 221

N
Nested structure, 252, 308
Net promoter score (NPS) analysis, 147, 188, 209, 217–224
Nominal variables, 47, 56, 146, 149
Nonresponse weights, 75, 79
Nonstructural missing values, 50, 51
Normal distribution, 11, 116, 122, 123, 170, 180, 208, 254, 255, 261, 264, 278, 280, 300–301, 313, 325
NPS, see Net promoter score (NPS) analysis
Null hypothesis, 116–119, 122, 124, 125, 127, 131, 134, 137–142, 145, 148, 151, 184, 185, 222, 223, 240, 245, 264, 272, 274, 289–291, 293

O
Odds, 65, 66, 191, 192, 198, 199, 220, 295, 296, 304, 331
Odds ratio, 64, 66, 198, 199, 207, 295–297, 299
OLS, see Ordinary least squares (OLS)
Omitted variable bias, 305–307
One-hot values, 19
One-hot variable, 19
One-stage cluster sampling design, 308
Ordinary least squares (OLS), ix, 178, 182–186, 211, 273–274
Outliers, 105
OVB, see Omitted variable bias (OVB)

P
Package installer for python (pip), 37, 47, 54, 75, 147, 163, 211, 224, 225, 239, 269, 270
Pandas, viii, ix, 32, 37–39, 42–51, 55, 56, 62, 66–69, 85, 87, 89, 90, 92, 93, 95–99, 101, 105, 108, 111, 148, 157, 158, 162, 214, 217, 229, 230, 325
Pandas DataFrames, 37, 45, 47–49, 62, 66, 68, 69, 98, 162, 214
Pandas series, 69, 85
Panel graphs, 162
Part-worths, 211, 215, 217
Pattern identification, 5, 7
PCA, see Principal component analysis (PCA)
Pie chart, ix, 15, 84, 95, 97–102, 149, 150, 160, 189
Pivot table, 153, 154, 157–159, 162, 175, 224
Poisson link, 178, 200–203
Poisson process, 332
Poisson regression, ix, 177, 178, 200–208, 328
Political surveys, 2
Pooled regression, 273, 283, 284, 313, 314, 316–319, 321–323
Pooled sample standard deviation, 127
Population, 8, 66, 111, 114, 189, 237, 254, 323
Population size, 8, 10–12, 18, 111, 135, 237
Posterior, see Distributions; Highest posterior density interval (HDI); Posterior probability
Posterior distribution, 264, 269, 280, 282–284, 286, 288–296, 299, 328
Posterior probability, 260, 261, 264
Post-stratification weights, 75, 79
Power, 3, 8, 30, 184, 192, 265, 306, 309
Preprocessing, 58, 64–67
Price elasticity, 7, 25, 181, 185, 210, 211, 214, 215, 261, 270, 272, 304, 307, 310, 319, 321, 322
Primary sampling units (PSU), 238, 252, 253, 304, 307, 308, 311, 330
Principal component analysis (PCA), 225
Prior, see Informative prior; Prior probability; Uninformative prior
Prior distribution, 262–264, 274–276, 304
Prior probability, 260–262, 264
Private survey, 2
Probability transition matrix, 266
Proportional Allocation, 12
Pseudo-R², 193, 197
PSUs, see Primary sampling units (PSUs)
Public Opinion Study
  San Francisco Airport Customer Satisfaction Survey, 30
  Toronto Casino Opinion Survey, 28–30
Public Sector Study
  VA Benefits Survey, 27–28
p-value, 116, 119, 124, 125, 128, 130–132, 138, 139, 141–143, 147, 148, 151, 152, 183–185, 197, 240, 245, 246, 272, 293
Pyreadstat, 37, 47, 48

Q
Query, 80–82, 162, 164

R
R (language), 31, 32
Raking, 72–77
Random sample
  cluster, 8, 9, 237
  simple, 8–10, 26, 123, 128, 240, 304
  stratified, 8–10, 237, 238, 242, 270, 271
Random walks, 267–269, 285
Reductio ad absurdum, 117
Regression sum of squares (SSR), 183, 184, 205
Regression tree, 220
Regular expression, 230, 231
Repeated measures data, 332
Replicate methods, 240
Representative, 6, 16–22, 70, 72, 79
Representativeness, 16–22, 72
Reproductive property of normals, 136, 170
Russell, B., 259

S
Sample size, 7–17, 19, 23, 26, 27, 29, 51, 69, 71, 73, 85, 95, 99, 122, 127, 131, 134, 135, 146, 147, 214, 229, 239–242, 254, 283, 310
  calculations, 8, 26, 238, 240–241
  optimal, 8
Sample space, 23, 254–256, 258
Sampling units, see Primary sampling units (PSUs); Secondary sampling units (SSUs)
SAS, 32
Scenario analysis, 187
Scientific data visualization, 4, 32, 232
Screener, 12–14, 16
Seaborn, 15, 32, 37, 95–97, 106, 110, 162
Secondary sampling units (SSUs), 238
Shallow Data Analysis, ix, 4, 14, 94, 254
Sidetable, 37, 131
Simple random sampling (SRS), 9, 237, 238, 240, 244
Singular value decomposition (SVD), 225
SPSS, 37, 43, 44, 46–48
SRS, see Simple random sampling (SRS)
SST, see Total sum of squares (SST)
SSUs, see Secondary sampling units (SSUs)
Stackoverflow, 32
Standardization, see Within-item standardization; Within-subject standardization
Standardizing variables, 58
Stata, 32
Statistically significant differences, 114
StatsModels, 32, 37, 111, 124, 128, 136, 144, 146, 182, 196, 203, 208
Stop words, 234
Structural missings, 50, 51
Student's t-distribution, 116, 123, 170–171
Sum of squares, see Among-group; Error sum of squares (SSE); Regression sum of squares (SSR); Total sum of squares (SST); Within-group
Surround Questions, 13, 15, 16, 25–31, 50, 84, 100, 154, 162, 192, 211, 214, 220, 225, 252, 256, 272, 291, 297, 304
Survey design, 4, 238, 304, 308
SVD, see Singular value decomposition (SVD)

T
Tabs, see Tabulations
Tabulations, 80, 83, 89–94, 102, 114, 153–158, 238, 244–247
Taylor series linearization (TSL), 240
t-distribution, see Student's t-distribution
Term frequency-inverse document frequencies (tfidf), 234
Text analysis, 209, 228–235
Text strings, 44
tfidf, see Term frequency-inverse document frequencies (tfidf)
Time series-cross sectional data, 332
Time series forecasting, 187
Top box, 9
Top box of satisfaction, 9
Total preference, 211
Total sum of squares (SST), 132, 133, 183, 205
Total utility, 211, 215
Transition matrix, 266, 267
Trellis graphs, 162
Trend analysis, 5, 6
TSL, see Taylor series linearization (TSL)
Tukey's HSD test, 141–143
Two-stage cluster sampling design, 308
Type I Error, 119
U
Ulam, S., 265
Unimodal, 105, 286
Uninformative prior, 262, 264
Unpooled regression analysis, 313, 321–323

V
von Neumann, J., 265

W
Weight calculation, 70–79
Weighted statistics, 112
Weights, see Design weights; Weight calculation; Weighted statistics
What-if analysis, 187
What-if values, 187
Within-group, 64
Within-group sum of squares, 137
Within-item standardization, 64
Within-subject standardization, 64
Word cloud, 232, 234

Z
Z-statistic, 122, 144, 145