Klaus Backhaus Bernd Erichson Sonja Gensler Rolf Weiber Thomas Weiber
Multivariate Analysis An Application-Oriented Introduction Second Edition
Your bonus with the purchase of this book
With the purchase of this book, you can use our “SN Flashcards” app to access questions free of charge in order to test your learning and check your understanding of the contents of the book. To use the app, please follow the instructions below:
1. Go to https://flashcards.springernature.com/login
2. Create a user account by entering your e-mail address and assigning a password.
3. Use the link provided in one of the first chapters to access your SN Flashcards set.
Your personal SN Flashcards link is provided in one of the first chapters.
If the link is missing or does not work, please send an e-mail with the subject “SN Flashcards” and the book title to [email protected].
Klaus Backhaus University of Münster Münster, Nordrhein-Westfalen, Germany
Bernd Erichson Otto-von-Guericke-University Magdeburg Magdeburg, Sachsen-Anhalt, Germany
Sonja Gensler University of Münster Münster, Nordrhein-Westfalen, Germany
Rolf Weiber University of Trier Trier, Rheinland-Pfalz, Germany
Thomas Weiber Munich, Bayern, Germany
ISBN 978-3-658-40410-9 ISBN 978-3-658-40411-6 (eBook) https://doi.org/10.1007/978-3-658-40411-6 English Translation of the 17th original German edition published by Springer Fachmedien Wiesbaden, Wiesbaden, 2023 © Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2021, 2023 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer Gabler imprint is published by the registered company Springer Fachmedien Wiesbaden GmbH, part of Springer Nature. The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany
Preface 2nd Edition
The new edition of our book “Multivariate Analysis—An Application-Oriented Introduction” has been very well received on the market. This success has motivated us to quickly work on the 2nd edition in English and the 17th edition in German. As far as the content of the book is concerned, the German and English versions are identical. The 2nd English edition differs from the 1st edition in the following aspects:

• The latest version of SPSS (version 29) was used to create the figures.
• Errors in the first edition, which occurred despite extremely careful editing, have been corrected. We are confident that we have now fixed (almost) all major mistakes. A big thank you goes to retired math and physics teacher Rainer Obst, Supervisor of the Freiherr vom Stein Graduate School in Gladenbach, Germany, who read both the English and German versions very meticulously and uncovered quite a few inconsistencies and errors.
• We made one major change in the chapter about the cluster analysis. The application example was adapted to improve the plausibility of the results.

Also for the 2nd edition, research assistants have supported us energetically. Special thanks go to the research assistants at the University of Trier, Mi Nguyen, Lorenz Gabriel and Julian Morgen. They updated the literature, helped to correct errors and adjusted all SPSS figures. They were supported by the student Sonja Güllich, who was instrumental in correcting figures and creating SPSS screenshots. The coordination of the work among the authors as well as with the publisher was again taken over by the research assistant Julian Morgen, who again tirelessly and patiently accepted change requests and again implemented them quickly. Last but not least, we would like to thank Barbara Roscher and Birgit Borstelmann from Springer Verlag for their competent support.

SpringerGabler also provides a set of electronic learning cards (so-called “flashcards”) to help readers test their knowledge. Readers may individualize their own learning environment via an app and add their own questions and answers. Access to the flashcards is provided via a code printed at the end of the first chapter of the book.
We are pleased to present this second edition, which is based on the current version of IBM SPSS Statistics 29 and has been thoroughly revised. Nevertheless, any remaining mistakes are of course the responsibility of the authors.

Münster, Magdeburg, Münster, Trier, Munich
November 2022
Klaus Backhaus Bernd Erichson Sonja Gensler Rolf Weiber Thomas Weiber
Note on Excel Operation
The book Multivariate Analysis provides Excel formulas with semicolon separators. If the regional language setting of your operating system is set to a country that uses periods as decimal separators, we ask you to use comma separators instead of semicolon separators in Excel formulas.
Preface
This is the first English edition of a German textbook that covers methods of multivariate analysis. It addresses readers who are looking for a reliable and application-oriented source of knowledge in order to apply the discussed methods in a competent manner. However, this English edition is not just a simple translation of the German version; rather, we used our accumulated experience gained through publishing 15 editions of the German textbook to prepare a completely new version that translates academic statistical knowledge into an easy-to-read introduction to the most relevant methods of multivariate analysis and is targeted at readers with comparatively little knowledge of mathematics and statistics. The new version of the German textbook with exactly the same content is now available as the 16th edition.

For all methods of multivariate analysis covered in the book, we provide case studies which are solved with IBM’s software package SPSS (version 27). We follow a step-by-step approach to illustrate each method in detail. All examples and case studies use the chocolate market as an example because we assume that every reader will have some affinity to this market and a basic idea of the factors involved in it.

This book constitutes the centerpiece of a comprehensive service offering that is currently being realized. On the website www.multivariate-methods.info, which accompanies this book and offers supplementary materials, we provide a wide range of support services for our readers:

• For each method discussed in this book, we created Microsoft Excel files that allow the reader to conduct the analyses with the help of Excel. Additionally, we explain how to use Excel for many of the equations mentioned in this book. By using Excel, the reader may gain an improved understanding of the different methods.
• The book’s various data sets, SPSS jobs, and figures can also be requested via the website.
• While in the book we use SPSS (version 27) for all case studies, R code (www.r-project.org) is also provided on the website.
• In addition to the SPSS syntax provided in the book, we explain how to handle SPSS in general and how to perform the analyses.
• In order to improve the learning experience, videos with step-by-step explanations of selected problems will be successively published on the website.
• SpringerGabler also provides a set of electronic learning cards (so-called “flashcards”) to help readers test their knowledge. Readers may individualize their own learning environment via an app and add their own questions and answers. Access to the flashcards is provided via a code printed in the book.

We hope that these initiatives will improve the learning experience for our readers. Apart from the offered materials, we will also use the website to inform about updates and, if necessary, to point out necessary corrections.

The preparation of this book would not have been possible without the support of our staff and a large number of research assistants. On the staff side, we would like to thank above all Mi Nguyen (MSc. BA), Lorenz Gabriel (MSc. BA), and Julian Morgen (M. Eng.) of the University of Trier, who supported us with great meticulousness. For creating, editing and proofreading the figures, tables and screenshots, we would like to thank Nele Jacobs (BSc. BA). Daniela Platz (BA), student at the University of Trier, provided helpful hints for various chapters, thus contributing to the comprehensibility of the text. We would also like to say thank you to Britta Weiguny, Phil Werner, and Kaja Banach of the University of Münster. They provided us with help whenever needed. Heartfelt thanks to Theresa Wild and Frederike Biskupski, both students at the University of Münster, who provided feedback to improve the readability of this book. Special thanks go to Julian Morgen (M. Eng.) who was responsible for the entire process of coordination between the authors and SpringerGabler. Not only did he tirelessly and patiently implement the requested changes, he also provided assistance with questions concerning the structure and layout of the chapters. Finally, we would like to thank Renate Schilling for proofreading the English text and making extensive suggestions for improvements and adaptations. Our thanks also go to Barbara Roscher and Birgit Borstelmann of SpringerGabler who supported us continuously with great commitment. Of course, the authors are responsible for all errors that may still exist.

Münster, Magdeburg, Münster, Trier, München
April 2021
Klaus Backhaus Bernd Erichson Sonja Gensler Rolf Weiber Thomas Weiber
www.multivariate-methods.info
On our website www.multivariate-methods.info, we provide additional and supplementary material, publish updates and offer a platform for the exchange among the readers. The website offers the following core services:

Methods: We provide supplementary and additional material (e.g., examples in Excel) for each method discussed in the book.
FAQ: On this page, we post frequently asked questions and the answers.
Forum: The forum offers the opportunity to interact with the authors and other readers of the book. We invite you to make suggestions and ask questions. We will make sure that you get an answer or reaction.
Service: Here you may order all tables and figures published in the book as well as the SPSS data and syntax files. Lecturers may use the material in their classes if the source of the material is appropriately acknowledged.
Corrections: On this page, we inform the readers about any mistakes detected after the publication of the book. We invite all readers to report any mistakes they may find on the Feedback page.
Feedback: Here the authors invite the readers to share their comments on the book and to report any ambiguities or errors by sending a message directly to the authors.
Order form
Sender: Professur für Marketing und Innovation, Univ.-Prof. Dr. Rolf Weiber, Universitätsring, Trier, Germany — Contact: [email protected]

Name/Address: ____________________
E-Mail: ____________________ Phone: ____________________

Subject: Multivariate Analysis — Herewith I order:
• all data sets and SPSS syntax files for all methods covered in this book at a price of EUR ____,
• the complete set of figures for all methods covered in this book at a price of EUR ____,
• the set of figures as read-only PowerPoint files at a price of EUR ____ each for the following chapters:
1 Introduction to statistical data analysis 2 Regression analysis 3 Analysis of variance 4 Discriminant analysis 5 Logistic regression
6 Contingency analysis 7 Factor analysis 8 Cluster analysis 9 Conjoint analysis
The documents will be sent by e-mail. If desired, other means of delivery (e.g., memory stick) are also possible. Please contact the authors at the above address for further information.

_____________________
Date
________________________
Signature

… (r > 0). A negative correlation (r < 0).
• The correlation coefficient does not distinguish between dependent and independent variables. It is therefore a symmetrical measure.
9 Cf. the correlation of binary variables with metrically scaled variables in Sect. 1.1.2.2.
Fig. 1.3 Scatter plots of data sets with different correlations: a) uncorrelated data (r ≈ 0), b) positive correlation (r > 0), c) negative correlation (r < 0), d) nonlinear correlation (r ≈ 0)
Values of –1 or +1 for the correlation coefficient indicate a perfect correlation between the two variables. In this case, all data points in a scatterplot are on a straight line. The following values are often cited in the literature for assessing the magnitude of the correlation coefficient:

• | r | ≥ 0.7: strong correlation
• | r | ≤ 0.3: weak correlation

However, the correlation coefficient must also be evaluated in the context of the application (e.g., individual or aggregated data). For example, in the social sciences, where variables are often influenced by human behavior and many other factors, a lower value may already be regarded as a strong correlation, whereas in the natural sciences much higher values are generally expected. Another way to assess the relevance of a correlation coefficient is to perform a statistical significance test that takes the sample size into account. The t-statistic or the F-statistic may be used for this purpose:10
10 For statistical testing, also see Sect. 1.3.
t = r / √((1 − r²)/(N − 2))    F = r² / ((1 − r²)/(N − 2))
with r = correlation coefficient, N = number of cases in the data set, and df = N − 2. We can now derive the corresponding p-value (cf. Sect. 1.3.1.2).11
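Readers who prefer to script this computation can do so in a few lines. The following minimal Python sketch is ours and not part of the book (which works with Excel and SPSS); it assumes the scipy library and reproduces the t- and F-statistic and the two-sided p-value for a Pearson correlation.

```python
import numpy as np
from scipy import stats

def correlation_significance(x, y):
    """t- and F-statistic plus two-sided p-value for a Pearson correlation."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    N = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t = r / np.sqrt((1 - r**2) / (N - 2))   # t-statistic with df = N - 2
    F = r**2 / ((1 - r**2) / (N - 2))       # F-statistic (equals t squared)
    p = 2 * stats.t.sf(abs(t), df=N - 2)    # two-sided p-value
    return r, t, F, p
```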
1.3 Statistical Testing and Interval Estimation
Data usually contain sampling and/or measurement errors. We distinguish between random errors and systematic errors:

• Random errors change unpredictably between measurements. They scatter around a true value and often follow a normal distribution (central limit theorem).12
• Systematic errors are constant over repeated measurements. They consistently over- or underestimate a true value (also called bias). Systematic errors result from deficiencies in measuring or non-representative sampling.

Random errors, e.g. in sampling, are not avoidable, but their amount can be calculated based on the data, and they can be diminished by increasing the sample size. Systematic errors cannot be calculated and they cannot be diminished by increasing the sample size, but they are avoidable. For this purpose, they first have to be identified.
As statistical results always contain random errors, it is often not clear if an observed result is ‘real’ or has just occurred randomly. To check this, we can use statistical testing (hypothesis testing). The test results may be of great importance for decision-making. Statistical tests come in many forms, but the basic principle is always the same. We start with a simple example, the test for the mean value.
11 The p-value may be calculated in Excel as follows: p = TDIST(ABS(t);N−2;2) or p = 1−F.DIST(F;1;N−2;1).
12 The central limit theorem states that the sum or mean of n independent random variables tends toward a normal distribution if n is sufficiently large, even if the original variables themselves are not normally distributed. This is the reason why a normal distribution can be assumed for many phenomena.
1.3.1 Conducting a Test for the Mean
The considerations in this section are illustrated by the following example:

Example
The chocolate company Choco Chain measures the satisfaction of its customers once per year. Randomly selected customers are asked to rate their satisfaction on a 10-point scale, from 1 = “not at all satisfied” to 10 = “completely satisfied”. Over the last years, the average index was 7.50. This year’s survey yielded a mean value of 7.30 and the standard deviation was 1.05. The sample size was N = 100. Now the following question arises: Did the difference of 0.2 only occur because of random fluctuation or does it indicate a real change in customer satisfaction? To answer this question, we conduct a statistical test for the mean. ◄
1.3.1.1 Critical Value Approach
The classical procedure of statistical hypothesis testing may be divided into five steps:
1. formulation of hypotheses,
2. computation of a test statistic,
3. choosing an error probability α (significance level),
4. deriving a critical test value,
5. comparing the test statistic with the critical test value.

Step 1: Formulation of Hypotheses
The first step of statistical testing involves the stating of two competing hypotheses, a null hypothesis H0 and an alternative hypothesis H1:
• null hypothesis: H0: µ = µ0
• alternative hypothesis: H1: µ ≠ µ0
where µ0 is an assumed mean value (the status quo) and µ is the unknown true mean value. For our example with µ0 = 7.50 we get:
• null hypothesis: H0: µ = 7.50
• alternative hypothesis: H1: µ ≠ 7.50
The null hypothesis expresses a certain assumption or expectation of the researcher. In our example, it states: “satisfaction has not changed, the index is still 7.50”. It is also called the status quo hypothesis and can be interpreted as “nothing has changed” or, depending on the problem, as “no effect”. Hence the name “null hypothesis”.
The alternative hypothesis states the opposite. In our example, it means: “satisfaction has changed”, i.e., it has increased or decreased. Usually it is this hypothesis that is of primary interest to the researcher, because its acceptance often requires some action. It is also called the research hypothesis. The alternative hypothesis is accepted or “proven” by rejecting the null hypothesis.

Step 2: Computation of a Test Statistic
For testing our hypotheses regarding customer satisfaction, we calculate a test statistic. For testing a mean, we calculate the so-called t-statistic. The t-statistic divides the difference between the observed and the hypothetical mean by the standard error SE of the mean. The empirical value of the t-statistic is calculated as follows:
temp = (x̄ − µ0) / SE(x̄) = (x̄ − µ0) / (sx/√N)    (1.9)
with
x̄ = mean of variable x
µ0 = assumed value of the mean
sx = standard deviation of variable x
SE(x̄) = standard error of the mean
N = number of cases in the data set

For our example we get:
temp = (7.3 − 7.5) / (1.05/√100) = −0.2 / 0.105 = −1.90
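The same arithmetic can be scripted. The following short Python sketch is ours (the book itself uses Excel and SPSS) and simply reproduces Eq. (1.9) from the summary statistics of the example.

```python
import math

def t_statistic(mean, mu0, sd, n):
    """One-sample t-statistic (mean - mu0) / (sd / sqrt(n)), cf. Eq. (1.9)."""
    return (mean - mu0) / (sd / math.sqrt(n))

t_emp = t_statistic(mean=7.3, mu0=7.5, sd=1.05, n=100)
print(round(t_emp, 2))   # -1.9
```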
Under the assumption of the null hypothesis, the t-statistic follows Student’s t-distribution with N−1 degrees of freedom. Figure 1.4 shows the density function of the t-distribution and the value of our test statistic. If the null hypothesis were true, we would expect temp = 0 or close to 0. Yet we get a value of −1.9 (i.e., 1.9 standard deviations away from zero), and the probability of such a test result decreases fast with the distance from zero.

Fig. 1.4 t-distribution and empirical t-value (df = 99)

The t-distribution is symmetrically bell-shaped around zero and looks very similar to the standard normal distribution, but has broader tails for small sample sizes. With increasing sample size, the tails of the distribution get slimmer and the t-distribution approaches the standard normal distribution. For our sample size, the t-distribution and the standard normal distribution are almost identical.

Step 3: Choosing an Error Probability
As statistical results always contain random errors, the null hypothesis cannot be rejected with certainty. Thus, we have to specify an error probability of rejecting a true
null hypothesis. This error probability is denoted by α and is also called the level of significance. Thus, the error probability α that we choose should be small, but not too small. If α is too small, the research hypothesis can never be ‘proven’. Common values for α are 5%, 1% or 0.1%, but other values are also possible. An error probability of α = 5% is most common, and we will use this error probability for our analysis.

Step 4: Deriving a Critical Test Value
With the specified error probability α, we can derive a critical test value that can serve as a threshold for judging the test result. Since in our example the alternative hypothesis is undirected (i.e., positive and negative deviations are possible), we have to apply a two-tailed t-test with two critical values: −tα/2 in the left (lower) tail and tα/2 in the right (upper) tail (see Fig. 1.5). Since the t-distribution is symmetrical, the two values are equal in size. The area beyond the critical values is α/2 on each side, and it is called the rejection region. The area between the critical values is called the acceptance region of the null hypothesis.
Fig. 1.5 t-distribution and critical values for α = 5% (df = 99)
Table 1.12 Extract from the t-table

        Error probability α
df      0.10      0.05      0.01
1       6.314     12.706    63.657
2       2.920     4.303     9.925
3       2.353     3.182     5.841
4       2.132     2.776     4.604
5       2.015     2.571     4.032
10      1.812     2.228     3.169
20      1.725     2.086     2.845
30      1.697     2.042     2.750
40      1.684     2.021     2.704
50      1.676     2.009     2.678
99      1.660     1.984     2.626
∞       1.645     1.960     2.576
The critical value for a given α value and degrees of freedom (df = N–1) may be taken from a t-table or calculated by using a computer. Table 1.12 shows an extract from the t-table for different values of α and degrees of freedom.
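Today the critical values can also be computed directly instead of being looked up. A minimal Python sketch (ours; the footnotes show the corresponding Excel function T.INV.2T), assuming scipy:

```python
from scipy import stats

def t_critical_two_tailed(alpha, df):
    """Two-tailed critical value t_{alpha/2}, as tabulated in Table 1.12."""
    return stats.t.ppf(1 - alpha / 2, df)

print(round(t_critical_two_tailed(0.05, 99), 3))   # 1.984
```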
For our example, we get:13

tα/2 = 1.984 ≈ 2

Step 5: Comparing the Test Statistic with the Critical Test Value
If the test statistic exceeds the critical value, H0 can be rejected at a statistical significance level of α = 5%. The rules for rejecting H0 can be formulated as:

Reject H0 if |temp| > tα/2; do not reject H0 if |temp| ≤ tα/2.    (1.10)

Here we cannot reject H0, since |temp| = 1.9 < 2. This means the result of the test is not statistically significant at α = 5%.
Interpretation
It is important to note that accepting the null hypothesis does not mean that its correctness has been proven. H0 usually cannot be proven, nor can we infer a probability for the correctness of H0. In a strict sense, H0 is usually “false” in a two-tailed test. If a continuous scale is used for measurement, the difference between the observed value and µ0 will practically never be exactly zero. The real question is not whether the difference is zero, but how large it is. In our example, it is very unlikely that the present satisfaction index is exactly 7.50, as stated by the null hypothesis. We have to ask whether the difference is sufficiently large to conclude that customer satisfaction has actually changed. The null hypothesis is just a statement that serves as a reference point for assessing a statistical result. Every conclusion we draw from the test must be conditional on H0.
Thus, for a test result |temp| > 2, we can conclude:
• Under the condition of H0, the probability that the test result has occurred solely by chance is less than 5%. Thus, we reject the proposition of H0.
Or, as in our example, for the test result |temp| ≤ 2, we can conclude:
• Under the condition of H0, the probability that this test result has occurred by chance is larger than the error probability of 5% that we require. Thus, we do not have sufficient reason to reject H0.
13 In Excel we can calculate the critical value tα/2 for a two-tailed t-test by using the function T.INV.2T(α,df). We get: T.INV.2T(0.05,99) = 1.98. The values in the last line of the t-table are identical with the standard normal distribution. With df = 99 the t-distribution comes very close to the normal distribution.
The aim of a hypothesis test is not to prove the null hypothesis. Proving the null hypothesis would not make sense. If this were the aim, we could prove any null hypothesis by making the error probability α sufficiently small. The hypothesis of interest is the research hypothesis. The aim is to “prove” (be able to accept) the research hypothesis by rejecting the null hypothesis. For this reason, the null hypothesis has to be chosen as the opposite of the research hypothesis.

1.3.1.2 Using the p-value
The test procedure may be simplified by using a p-value approach instead of the critical value approach. The p-value (probability value) for our empirical t-statistic is the probability of observing a t-value more distant from the null hypothesis than our temp if H0 is true:

p = P(|t| ≥ |temp|)    (1.11)

This is illustrated in Fig. 1.6: p = P(|t| ≥ 1.9) = 0.03 + 0.03 = 0.06 or 6%.14 Since the t-statistic can assume negative or positive values, the absolute value has to be considered for the two-sided t-test and we get probabilities in both tails.

Fig. 1.6 p-value p = 6% (for a two-sided t-test with df = 99)
14 In Excel we can calculate the p-value by using the function T.DIST.2T(ABS(temp);df). For the variable in our example we get: T.DIST.2T(ABS(−1.90);99) = 0.0603 or 6.03%.
Table 1.13 Test results and errors

                     Reality
Test result          H0 is true                              H0 is false
H0 is accepted       correct decision (1−α)                  type II error (β)
H0 is rejected       type I error (α); significance level    correct decision (1−β); power
The p-value is also referred to as the empirical significance level. In SPSS, the p-value is called “significance” or “sig”. It tells us the exact significance level of a test statistic, while the classical test only gives us a “black and white” picture for a given α. A large p-value supports the null hypothesis, but a small p-value indicates that the probability of the test statistic is low if H0 is true. So probably H0 is not true and we should reject it. We can also interpret the p-value as a measure of plausibility. If p is small, the plausibility of H0 is small and it should be rejected. And if p is large, the plausibility of H0 is large.
By using the p-value, the test procedure is simplified considerably. It is not necessary to start the test by specifying an error probability (significance level) α. Furthermore, we do not need a critical value and thus no statistical table. (Before the development of computers, these tables were necessary because the computing effort for critical values as well as for p-values was prohibitive.) Nevertheless, some people like to have a benchmark for judging the p-value. If we use α as a benchmark for p, the following criterion will give the same result as the classical t-test according to the rule in Eq. (1.10):
If p < α, reject H0.    (1.12)
Since in our example p = 6%, we cannot reject H0. But even if α is used as a benchmark for p, the problem of choosing the right error probability remains.
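A minimal Python sketch (ours, assuming scipy) reproduces the p-value of the example; it mirrors the Excel function T.DIST.2T given in the footnote above.

```python
from scipy import stats

t_emp, df = -1.90, 99
p = 2 * stats.t.sf(abs(t_emp), df)   # two-sided p-value, cf. Fig. 1.6
print(round(p, 4))                   # 0.0603, i.e., about 6% > alpha = 0.05
```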
1.3.1.3 Type I and Type II Errors
There are two kinds of errors in hypothesis testing. So far, we considered only the error of rejecting the null hypothesis although it is true. This error is called type I error and its probability is α (Table 1.13). A different error occurs if a false null hypothesis is accepted. This refers to the upper right quadrant in Table 1.13. This error is called type II error and its probability is denoted by β. The size of α (i.e., the significance level) is chosen by the researcher. The size of β depends on the true mean µ, which we do not know, and on α (Fig. 1.7). By decreasing α, the probability β of the type II error increases.
Fig. 1.7 Type II error β (dependent on α and μ)
The probability (1−β) is the probability that a false null hypothesis is rejected (see lower right quadrant in Table 1.13), which is what we want. This is called the power of a test and it is an important property of a test. By decreasing α, the power of the test also decreases. Thus, there is a tradeoff between α and β. As already mentioned, the error probability α should not be too small; otherwise the test loses its power to reject H0 if it is false. Both α and β can only be reduced by increasing the sample size N.

How to Choose α
The value of α cannot be calculated or statistically justified; it must be determined by the researcher. For this, the researcher should take into account the consequences (risks and opportunities) of alternative decisions. If the costs of a type I error are high, α should be small. Alternatively, if the costs of a type II error are high, α should be larger, and thus β smaller. This increases the power of the test.
In our example, a type I error would occur if the test falsely concluded that customer satisfaction has significantly changed although it has not. A type II error would occur if customer satisfaction had changed, but the test failed to show this (because α was set too low). In this case, the manager would not receive a warning if satisfaction had decreased, and he would miss out on taking corrective actions.
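The tradeoff between α and β can be made tangible with a small power calculation. The sketch below is ours and not from the book; it assumes scipy and a hypothetical true mean of 7.2 for the satisfaction example, and computes the power of the two-tailed test via the noncentral t-distribution.

```python
import numpy as np
from scipy import stats

def power_two_tailed(mu_true, mu0, sd, n, alpha=0.05):
    """Power (1 - beta) of the two-tailed one-sample t-test."""
    df = n - 1
    nc = (mu_true - mu0) / (sd / np.sqrt(n))     # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    beta = stats.nct.cdf(t_crit, df, nc) - stats.nct.cdf(-t_crit, df, nc)
    return 1 - beta

# Hypothetical scenario: true index 7.2 instead of 7.5; a smaller alpha lowers the power.
print(power_two_tailed(7.2, 7.5, 1.05, 100, alpha=0.05))
print(power_two_tailed(7.2, 7.5, 1.05, 100, alpha=0.01))
```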
1.3.1.4 Conducting a One-tailed t-test for the Mean
As the t-distribution has two tails, there are two forms of a t-test: a two-tailed t-test, as illustrated above, and a one-tailed t-test. A one-tailed t-test offers greater power and should be used whenever possible, since smaller deviations from zero are statistically significant and thus the risk of a type II error (accepting a wrong null hypothesis) is
reduced. However, conducting a one-tailed test requires some more reasoning and/or a priori knowledge on the side of the researcher. A one-tailed t-test is appropriate if the test outcome has different consequences depending on the direction of the deviation. If in our example the satisfaction index has remained constant or even improved, no action is required. But if the satisfaction index has decreased, management should be worried. It should investigate the reason and take action to improve satisfaction. The research question of the two-tailed test was: “Has satisfaction changed?” For the one-tailed test, the research question is: “Has satisfaction decreased?”. Thus, we have to ‘prove’ the alternative hypothesis
H1: µ < 7.5

by rejecting the null hypothesis

H0: µ ≥ 7.5

which states the opposite of the research question. The decision criterion is:

Reject H0 if temp < tα.    (1.13)
Note that in our example tα is negative. The rejection region is now only in the lower tail (left side) and the area under the density function in this tail is twice as large. The critical value for α = 5% is tα = –1.66 (Fig. 1.8).15 As this value is closer to H0 than the critical value tα/2 = 1.98 for the two-tailed test, a smaller deviation from H0 is significant. The empirical test statistic temp = –1.9 is now in the rejection region on the lower tail. Thus, H0 can be rejected at the significance level α = 5%. With the more powerful one-tailed test we can now “prove” that customer satisfaction has decreased.

Using the p-value
When using the p-value, the decision criterion is the same as before in Eq. (1.12): If p < α, reject H0. But the one-tailed p-value here is just half the two-tailed p-value in Eq. (1.12). Thus, if we know the two-tailed p-value, it is easy to calculate the one-tailed p-value. As we got p = 6% for the two-tailed test, the p-value for the one-tailed test is p = 3%. This is clearly below α = 5%.16
15 In Excel we can calculate the critical value tα for the lower tail by using the function T.INV(α;df). We get: T.INV(0.05;99) = –1.66. For the upper tail we have to switch the sign or use the function T.INV(1–α;df).
16 In Excel we can calculate the p-value for the left tail by using the function T.DIST(temp;df;1). We get: T.DIST(−1.90;99;1) = 0.0302 or 3%. The p-value for the right tail is obtained by the function T.DIST.RT(temp;df).
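The one-tailed decision can likewise be scripted; this short Python sketch (ours, assuming scipy) reproduces the critical value and the lower-tail p-value given in the two footnotes above.

```python
from scipy import stats

t_emp, df, alpha = -1.90, 99, 0.05

t_crit_lower = stats.t.ppf(alpha, df)   # about -1.66, cf. footnote 15
p_one_tailed = stats.t.cdf(t_emp, df)   # about 0.03, cf. footnote 16

print(t_emp < t_crit_lower)             # True -> reject H0: mu >= 7.5
```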
Fig. 1.8 t-distribution and critical value for a one-tailed test (α = 5%, df = 99)
1.3.2 Conducting a Test for a Proportion
We use proportions or percentages instead of mean values to describe nominally scaled variables. Testing a hypothesis about a proportion follows the same steps as a test for the mean. For a two-sided test of a proportion, we state the hypotheses as follows:
• null hypothesis: H0: π = π0
• alternative hypothesis: H1: π ≠ π0
where π0 is an assumed proportion value and π is the unknown true proportion. If we denote the empirical proportion by prop, the test statistic is calculated by:
zemp = (prop − π0) / (σ/√N)    (1.14)

with
prop = empirical proportion
π0 = assumed proportion value
σ = standard deviation in the population
N = number of cases in the data set
The standard deviation in the population can be calculated as follows:

σ = √(π0 (1 − π0))    (1.15)
If the null hypothesis is true, the standard deviation of the proportion can be derived from π0. For this reason, we can use the standard normal distribution instead of the t-distribution for calculating critical values and p-values. This simplifies the procedure. But for N ≥ 100 it makes no difference whether we use the normal distribution or the t-distribution (see Table 1.12).

Example
The chocolate company ChocoChain knows from regular surveys on attitudes and lifestyles that 10% of its customers are vegetarians. In this year’s survey x = 52 customers stated that they are vegetarians. With a sample size of N = 400 this amounts to a proportion prop = x/N = 0.13 or 13%. Does this result indicate a real increase or is it just a random fluctuation? For π0 = 10% we get σ = 0.30 and, with Eq. (1.14), we get the test statistic:

zemp = (0.13 − 0.10) / (0.30/√400) = 2.00
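A minimal Python sketch (ours, assuming scipy) reproduces this test for the proportion of vegetarians.

```python
import math
from scipy import stats

def proportion_z_test(prop, pi0, n):
    """z-statistic and two-tailed p-value for a proportion, cf. Eqs. (1.14) and (1.15)."""
    sigma = math.sqrt(pi0 * (1 - pi0))        # standard deviation under H0
    z = (prop - pi0) / (sigma / math.sqrt(n))
    p_two_tailed = 2 * stats.norm.sf(abs(z))
    return z, p_two_tailed

z, p = proportion_z_test(prop=0.13, pi0=0.10, n=400)
print(round(z, 2), round(p, 4))   # 2.0 0.0455
```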
A rule of thumb says that an absolute value ≥ 2 of the test statistic is significant at α = 5%. So we can conclude without any calculation that the proportion of vegetarians has changed significantly. The exact critical value for the standard normal distribution is zα/2 = 1.96. The two-tailed p-value for zemp = 2.0 is 4.55% and thus smaller than 5%. If our research question is: “Has the proportion of vegetarians increased?”, we can perform a one-tailed test with the hypotheses

H0: π ≤ π0 = 10%
H1: π > π0
In this case, the critical value will be 1.64 and the one-tailed p-value will be 2.28%, which is clearly lower than 5%. Thus, the result is highly significant. ◄

Accuracy Measures of Binary Classification Tests
Tests with binary outcomes are very frequent, e.g. in medical testing (sick or healthy, pregnant or not) or quality control (meets specification or not). To judge the accuracy of such tests, certain proportions (or percentages) of the two test outcomes are used, called sensitivity and specificity.17 These measures are common in medical research, epidemiology, or machine learning, but still widely unknown in other areas.
17 Cf., e.g., Hastie et al. (2011); Pearl and Mackenzie (2018); Gigerenzer (2002).
Table 1.14 Measures of accuracy in medical testing

Test result   No disease                                          Disease
Negative      specificity = true negative (1−α)                   1 − sensitivity = false negative (β)
Positive      1 − specificity = false positive (α); false alarm   sensitivity = true positive (1−β); power
In medical testing these measures have to be interpreted as follows:

• Sensitivity = percentage of “true positives”, i.e., the test will be positive if the patient is sick (disease is correctly recognized).
• Specificity = percentage of “true negatives”, i.e., the test will be negative if the patient is not sick.

For an example, we can look at the accuracy of the swab tests (RT-PCR-tests) used early in the 2020 corona pandemic for testing people for infection with SARS-CoV-2. The British Medical Journal (Watson et al. 2020) reported a test specificity of 95%, but a sensitivity of only 70%. This means that out of 100 persons infected with SARS-CoV-2, the test was falsely negative for 30 people. Not knowing about their infection, these 30 people contributed to the rapid spreading of the disease.
In Sect. 1.3.1.1 we discussed type I and type II errors (α and β) in statistical testing. These errors can be seen as inverse measures of accuracy. There is a close correspondence to specificity and sensitivity. Assuming “no disease” as the null hypothesis, Table 1.14 shows the correspondence of these measures of accuracy to the error types in statistical testing. The test sensitivity of 70% corresponds to the power of the test and the “falsely negative” rate of 30% corresponds to the β-error (type II error).
Measures of sensitivity and specificity can be used for results of cross tables (in contingency analysis), discriminant analysis, and logistic regression. In Chap. 5 on logistic regression, we will give further examples of the calculation and application of these measures.
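Computed from a 2×2 table of counts, the two measures are simple ratios. The following sketch is ours and uses hypothetical counts chosen to match the reported RT-PCR figures.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts: of 100 infected persons, 70 test positive;
# of 100 healthy persons, 95 test negative.
sens, spec = sensitivity_specificity(tp=70, fn=30, tn=95, fp=5)
print(sens, spec)   # 0.7 0.95
```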
1.3.3 Interval Estimation (Confidence Interval)
Interval estimation and statistical testing are part of inferential statistics and are based on the same principles.

Interval Estimation for a Mean
We come back to ChocoChain’s measurement of satisfaction with the mean value x̄ = 7.30. This value can be considered a point estimate of the true mean μ, which we
do not know. It is the best estimate we can get for the true value μ. However, since x is a random variable, we cannot expect it to be equal to μ. But we can state an interval around x within which we expect the true mean μ with a certain error probability α (or confidence level 1–α):
µ = x̄ ± error

This interval is called an interval estimate or confidence interval for μ. Again, we can use the t-distribution to determine this interval:

µ = x̄ ± tα/2 · sx/√N    (1.16)
We use the same values as we used above for testing: tα/2 = 1.98, sx = 1.05 and N = 100, and we get:
µ = 7.30 ± 1.98 · 1.05/√100 = 7.30 ± 0.21

Thus, with a confidence of 95%, we can expect the true value μ to be in the interval between 7.09 and 7.51:
7.09 ≤ µ ≤ 7.51

The smaller the error probability α (or the greater the confidence 1−α), the greater the interval must be. Thus, for α = 1% (or confidence 1−α = 99%), the confidence interval is [7.02, 7.58]. We can also use the confidence interval for testing a hypothesis. If our null hypothesis µ0 = 7.50 falls into the confidence interval, it is equivalent to the test statistic falling into the acceptance region. This is an alternative way of testing a hypothesis. Again, we cannot reject H0, just as in the two-tailed test above.

Interval Estimation for a Proportion
Analogously, we can estimate a confidence interval for a proportion. The survey yielded a proportion of prop = 13%. We can compute the confidence interval for the true value π as:
π = prop ± zα/2 · σ/√N    (1.17)
Using the same values as above for testing: zα/2 = 1.96, σ = 0.30 and N = 400, we get:
π = 13.0 ± 1.96 · 0.30/√400 = 13.0 ± 2.94

Thus, with a confidence of 95%, we can expect the true value π to be in the interval between 10.06 and 15.94. As π0 = 10% does not fall into this interval, we can again reject the null hypothesis, as before.
If we do not know σ, we have to estimate it based on the proportion prop as

s = √(prop (1 − prop))    (1.18)
In this case, we have to use the t-distribution and calculate the confidence interval as:
π = prop ± tα/2 · s/√N    (1.19)
We get
π = 13.0 ± 1.97 · 0.336/√400 = 13.0 ± 3.31

The confidence interval increases to [9.69, 16.31].
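Both interval estimates can be reproduced with a few lines of Python (our sketch, assuming scipy; not part of the book).

```python
import math
from scipy import stats

def ci_mean(mean, sd, n, conf=0.95):
    """Confidence interval for a mean based on the t-distribution, cf. Eq. (1.16)."""
    t = stats.t.ppf(1 - (1 - conf) / 2, n - 1)
    half = t * sd / math.sqrt(n)
    return mean - half, mean + half

def ci_proportion(prop, n, conf=0.95):
    """Confidence interval for a proportion with estimated s, cf. Eqs. (1.18) and (1.19)."""
    s = math.sqrt(prop * (1 - prop))
    t = stats.t.ppf(1 - (1 - conf) / 2, n - 1)
    half = t * s / math.sqrt(n)
    return prop - half, prop + half

print(ci_mean(7.30, 1.05, 100))   # about (7.09, 7.51)
print(ci_proportion(0.13, 400))   # about (0.097, 0.163), i.e., 9.7% to 16.3%
```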
1.4 Causality
A causal relationship is a relationship that has a direction. For two variables X and Y, it can be formally expressed by

X (cause) → Y (effect)
This means: If X changes, Y changes as well. Thus, changes in Y are caused by changes in X. However, this does not mean that changes in X are the only cause of changes in Y. If X is the only cause of changes in Y, we speak of a mono-causal relationship. But we often face multi-causal relationships, which makes it difficult to find and prove causal relationships (cf. Freedman, 2002; Pearl & Mackenzie, 2018).
1.4.1 Causality and Correlation
Finding and proving causal relationships is a primary goal of all empirical (natural and social) sciences. Statistical association or correlation plays an important role in pursuing this goal. But causality is not a statistical construct, and concluding causality from an association or correlation can be very misleading. Data contain no information about causality. Thus, causality cannot be detected or proven by statistical analysis alone. To infer or prove causality, we need information about the data generation process and causal reasoning. The latter is something that computers or artificial intelligence are still lacking. Causality is a conclusion that must be drawn by the researcher. Statistical methods can only support our conclusions about causality.
There are many examples of significant correlations that do not imply causality. For instance, high correlations were found for
• number of storks and birth rate (1960–1990),
• reading skills of school children and shoe size,
• crop yield of hops and beer consumption,
• ice cream sales and rate of drowning,
• divorce rate in Maine and per capita consumption of margarine,
• US spending on science, space, and technology versus suicides by hanging, strangulation, and suffocation.
Such non-causal correlations between two variables X and Y are also called spurious correlations. They are often caused by a lurking third variable Z that is simultaneously influencing X and Y. This third variable Z is also called a confounding variable or confounder. It is causally related to X and Y. But often we cannot observe such a confounding variable or do not even know about it. Thus, the confounder can cause misinterpretations. The strong correlation between the number of storks and the birth rate that was observed in the years from 1966 to 1990 was probably caused by the growing industrial development combined with prosperity. For the reading skills of school children and their shoe size, the confounder is age. For crop yield of hops and beer consumption, the confounder is probably the number of sunny hours. The same may be true for the relationship between ice cream sales and the rate of drowning. If it is hot, people eat more ice cream and more people go swimming. If more people go swimming, more people will drown.
1.4.2 Testing for Causality
To support the hypothesis of a causal relationship, various conditions should be met (Fig. 1.9).
Fig. 1.9 Testing for causality
Condition 1: Correlation Coefficient
The correlation coefficient can be positive or negative. A positive sign means that Y will increase if X increases, and a negative sign indicates the opposite: Y will decrease if X increases. Thus, the researcher should not only hypothesize that a causal relationship exists, but should also state in advance (before the analysis) if it is a positive or a negative one. For example, when analyzing the relationship between chocolate sales and price, we would expect a negative correlation, and when analyzing the relationship between chocolate sales and advertising, we would expect a positive correlation. Of course, a positive relationship between price and sales is also possible (e.g., for luxury goods or if the price is used as an indicator for quality), but these are rare exceptions and usually do not apply to most fast-moving consumer goods (FMCG). Also, a negative effect of advertising (wear-out effect) has rarely been observed. Therefore, an unexpected sign of the correlation coefficient should make us skeptical.
If there is a causal relationship between the two variables X and Y, a substantial correlation is expected. If there is no correlation or the correlation coefficient is very small (close to zero), there is probably no causality, or the causality is weak and irrelevant. In assessing the correlation coefficient, one must also consider the number of observations (sample size). This can be done by performing a statistical test of significance, either a t-test or an F-test.

Condition 2: Temporal Ordering
A causal relationship between two variables X and Y can always have two different directions:
a) X is a cause of Y: X → Y
b) Y is a cause of X: Y → X
For the correlation coefficient it makes no difference whether we have situation a) or b). Thus, a significant correlation is no sufficient proof of the hypothesized causal relationship a). A cause must precede the effect, and thus changes in X must precede corresponding changes in Y. If this is not the case, the above hypothesis is wrong. In an experiment, this may be easily verified. The researcher changes X and checks for changes in Y. Yet if one has only observational data, it is often difficult or impossible to check the temporal order. We can do so if we have time series data and the observation periods are shorter than the lapse of time between cause and effect (time lag).
Referring to our example, the time lag between advertising and sales depends on the type of product and the type of media used for advertising. The time lag will be shorter for FMCG like chocolate or toothpaste and longer for more expensive and durable goods (e.g., TV set, car). Also, the time lag will be shorter for TV or radio advertising than for advertising in magazines. For advertising, the effects are often dispersed over several periods (i.e., distributed lags).
In case of a sufficiently large time lag (or sufficiently short observation periods) the direction of causation can be detected by a lagged correlation (or lagged regression). Under hypothesis X → Y, the following has to be true (Campbell & Stanley, 1966, p. 69):
r(Xt−r, Yt) > r(Xt, Yt−r)

where t is the time period and r the length of the lag in periods (r = 1, 2, 3 …). Otherwise, it indicates that the hypothesis is wrong and causality has the opposite direction. A time lag can also obscure a causal relationship. Thus, r(Xt, Yt) might not be significant, but r(Xt−r, Yt) is. This should be considered from the outset if there are reasons to suspect a lagged relationship. The relationship between sales and advertising is an example where time lags frequently occur. Regression analysis (see Chap. 2) can cope with this by including lagged variables.

Condition 3: Exclusion of Other Causes
As stressed above, there can be a significant correlation between X and Y without a causal relationship, because it is caused by a third variable Z. In this case, we speak of non-causal or spurious correlations. Thus, it should be made sure that there are no third variables that are causing a spurious correlation between X and Y. It has also been argued in the literature that the absence of plausible rival hypotheses increases the plausibility of a hypothesis (Campbell & Stanley, 1966, p. 65).
The world is complex and usually numerous factors are influencing an empirical variable Y. To account for such multi-causal relationships, we can use multivariate methods like regression analysis, variance analysis, logistic regression, or discriminant analysis. All these methods are described in the following chapters. Usually, however, not all influencing factors can be observed and included in a model. The art of model building requires an identification of the relevant variables. As Albert Einstein said: “A model should be simple, but not too simple.”
With the help of statistics, we can measure the correlation between two variables, but this does not prove that a causal relationship exists. A correlation between variables is a necessary but not a sufficient condition for causality. The other two conditions must be fulfilled as well. The most reliable evidence for a causal relationship is provided by a controlled experiment (Campbell & Stanley, 1966; Green et al., 1988).
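Referring back to Condition 2, the lagged correlations can be computed directly from two time series. The sketch below is ours and not from the book; under the hypothesis X → Y, the first coefficient should exceed the second.

```python
import numpy as np

def lagged_correlations(x, y, lag=1):
    """Compare r(X_{t-lag}, Y_t) with r(X_t, Y_{t-lag}) to probe the direction X -> Y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    r_x_leads = np.corrcoef(x[:-lag], y[lag:])[0, 1]   # X precedes Y
    r_y_leads = np.corrcoef(x[lag:], y[:-lag])[0, 1]   # Y precedes X
    return r_x_leads, r_y_leads
```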
1.5 Outliers and Missing Values
The results of empirical analyses can be distorted by observations with extreme values that do not correspond to the values “normally” expected. Likewise, missing data can lead to distortions, especially if they are not treated properly when analyzing the data.
1.5.1 Outliers
Empirical data often contain one or more outliers, i.e., observations that deviate substantially from the other data. Such outliers can have a strong influence on the result of the analysis. Outliers can arise for different reasons. They can be due to
• chance (random),
• a mistake in measurement or data entry,
• an unusual event.
1.5.1.1 Detecting Outliers
When faced with a great number of numerical values, it can be tedious to find unusual ones. Even for a small data set as the one in Table 1.15 it is not easy to detect possible outliers just by looking at the raw data. Numerical and/or graphical methods may be used for detecting outliers, with graphical methods such as histograms, boxplots, and scatterplots usually being more convenient and efficient (du Toit et al., 1986; Tukey, 1977).

Table 1.15 Example data: observed and standardized values

              Observed data     Standardized data
Observation   X1      X2        Z1       Z2
1             26      26         0.41     0.35
2             34      30         1.27     0.84
3             19      29        –0.35     0.71
4             20      24        –0.24     0.10
5             19      14        –0.35    –1.12
6             23      30         0.08     0.84
7             20      27        –0.24     0.47
8             32      33         1.05     1.20
9             12       7        –1.11    –1.97
10             6       9        –1.76    –1.73
11            11      17        –1.22    –0.75
12            29      22         0.73    –0.14
13            15      15        –0.78    –0.99
14            16      26        –0.68     0.35
15            24      18         0.19    –0.63
16            46      39         2.56     1.93
17            30      26         0.84     0.35
18            15      21        –0.78    –0.26
19            20      19        –0.24    –0.51
20            28      31         0.62     0.96
Mean          22.3    23.2       0.00     0.00
Std. dev.      9.26    8.20      1.00     1.00
Fig. 1.10 Histogram of variable X1
A simple numerical method for detecting outliers is the standardization of data.

Standardization of Data
Table 1.15 shows the observed values of two variables, X1 and X2, and their standardized values, called z-values. We can see that only one z-value exceeds 2 (observation 16 of variable X1). If we assume that the data follow a normal distribution, a value >2 has a probability of less than 5%. The occurrence of a value of 2.56, as observed here, has a probability of less than 1%. Thus, this value is unusual and we can identify it as an outlier.
The effect of an outlier on a statistical result can be easily quantified by repeating the computations after discarding the outlier. Table 1.15 shows that the mean of variable X1 is 22.3. After discarding observation 16, we get a mean of 21.0. Thus, the mean value changes by 1.3. The effect will be smaller for larger sample sizes. Especially for small sample sizes, outliers can cause substantial distortions.

Histograms
Figure 1.10 shows a histogram of variable X1, with the outlier at the far right of the figure.18

Boxplots
A more convenient graphical means is the boxplot. Figure 1.11 shows the boxplots of variables X1 and X2, with the outlier showing up above the boxplot of X1. A boxplot (also called box-and-whisker plot) is based on the percentiles of data. It is determined by five statistics of a variable:
18 The histogram was created with Excel by selecting “Data/Data Analysis/Histogram”. In SPSS, histograms are created by selecting “Analyze/Descriptive Statistics/Explore”.
Fig. 1.11 Boxplots of the variables X1 and X2
• maximum,
• 75th percentile,
• 50th percentile (median),
• 25th percentile, and
• minimum.

The bold horizontal line in the middle of each box represents the median, i.e., 50% of the values are above this line and 50% are below. The upper rim of the box represents the 75th percentile and the lower rim represents the 25th percentile. Since these three percentiles, the 25th, 50th, and 75th percentiles, divide the data into four equal parts, they are also called quartiles. The height of the box represents 50% of the data and indicates the dispersion (spread, variation) and skewness of the data. The whiskers extending above and below the boxes represent the complete range of the data, from the smallest to the largest value (but without outliers). Outliers are defined as points that are more than 1.5 box lengths away from the rim of the box.19
19 In SPSS we can create boxplots (just like histograms) by selecting “Analyze/Descriptive Statistics/Explore”. But don’t be surprised if observation 16 with value 46 is not flagged as an outlier: for our data, the rule of 1.5 box lengths above the rim of the box gives a cutoff value of 47. This rule, too, is not entirely free of arbitrariness. Here we want to demonstrate how an outlier is represented in the boxplot.
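Both detection rules can be applied with a few lines of Python (our sketch, not from the book). Note that the exact quartiles, and hence the boxplot fences, depend on the interpolation method used.

```python
import numpy as np

x1 = np.array([26, 34, 19, 20, 19, 23, 20, 32, 12, 6, 11, 29, 15, 16,
               24, 46, 30, 15, 20, 28])

# Standardization: flag observations with |z| > 2 (observation 16, z = 2.56).
z = (x1 - x1.mean()) / x1.std(ddof=1)
print(np.where(np.abs(z) > 2)[0] + 1)        # [16]

# Boxplot rule: points more than 1.5 box lengths (IQR) beyond the box rims.
q1, q3 = np.percentile(x1, [25, 75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
print(upper)                                 # about 47, cf. footnote 19
print(x1[(x1 < lower) | (x1 > upper)])       # [] -> value 46 is not flagged
```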
Fig. 1.12 Scatterplot of the variables X1 and X2
Scatterplots
Histograms and boxplots are univariate methods, which means that we are looking for outliers for each variable separately. The situation is different if we analyze the relationship between two or more variables. If we are interested in the relationship between two variables, we can display the data by using a scatterplot (Fig. 1.12). Each dot represents an observation of the two variables X1 and X2. The relationship between X1 and X2 can be represented by a linear regression line (dashed line in Fig. 1.12). We can see that observation 16 (at the right end of the regression line), which we identified as an outlier from a univariate perspective, fits the linear regression model quite well. The slope of the line will not substantially be affected if we eliminate the outlier. However, it is also possible that an outlier impacts the slope of the regression line and biases the results.
1.5.1.2 Dealing with Outliers
In any case, it should always be investigated what caused the appearance of an outlier. Sometimes it is possible to correct a mistake made in data collection or entry. An observation should only be removed if we have reason to believe that a mistake has occurred or if we find proof that the outlier was caused by an unusual event outside the research context (e.g., a strike of the union or an electricity shutdown). In all other cases, outliers should be retained in the data set. If the outlier is due to chance, it does not pose a problem and does not have to be eliminated. Indeed, by removing outliers one can possibly manipulate the results. If one actually does so for some good reason, this should always be documented in any report or publication.
Some methods discussed in this book (e.g., regression analysis or factor analysis) are rather sensitive to outliers. We will therefore discuss the issue of outliers in detail in those chapters.
1.5.2 Missing Values
Missing values are an unavoidable problem when conducting empirical studies and frequently occur in practice. The reasons for missing values are manifold. Some examples are:
• Respondents forgot to answer a question.
• Respondents cannot or do not want to answer a question.
• Respondents answered outside the defined answer interval.
The problem with missing values is that they can lead to distorted results. The validity of the results can also be limited since many methods require complete data sets and cases with a missing value have to be deleted (i.e., listwise deletion). Finally, missing values also represent a loss of information, so the validity of the results is reduced as compared to analyses with complete data sets.
Statistical software packages offer the possibility of taking missing values into account in statistical analyses. Since all case studies in this book are using IBM SPSS, the following is a brief description of how this statistical software package can identify missing values. There are two options:
• System missing values: Absent values (empty cells) in a data set are automatically identified by SPSS as so-called ‘system missing values’ and shown as dots (.) in data view.
• User missing values: Missing values can also be coded by the user. For this purpose, the ‘Variable View’ must be called up in the data editor (button at the bottom left). There, missing data may be indicated by entering them in ‘Missing’ (see Fig. 1.13). Any value can be used as an indicator of a missing value if it is outside the range of the valid values of a variable (e.g., “9999” or “0000”). Different codes may be used for different specifications of missing values (e.g., 0 for “I don’t know” and 9 for “Response denied”). These “user missing values” are then excluded from the following analyses.

Dealing with Missing Values in SPSS
SPSS provides the following three basic options for dealing with missing values:
Fig. 1.13 Definition of user missing values in the data editor of IBM SPSS
• The values are excluded "case by case" ("Exclude cases listwise"), i.e., as soon as a missing value occurs, the whole case (observation) is excluded from further analysis. This often reduces the number of cases considerably. The "listwise" option is the default setting in SPSS.
• The values are excluded variable by variable ("Exclude cases pairwise"), i.e., in the absence of a value only pairs with this value are eliminated. If, for example, a value is missing for variable j, only the correlations with variable j are affected in the calculation of a correlation matrix. In this way, the coefficients in the matrix may be based on different numbers of cases. This may result in an imbalance of the variables.
• There is no exclusion at all. Average values ("Replace with mean") are inserted for the missing values. This may lead to a reduced variance if many missing values occur and to a distortion of the results.
For option 3, SPSS offers an extra procedure, which can be called up by the menu sequence Transform/Replace Missing Values (cf. Fig. 1.14). With this procedure, the user can decide per variable which information should replace missing values in a data set. The following options are available:
• Series mean,
• Mean of nearby points (the number of nearby points may be defined as 2 to all),
• Median of nearby points (number of nearby points: 2 to all),
• Linear interpolation,
• Linear trend at point.
Fig. 1.14 SPSS procedure ‘Replace Missing Values’
For cross-sectional data, only the first two options make sense, since the missing values of a variable are replaced with the mean or median (nearby points: all) of the entire data series. The remaining options are primarily aimed at time series data, in which case the order of the cases in the data set is important: With the options "Mean of nearby points" and "Median of nearby points", the user can decide how many observations before and after the missing value are used to calculate the mean or the median for a missing value. In the case of "Linear interpolation", the mean is derived from the immediate predecessor and successor of the missing value. "Linear trend at point" calculates a regression (see Chap. 2) on an index variable scaled 1 to N. Missing values are then replaced with the estimated value from the regression. With the menu sequence Analyze/Multiple Imputation, SPSS offers a good way of replacing missing values with realistic estimated values. SPSS also offers the possibility to analyze missing values under the menu sequence Analyze/Missing Value Analysis.
In addition to the general options for handling missing values described above, some of the analytical procedures of SPSS also offer options for handling missing values. Table 1.16 summarizes these options for the methods discussed in this book.

How the User Can Deal With Missing Values
The designation System missing values is automatically assigned by SPSS if values are missing in a case (default setting). These cells are then ignored in any calculations, e.g., of the statistical parameters (see Sect. 1.2). However, this leads to the problem that variables with very different numbers of valid cases are included in the calculations (pairwise exclusion of missing values). Such distortions can be avoided if all cases with an invalid
Table 1.16 Procedure-specific options of missing values
Regression analysis: Exclude cases listwise; Exclude cases pairwise; Replace with mean
Analysis of variance (ANOVA): Exclude cases listwise; Exclude cases analysis by analysis
Discriminant analysis: Dialog box 'Classification': Replace missing values with mean
Logistic regression: No separate missing value options in the procedure
Contingency analysis: No separate missing value options in the procedure
Factor analysis: Exclude cases listwise; Exclude cases pairwise; Replace with mean
Cluster analysis: No separate missing value options in the procedure
Conjoint analysis: No separate missing value options in the procedure
value are completely excluded (listwise exclusion of missing values). But this may result in a greatly reduced number of cases. Replacing missing values with other values is therefore a good way to counteract this effect and avoid unequal weighting. Besides, the option user missing values offers the advantage that the user can differentiate missing values in terms of content. Missing values that should not be included in the calculation of statistical parameters may still provide specific information, i.e., whether a respondent is unable to answer (does not know) or does not want to answer (no information). If the option to differentiate "missing values" in such a way is integrated into the design of a survey from the start, important information can be derived from it. Finally, it should be emphasized again that it is important to make sure that missing values are marked as such in SPSS, so that they are not included in calculations and do not distort the results.
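The same basic strategies can be sketched in R (cf. Sect. 1.6). The small data frame, the variable names, and the missing-value code 9999 below are our own illustrative assumptions, not taken from the book's data sets:

```r
# Minimal R sketch of the missing-value strategies discussed above.
d <- data.frame(age    = c(25, 34, 9999, 41, 29),
                income = c(2300, NA, 3100, 2800, 9999))

# Recode user-defined missing-value codes to NA (the R analogue of system missing):
d[d == 9999] <- NA

# Listwise deletion: keep only complete cases
d_listwise <- na.omit(d)

# Pairwise deletion, e.g., for a correlation matrix
cor(d, use = "pairwise.complete.obs")

# Simple mean imputation (reduces variance; see the caveats above)
d$income[is.na(d$income)] <- mean(d$income, na.rm = TRUE)
```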
1.6 How to Use IBM SPSS, Excel, and R
In this book, we primarily use the software IBM SPSS Statistics (or SPSS) for the different methods of multivariate analysis, because SPSS is widely used in science and practice. The name SPSS was originally an acronym for Statistical Package for the Social Sciences. Over time, the scope of SPSS has been expanded to cover almost all areas of data analysis. IBM SPSS Statistics may be run on the operating systems Windows, macOS, and Linux. It includes a base module and several extension modules. Apart from the full
Table 1.17 Analysis methods and SPSS procedures
Method of analysis: SPSS procedure (SPSS module)
Regression analysis: REGRESSION (Statistics Base)
Variance analysis: UNIANOVA, ONEWAY, GLM (Statistics Base)
Discriminant analysis: DISCRIMINANT (Statistics Base)
Logistic analysis: LOGISTIC REGRESSION, NOMREG (Advanced Statistics or SPSS Regression)
Contingency analysis / Cross tabulation: CROSSTABS (Statistics Base), LOGLINEAR (Advanced Statistics), HILOGLINEAR (Advanced Statistics)
Factor analysis: FACTOR (Statistics Base)
Cluster analysis: CLUSTER, QUICK CLUSTER (Statistics Base)
Conjoint analysis: CONJOINT, ORTHOPLAN, PLANCARDS (SPSS Conjoint)
version of IBM SPSS Statistics Base, a lower-cost student version is available for educational purposes. This has some limitations that are unlikely to be relevant to the majority of users: Data files can contain a maximum of 50 variables and 1500 cases, and the SPSS command syntax (command language) and extension modules are not available. To use SPSS, the basic IBM SPSS Statistics Base package must be purchased, which contains the basic statistical analysis procedures. This basic module is also a prerequisite for purchasing additional packages or modules, which usually focus on specific analysis procedures such as SPSS Regression (regression analysis), SPSS Conjoint (conjoint analysis), or SPSS Neural Networks. An alternative option is to use the IBM SPSS Statistics Premium package, which includes all the procedures of the Basic and Advanced packages and is available to students at most universities. Table 1.17 provides an overview of the analytical methods covered in this book and the associated SPSS procedures, all of which are included in the SPSS Premium package. They run under the common user interface of SPSS Statistics. For readers without the SPSS Premium package, the column "SPSS module" lists those SPSS modules or packages that contain the corresponding procedures.
The various data analysis methods can be selected in SPSS via a graphical user interface. This user interface is constantly being improved and extended. Using the available menus and dialog boxes, even complex analyses can be performed in a very convenient way. Thus, the command language (command syntax) previously required to control the
program is hardly used any more, but it still has some advantages for the user, such as the customization of analyses. All chapters in this book therefore contain the command sequences required to carry out the analyses.
There are several books on how to use IBM SPSS, all of which provide a very good introduction to the package:
• George, D. & Mallery, P. (2019). IBM SPSS Statistics 26 Step by Step (16th ed.). London: Taylor & Francis Ltd.
• Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). London: Sage Publication Ltd.
• Härdle, W. K., & Simar, L. (2015). Applied Multivariate Statistical Analysis (5th ed.). Heidelberg: Springer.
IBM SPSS also provides several manuals under the link https://www.ibm.com/support/pages/ibm-spss-statistics-29-documentation, which are regularly updated. Users who work with the programming language R will find notes on how to use it for data analysis under the link www.multivariate-methods.info. In addition, a series of Excel files for each analysis method is also provided on the website www.multivariate-methods.info, which should help the readers familiarize themselves more easily with the various methods.
References
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Rand McNally.
du Toit, S. H. C., Steyn, A. G. W., & Stumpf, R. H. (1986). Graphical exploratory data analysis. Springer.
Freedman, D. (2002). From association to causation: Some remarks on the history of statistics (p. 521). Technical Report, University of California.
Gigerenzer, G. (2002). Calculated risks. Simon & Schuster.
Green, P. E., Tull, D. S., & Albaum, G. (1988). Research for marketing decisions (5th ed.). Prentice Hall.
Hastie, T., Tibshirani, R., & Friedman, J. (2011). The elements of statistical learning. Springer.
Pearl, J., & Mackenzie, D. (2018). The book of why: The new science of cause and effect. Basic Books.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
Watson, J., Whiting, P. F., & Brush, J. E. (2020). Interpreting a covid-19 test result. British Medical Journal, 369, m1808.
Further Reading
Anderson, D. R., Sweeney, D. J., & Williams, T. A. (2007). Essentials of modern business statistics with Microsoft Excel. Thomson.
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage.
Fisher, R. A. (1990). Statistical methods, experimental design, and scientific inference. Oxford University Press.
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). Norton.
George, D., & Mallery, P. (2021). IBM SPSS statistics 27 step by step: A simple guide and reference (17th ed.). Routledge.
Sarstedt, M., & Mooi, E. (2019). A concise guide to market research: The process, data, and methods using IBM SPSS statistics (3rd ed.). Springer.
Wonnacott, T. H., & Wonnacott, R. J. (1977). Introductory statistics for business and economics (2nd ed.). Wiley.
2 Regression Analysis

Contents
2.1 Problem
2.2 Procedure
2.2.1 Model Formulation
2.2.2 Estimating the Regression Function
2.2.2.1 Simple Regression Models
2.2.2.2 Multiple Regression Models
2.2.3 Checking the Regression Function
2.2.3.1 Standard Error of the Regression
2.2.3.2 Coefficient of Determination (R-square)
2.2.3.3 Stochastic Model and F-test
2.2.3.4 Overfitting and Adjusted R-Square
2.2.4 Checking the Regression Coefficients
2.2.4.1 Precision of the Regression Coefficient
2.2.4.2 t-test of the Regression Coefficient
2.2.4.3 Confidence Interval of the Regression Coefficient
2.2.5 Checking the Underlying Assumptions
2.2.5.1 Non-linearity
2.2.5.2 Omission of Relevant Variables
2.2.5.3 Random Errors in the Independent Variables
2.2.5.4 Heteroscedasticity
2.2.5.5 Autocorrelation
2.2.5.6 Non-normality
2.2.5.7 Multicollinearity and Precision
2.2.5.8 Influential Outliers
2.3 Case Study
2.3.1 Problem Definition
2.3.2 Conducting a Regression Analysis With SPSS
2.3.3 Results
2.3.3.1 Results of the First Analysis
2.3.3.2 Results of the Second Analysis
2.3.3.3 Checking the Assumptions
2.3.3.4 Stepwise Regression
2.3.4 SPSS Commands
2.4 Modifications and Extensions
2.4.1 Regression With Dummy Variables
2.4.2 Regression Analysis With Time-Series Data
2.4.3 Multivariate Regression Analysis
2.5 Recommendations
References
2.1 Problem
Regression analysis is one of the most useful and thus most frequently used methods of statistical data analysis. With the help of regression analysis, one can analyze the relationships between variables. For example, one can find out if a certain variable is influenced by another variable, and if so, how strong this effect is. In this way one can learn how the world works. Regression analysis can be used in the search for truth, which can be very exciting. Regression analysis is very useful when searching for explanations or making decisions or predictions. Thus, regression analysis is of eminent importance for all empirical sciences as well as for solving practical problems. Table 2.1 lists examples of the application of regression analysis.
Regression analysis takes a special position among the methods of multivariate data analysis. The invention of regression analysis by Sir Francis Galton (1822–1911) in connection with his studies on heredity1 can be considered the birth of multivariate data analysis. Stigler (1997, p. 107) calls it "one of the grand triumphs of the history of science". Moreover, regression analysis provides a basis for numerous other methods used today in big data analysis and machine learning. For an understanding of these other, often more complex methods of multivariate data analysis, a profound knowledge of regression analysis is indispensable.
While regression analysis is a relatively simple method within the field of multivariate data analysis, it is still prone to mistakes and misunderstandings. Thus, wrong results or wrong interpretations of the results of regression analysis are frequent. This concerns above all the underlying assumptions of the regression model. We will come back to this later but will add a word of caution here. Regression analysis can be very helpful for finding causal relationships, and this is the main reason for its application. But neither regression analysis nor any other statistical method can prove causality. For this purpose,
1 Galton (1886) investigated the relationship between the body heights of parents and their adult children. He "regressed the height of children on the height of parents".
Table 2.1 Application examples of regression analysis in different disciplines
Agriculture: How does crop yield depend on the amounts of rainfall, sunshine, and fertilizers?
Biology: How does bodyweight change with the quantity of food intake?
Business: What revenue and profit can we expect for the next year?
Economics: How does national income depend on government expenditures?
Engineering: How does production time depend on the type of construction, technology, and labor force?
Healthcare: How does health depend on diet, physical activity, and social factors?
Marketing: What are the effects of price, advertising, and distribution on sales?
Medicine: How is lung cancer affected by smoking and air pollution?
Meteorology: How does the probability of rainfall change with variables like temperature, humidity, air pressure, etc.?
Psychology: How important are income, health, and social relations for happiness?
Sociology: What is the relationship between income, age, and education?
reasoning beyond statistics and information about the generation of the data may be needed.
First, we want to show here how regression analysis works. For the application of regression analysis, the user (researcher) must decide which one is the dependent variable that is influenced by one or more other variables, the so-called independent variables. The dependent variable must be on a metric (quantitative) scale. The researcher further needs empirical data on the variables. These may be derived from observations or experiments and may be cross-sectional or time-series data. Somewhat bewildering for the novice are the different terms that are used interchangeably in the literature for the variables of regression analysis and that vary by the author and the context of the application (see Table 2.2).

Table 2.2 Regression analysis and terminology
Regression analysis (RA) is used to
• describe and explain relationships between variables,
• estimate or predict the values of a dependent variable.
Dependent variable (output): Y; also called explained variable, regressand, response variable, y-variable. Example: sales of a product.
Independent variable(s) (input): X1, X2, ..., Xj, ..., XJ; also called explanatory variables, regressors, predictors, covariates, x-variables. Example: price, advertising, quality, etc.
In linear regression, the variables are assumed to be quantitative. By using binary variables (dummy variable technique), qualitative regressors can also be analyzed.
Example
When analyzing the relationship between the sales volume of a product and its price, sales will usually be the dependent variable, also called the response variable, explained variable or regressand, because sales volume usually responds to changes in price. The price will be the independent variable, also called predictor, explanatory variable, or regressor. So, an increase in price may explain why the sales volume has declined. And the price may be a good predictor of future sales. With the help of regression analysis, one can predict the expected sales volume if the price is changed by a certain amount. ◄

Simple Linear Regression
The relationship between sales volume and advertising expenditures poses one of the big problems in business since a lot of money is spent without knowing much about the effects. Many efforts have been made to learn more about this relationship with the help of regression analysis. Various elaborate models have been developed (e.g. Leeflang et al., 2000, pp. 66–99). We will start here with a very simple one.
In simple regression, we are looking for a regression function of Y on X. For example, we assume that the sales volume is influenced by advertising and write in a very general form

sales = f(advertising) or Y = f(X)   (2.1)

f(·) is an unknown function that we want to estimate:

estimated sales = f̂(advertising) or Yˆ = f̂(X)   (2.2)

Of course, the estimated values are not identical with the real (observed) values. That is why the variable for estimated sales is denoted by Yˆ (Y with a hat). To get a quantitative estimate for the relationship in Eq. (2.2), we must specify its structure. In linear regression, we assume:

Yˆ = a + b X   (2.3)

With given data of Y and X, regression analysis can find values for the parameters a and b. Parameters are numerical constants in a model whose values we want to estimate. Parameters that accompany (multiply) a variable (such as b) are also called coefficients. Let us assume that the estimation yields the following result:

Yˆ = 500 + 3 X   (2.4)
Fig. 2.1 Estimated regression line

Figure 2.1 illustrates this function. Parameter b (the coefficient of X) is an indicator of the strength of the effect of advertising on sales. Geometrically, b is the slope of the
regression line. If advertising increases by 1 Euro, in this example sales will increase by 3 units. Parameter a (the regression intercept) reflects the basic level of sales if there is no advertising (X = 0).
With the help of the estimated regression function, a manager can answer questions like:
• How will sales change if advertising expenditures are changed?
• What sales can be expected for a certain advertising budget?
So, for example, if the advertising budget is 100 Euros, we will expect sales to be

Yˆ = 500 + 3 · 100 = 800 units.   (2.5)
And if advertising is increased to 120 Euros, sales will increase to 860 units. Furthermore, if the variable costs per unit of the product are known, we can find out whether an increase in advertising is profitable or not. Thus, regression analysis can be used as a powerful tool to support decision making. The above regression function is an example of a so-called simple linear regression or bivariate regression. Unfortunately, the relationship between sales volume and advertising is usually not linear. But a linear function can be a good approximation in a limited interval around the current advertising budget (within the range of observed values). Another problem is that sales are not solely influenced by advertising. Besides advertising, sales volumes also depend on the price of the product, its quality, its distribution,
and many other influences.2 So, with simple linear regression, we usually can get only very rough estimates of sales volumes.

Multiple Regression
With multiple regression analysis one can take into account more than one influencing variable by including them all in the regression function. So, Eq. (2.1) can be extended to a function with several independent variables:

Y = f(X1, X2, ..., Xj, ..., XJ)   (2.6)

Choosing again a linear structure, we get:

Yˆ = a + b1 X1 + b2 X2 + ... + bj Xj + ... + bJ XJ   (2.7)
By including more explanatory variables, the predictions of Y can become more precise. However, there are limitations to extending the model. Often, not all influencing variables are known to the researcher, or some observations are not available. Also, with an increasing number of variables the estimation of the parameters can become more difficult.
2.2 Procedure
In this section, we will show how regression analysis works. The procedure can be structured into five steps that are shown in Fig. 2.2. The steps of regression analysis are demonstrated using a small example with three independent variables and 12 cases (observations) as shown in Table 2.3.3

Example
The manager of a chocolate manufacturer is not satisfied with the sales volume of chocolate bars. He would like to find out how he can influence the sales volume. To this end, he collected quarterly sales data from the last three years. In particular, he took data on sales volume, retail price, and expenditures for advertising and sales promotion. Data on retail sales and prices were acquired from a retail panel (Table 2.3). ◄
2 Sales can also depend on environmental factors like competition, social-economic influences, or weather. Another difficulty is that advertising itself is a complex bundle of factors that cannot simply be reduced to expenditures. The impact of advertising depends on its quality, which is difficult to measure, and it also depends on the media that are used (e.g., print, radio, television, internet). These and other reasons make it very difficult to measure the effect of advertising.
3 On the website www.multivariate-methods.info we provide supplementary material (e.g., Excel files) to deepen the reader's understanding of the methodology.
Fig. 2.2 The five-step procedure of regression analysis: (1) Model formulation, (2) Estimating the regression function, (3) Checking the regression function, (4) Checking the regression coefficients, (5) Checking the underlying assumptions
Table 2.3 Data of the application example
Period i   Sales [1000 units]   Advertising [1000 EUR]   Price [EUR/unit]   Promotion [1000 EUR]
1          2596                 203                      1.42               150
2          2709                 216                      1.41               120
3          2552                 207                      1.95               146
4          3004                 250                      1.99               270
5          3076                 240                      1.63               200
6          2513                 226                      1.82               93
7          2626                 246                      1.69               70
8          3120                 250                      1.65               230
9          2751                 235                      1.99               166
10         2965                 256                      1.53               116
11         2818                 242                      1.69               100
12         3171                 251                      1.72               216
Mean       2825.1               235.2                    1.71               156.43
Std-dev.   234.38               18.07                    0.20               61.53
2.2.1 Model Formulation
The first step in performing a regression analysis is the formulation of a model. A model is a simplified representation of a real-world phenomenon. It should have some structural or functional similarity with reality. A city map, for example, is a simplified visual model of a city that shows its streets and their courses. A globe is a three-dimensional model of Earth.
In regression analysis, we deal with mathematical models. The specification of regression models comprises:
• choosing and defining the variables,
• specifying the functional form,
• assumptions about errors (random influences).4
A model should always be as simple as possible (principle of parsimony) and as complex as necessary. Thus, modeling is always a balancing act between simplicity and complexity (completeness). A model must be able to capture one or more relevant aspects of interest to the user. However, the more completely a model represents reality, the more complex it becomes, and its handling becomes increasingly difficult or even impossible.5 The appropriate level of detail depends on the intended use, but also on the user's experience and the available data. An evolutionary approach is often useful, starting with a simple model, which is then extended with increasing experience and expertise (Little, 1970).
A model becomes more complex with the number of variables. For explaining sales, there exists a great number of candidate explanatory variables. Our manager starts with a simple model and chooses only one variable for explaining sales. He assumes that sales volume is mainly influenced by advertising expenditures. Thus, he chooses sales as the dependent variable and advertising as the independent variable and formulates the following model:

sales = f(advertising) or Y = f(X)

The manager further assumes that the effect of advertising is positive, i.e. that the sales volume increases with increasing advertising expenditures. To check this hypothesis, he inspects the data in Table 2.3. It is always useful to visualize the data by a scatterplot (dot diagram), as shown in Fig. 2.3. This should be the first step of an analysis. Each observation of sales and advertising in Table 2.3 is represented by a point in Fig. 2.3. The first point at the left is the point (x1, y1), i.e. the first observation with the values 203 and 2596. Using Excel or SPSS, such scatter diagrams can be easily created, even for large amounts of data.
4 See Sects. 2.2.3.3 and 2.2.5.
5 In regression analysis we encounter the problem of multicollinearity. We will deal with this problem in Sect. 2.2.5.7.
Fig. 2.3 Scatterplot of the observed values for sales and advertising
The scatterplot shows that the sales volume tends to increase with advertising. We can see some linear association between sales and advertising.6 This confirms the hypothesis of the manager that there is a positive relationship between sales and advertising. For the correlation (Pearson’s r) the manager calculates rxy = 0.74. Moreover, the manager assumes that the relationship between sales and advertising can be approximately represented by a linear regression line, as shown in Fig. 2.4. The situation would be different if we had a scatterplot as shown in Fig. 2.5. This indicates a non-linear relationship. Advertising response is always non-linear. Linear models are in almost all cases a simplification of reality. But they can provide good approximations and are much easier to handle than non-linear models. So, for the data in Fig. 2.5, a linear model could be appropriate for a limited range of advertising
6 The terms association and correlation are widely and often interchangeably used in data analysis. But there are differences. Association of variables refers to any kind of relation between variables. Two variables are said to be associated if the values of one variable tend to change in some systematic way along with the values of the other variable. A scatterplot of the variables will show a systematic pattern. Correlation is a more specific term. It refers to associations in the form of a linear trend. And it is a measure of the strength of this association. Pearson's correlation coefficient measures the strength of a linear trend, i.e. how close the points are lying on a straight line. Spearman's rank correlation can also be used for non-linear trends.
Fig. 2.4 Scatterplot with a linear regression line
Fig. 2.5 Scatterplot with a non-linear association
expenditures, e.g. from zero to 200. For modeling advertising response over the complete range of expenditures, a non-linear formulation would be necessary (for handling non-linear relations see Sect. 2.2.5.1). The regression line in Fig. 2.4 can be mathematically represented by the linear function:
Yˆ = a + b X   (2.8)
Fig. 2.6 The linear regression function
with
Yˆ   estimated sales
X    advertising expenditures
a    constant term (intercept)
b    regression coefficient

The meaning of the regression parameters a and b is illustrated in Fig. 2.6. Parameter a (intercept) indicates the intersection of the regression line with the y-axis (the vertical axis or ordinate) of the coordinate system. This is the value of the regression line for X = 0 or no advertising. Parameter b indicates the slope of the regression line. It holds that

b = ΔYˆ / ΔX   (2.9)
Parameter b tells us how much Y will probably increase if X is increased by one unit.
2.2.2 Estimating the Regression Function
Table 2.4 Data for sales and advertising with basic statistics
Year i          Sales Y    Advertising X
1               2596       203
2               2709       216
3               2552       207
4               3004       250
5               3076       240
6               2513       226
7               2626       246
8               3120       250
9               2751       235
10              2965       256
11              2818       242
12              3171       251
Mean            2825       235.2
Std-deviation   234.38     18.07
Correlation rxy: 0.742
A mathematical model, like the regression function in Eq. (2.3), must be adapted to reality. The parameters of the model must be estimated based on a data set (observations of the variables). This process is called model estimation or calibration. We will demonstrate this with the data from Table 2.3, first for simple regression, and then for multiple regression.
2.2.2.1 Simple Regression Models
The procedure of estimation is based on the method of least squares (LS) that we will explain in the following. Table 2.4 shows the data for sales and advertising (transferred from Table 2.3) and the values of some basic statistics.
The regression coefficient b can be calculated as:

b = Σi (xi − x)(yi − y) / Σi (xi − x)² = 34,587 / 3,592 = 9.63   (2.10)

With the statistics for the standard deviations and the correlation of the two variables given in Table 2.4, we can calculate the regression coefficient more easily as7

b = rxy · sy/sx = 0.742 · 234.38/18.07 = 9.63   (2.11)

7 These basic statistics can be easily calculated with the Excel functions AVERAGE(range) for mean, STDEV.S(range) for std-deviation, and CORREL(range1;range2) for correlation.
Fig. 2.7 Regression line and SD line
With the value of coefficient b, we get the constant term through

a = y − b · x = 2825 − 9.63 · 235.2 = 560   (2.12)

The resulting regression function as shown in Fig. 2.4 is:

Yˆ = 560 + 9.63 X   (2.13)
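This estimation can be reproduced in R (cf. Sect. 1.6); a minimal sketch using the data of Table 2.4 (the object names are our own):

```r
# Simple regression of sales on advertising, cf. Eqs. (2.10)-(2.13).
sales <- c(2596, 2709, 2552, 3004, 3076, 2513, 2626, 3120, 2751, 2965, 2818, 3171)
adv   <- c(203, 216, 207, 250, 240, 226, 246, 250, 235, 256, 242, 251)

b <- cor(sales, adv) * sd(sales) / sd(adv)   # Eq. (2.11): b = r * sy/sx
a <- mean(sales) - b * mean(adv)             # Eq. (2.12)
c(a = a, b = b)                              # approximately 560 and 9.63

# The same estimates via ordinary least squares:
lm(sales ~ adv)
```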
The value of b = 9.63 states that if advertising expenditures are increased by 1 EUR, sales may be expected to rise by 9.63 units. The value of the regression coefficient can provide important information for the manager. If we assume that the contribution margin per chocolate bar is 0.20 EUR, spending 1 EUR more on advertising will increase the total contribution margin by 0.20 EUR · 9.63 = 1.93 EUR. Thus, by spending 1.00 EUR, net profit will increase by 0.93 EUR. If the profit per unit were only 1.00/9.63 = 0.10 EUR or less, an increase in advertising expenditures would not be profitable.

Understanding Regression
A regression line for given data must always pass through the centroid of the data (the point of means or point of averages). This is a consequence of the least-squares estimation. In our case, the point of means is [x, y] = [235, 2825] (marked by a bullet in Fig. 2.7). For standardized variables with x = y = 0 and sx = sy = 1, the regression line will pass through the origin of the coordinate system, the point [x, y] = [0, 0]. For the constant term, we get a = 0, as we can see from Eq. (2.12). And from Eq. (2.11) we can see that the regression coefficient (the slope of the regression line) is simply the same as the correlation coefficient. In our case, after standardization we would get:
b = rxy = 0.74. For the original variables X and Y, the slope b also depends on the standard deviations of X and Y, sx and sy. Only for sx = sy are the values of b and rxy identical. If sy > sx, then the slope of the regression line will be larger than the correlation (b > rxy) and vice versa. The greater sy, the greater the coefficient b will be, and the greater sx, the smaller b will be. As the standard deviation of any variable changes with its scale, the regression coefficient b also depends on the scaling of the variables. If, e.g., the advertising expenditures are given in cents instead of EUR, b will be diminished by the factor 100. The effect of a change by one cent is just 1/100 of the effect of a change by one EUR. By changing the scale of the variables, the researcher can arbitrarily change the standard deviations and thus the regression coefficient. But he cannot change the value of the correlation coefficient rxy, since its value is independent of differences in scale.
The line through the point of means with the slope sy/sx (with the same sign as the correlation coefficient) is called the standard deviation line (SD line) (Freedman et al. 2007, pp. 130–131). This line is known before performing a regression analysis. For our data, we get sy/sx = 13. In Fig. 2.7 the SD line is represented by the dashed line. For rxy = 1, the regression line is identical with the SD line. But for empirical data, we will always get rxy < 1. Thus, it follows that the regression line will always be flatter than the SD line, i.e. |b| < sy/sx. This effect is called the regression effect, from which regression analysis got its name. For our data we get b = 9.63 < 13 = sy/sx.
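A quick check in R illustrates the point made above: regressing the standardized variables on each other returns the correlation coefficient as slope (a sketch with the sales and advertising data; object names are ours):

```r
# Standardized regression: the slope equals Pearson's r.
sales <- c(2596, 2709, 2552, 3004, 3076, 2513, 2626, 3120, 2751, 2965, 2818, 3171)
adv   <- c(203, 216, 207, 250, 240, 226, 246, 250, 235, 256, 242, 251)

coef(lm(scale(sales) ~ scale(adv)))[2]   # approximately 0.742
cor(sales, adv)                          # identical value
```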
Fig. 2.10 F-distribution and p-values
Figure 2.10 shows that our empirical F-value Femp = 31.50 is much larger than the critical F-value for α = 5%. So our p is almost zero and it is practically impossible that H0 is true. Our estimated regression function for Model 3 is highly significant. Our Models 1 and 2 are also statistically significant, as can be checked with the given values of R-square.
2.2.3.4 Overfitting and Adjusted R-Square
R-square is the most common goodness-of-fit measure, but it has limitations. Everybody strives for a model with a high R-square, but the model with the higher R-square is not necessarily the better model.
• R-square does not take into account the number of observations (sample size N) on which the regression is based. But we will have more trust in an estimation that is based on 50 observations than in one that is based on only 5 observations. In the extreme case with only two observations, a simple regression would always yield R² = 1 since a straight line can always be laid through two points without deviations. But for this, we would not need a regression analysis.
• R-square does not consider the number of independent variables contained in the regression model and thus the complexity of the model. We mentioned the principle of parsimony for model building. Making a model more complex by adding variables (increasing J) will always increase R-square, but not necessarily increase the goodness of the model. The amount of "explanation" added by a new variable may only be a random effect. Moreover, with an increasing number of variables, the precision of the estimates can decrease due to multicollinearity between the variables (see Sect. 2.2.5.7). Also, with too much fitting, called "overfitting", "the model adapts itself too closely to the data, and will not generalize well" (cf. Hastie et al. 2011, p. 38). This especially concerns predictions: We are not interested in predicting a value yi that we already used for estimating the model. We are more interested in predicting a value yN+i that we have not yet observed. And for this, a simpler model may be better than a more complex model, because every parameter in the model contains some error. On the other hand, if the model is omitting relevant variables and is not complex enough, called "underfitting", the estimates of the model parameters will be biased, i.e. contain systematic errors (see Sect. 2.2.5.2). Again, large prediction errors will result. Remember: Modeling is a balancing act between simplicity and complexity, or between underfitting and overfitting.
The inclusion of a variable in the regression model should always be based on logical or theoretical reasoning. It is bad scientific style to haphazardly include several or all available variables into the regression model in the hope of finding some independent variables with a statistically significant influence. This procedure is sometimes called "kitchen sink regression". With today's software and computing power, the calculation is very easy and such a procedure is tempting. As R-square cannot decrease by adding variables to a regression model, it cannot indicate the "badness" caused by overfitting.
For these reasons, in addition to R-square, an adjusted coefficient of determination (adjusted R-square) should also be calculated. With the values in Table 2.7 we get:

R²adj = 1 − (SSR/(N − J − 1)) / (SST/(N − 1)) = 1 − MSR/MST = 1 − 5,896/54,934 = 0.893   (2.40)

with R²adj < R².
The adjusted R-square uses the same information as the F-statistic. Both statistics consider the sample size and the number of parameters. To compare the adjusted R-square with R-square, we can write:

R²adj = 1 − ((N − 1)/(N − J − 1)) · (1 − R²)   (2.41)
The adjusted R-square becomes smaller when the number of regressors increases (other things being equal) and can also become negative. Thus, it penalizes increasing model complexity or overfitting.18 In our example we get the following values:
• Model 1: sales = f(advertising): R²adj = 0.506
• Model 2: sales = f(advertising, price): R²adj = 0.485
• Model 3: sales = f(advertising, price, promotion): R²adj = 0.893
By including price into the model, the adjusted R-square decreases. Price contributes only little to explaining the sales volume and its contribution cannot compensate for the penalty for increasing model complexity. With the inclusion of promotion, we get another picture. Promotion strongly boosts the explained variation. Here the increase of model complexity plays only a minor role.
The term adjusted R-square may be misunderstood, because R²adj is not the square of any correlation. Another name, corrected R-square, is also misleading, because it suggests that R² is false, which is not the case.
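The computation of Eq. (2.41) is easily scripted; the following R sketch (function and argument names are ours) also shows the equivalent calculation from the mean squares used in Eq. (2.40) for Model 3:

```r
# Adjusted R-square according to Eq. (2.41)
adj_r2 <- function(r2, n, j) 1 - (n - 1) / (n - j - 1) * (1 - r2)

# Equivalent computation from the mean squares of Model 3, cf. Eq. (2.40):
1 - 5896 / 54934   # approximately 0.893
```

For a model fitted with lm(), the same quantity is reported by summary() as "Adjusted R-squared".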
2.2.4 Checking the Regression Coefficients
2.2.4.1 Precision of the Regression Coefficient
If the check of the regression function (by F-test or p-value) has shown that our model is statistically significant, the regression coefficients must now be checked individually. By this, we want to get information about their precision (due to random error) and the importance of the corresponding variables. As explained above, the estimated regression parameters bj are realizations of random variables. Thus, the standard deviation of bj, called the standard error of the coefficient, can be used as an inverse measure of precision.
18 Other criteria for model assessment and selection are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). See, e.g., Agresti (2013, p. 212); Greene (2012, p. 179); Hastie et al. (2011, pp. 219–257).
For simple regression, the standard error of b can be calculated as:

SE(b) = SE / (s(x) · √(N − 1))   (2.42)

with
SE     standard error of the regression
s(x)   standard deviation of the independent variable
SE(b) =
164.7 √ = 2.75 18.07 · 12 − 1
We estimated b = 9.63, so the relative standard error is 2.75/9.63 = 0.29 or 29%. It is instructive to examine the formula for SE(b) more closely. To achieve high precision for an estimated coefficient it is not sufficient to get a good model fit, here expressed by the standard error of the regression. Furthermore, precision increases, i.e., the standard error of b gets smaller, when the • standard deviation s(x) of the regressor increases, • sample size N increases. Variation of the x-values and sufficient sample size are essential factors for getting reliable results in regression analyses. To make a comparison, you cannot get a stable position by balancing on one leg. So, if the variance of the x-values and/or the sample size is small, the regression analysis will be a shaky affair. In an experiment, the researcher can control these two conditions. He can manipulate the independent variable(s) and determine the sample size. But mostly we have to cope with observational data. Experiments are not always possible, and a higher larger sample size takes more time and leads to higher costs. For multiple regressions, the formula for the standard error of an estimated coefficient extends to:
SE(bj ) =
SE √ √ s(xj ) · N − 1 · 1 − Rj2
(2.43)
where Rj2 denotes the R-square for a regression of the regressor j on all other independent variables. Rj2 is a measure of multicollinearity (see Sect. 2.2.5.7). It refers to the relationships among the x-variables. The precision of an estimated coefficient increases (other things being equal) with a smaller Rj2, i.e. with less correlation of xj with the other x-variables.
For our Model 3 and variable j = 1 (advertising) we get the following standard error for b1:

SE(b1) = 76.8 / (18.07 · √(12 − 1) · √(1 − 0.089)) = 1.34
We estimated b1 = 7.91. So, the relative standard error for the coefficient of advertising has now decreased to 0.17 or 17%. This is due to a substantial reduction of the standard error of the regression in Model 3.
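Equation (2.43) can be wrapped in a small helper function; a brief R sketch (the function and argument names are ours), checked against the values reported above:

```r
# Standard error of a regression coefficient according to Eq. (2.43)
se_bj <- function(se_reg, s_xj, n, r2_j) {
  se_reg / (s_xj * sqrt(n - 1) * sqrt(1 - r2_j))
}

# Model 1: SE = 164.7, s(x) = 18.07, N = 12, no other regressors (Rj2 = 0)
se_bj(164.7, 18.07, 12, 0)       # approximately 2.75

# Model 3, advertising: SE = 76.8, s(x1) = 18.07, N = 12, Rj2 = 0.089
se_bj(76.8, 18.07, 12, 0.089)    # approximately 1.34
```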
2.2.4.2 t-test of the Regression Coefficient
To test whether a variable Xj has an influence on Y, we have to check whether the regression coefficient βj differs sufficiently from zero. For this, we must test the null hypothesis H0: βj = 0 versus the alternative hypothesis H1: βj ≠ 0. Again, an F-test may be applied. But a t-test is easier and thus more common in this context. While the F-test can be used for testing a group of variables, the t-test is only suitable for testing a single variable.19 For a single variable (df = 1), it holds: F = t². Thus, both tests will give the same results.
The t-statistic (empirical t-value) of an independent variable j is calculated very simply by dividing the regression coefficient by its standard error:

temp = bj / SE(bj)   (2.44)
Under the null hypothesis, the t-statistic follows a Student's t-distribution with N − J − 1 degrees of freedom. In our Model 3 we have 12 − 3 − 1 = 8 df. Figure 2.11 shows the density function of the t-distribution for 8 df with quantiles (critical values) −tα/2 and tα/2 for a two-tailed t-test with an error probability α = 5%. For 8 df we get tα/2 = ±2.306.20
For our Model 3, we get the empirical t-values shown in Table 2.8. All these t-values are clearly outside the region [−2.306, 2.306] and thus statistically significant at α = 5%. The p-values are clearly below 5%.21 So we can conclude that all three marketing variables influence sales.

One-tailed t-test: Rejection Region in the Upper Tail
An advantage of the t-test over the F-test is that it allows the application of a one-tailed test, as the t-distribution has two tails. While the two-tailed t-test is the standard in
19 For a brief summary of the basics of statistical testing see Sect. 1.3.
20 With Excel we can calculate the critical value tα/2 for a two-tailed t-test by using the function T.INV.2T(α;df). We get: T.INV.2T(0.05;8) = 2.306.
21 The p-values can be calculated with Excel by using the function T.DIST.2T(ABS(temp);df). For the variable price we get: T.DIST.2T(3.20;8) = 0.0126 or 1.3%.
Fig. 2.11 t-distribution and critical values for error probability α = 5% (two-tailed t-test)
Table 2.8 Regression coefficients and statistics of Model 3
j   Regressor     bj       Std. error   t-value   p-value
1   Advertising   7.91     1.342        5.89      0.0004
2   Price         −387.6   121.1        −3.20     0.0126
3   Promotion     2.42     0.408        5.93      0.0003
regression analysis, a one-tailed t-test offers greater power since smaller deviations from zero are now statistically significant and thus the danger of a type II error (accepting a wrong null hypothesis) is reduced. But a one-tailed test requires more reasoning and a priori knowledge on the researcher’s side. A one-tailed t-test is appropriate if the test outcome has different consequences depending on the direction of the deviation. Our manager will spend money on advertising only if advertising has a positive effect on sales. He will not spend any money if the effect is zero or if it is negative (while the size of the negative effect does not matter). Thus, he wants to prove the alternative hypothesis
H1: βj > 0 versus the null hypothesis H0: βj ≤ 0   (2.45)

H0 states the opposite of the research question. The decision criterion is:

If temp > tα, then reject H0   (2.46)
Now the critical value for a one-tailed t-test at α = 5% is only tα = 1.86.22 This value is much smaller than the critical value tα/2 = 2.306 for the two-tailed test. As the rejection region is only in the upper tail (right side), the test is also called an upper tail test. The rejection region in the upper tail now has double the size (α instead of α/2). Thus, a lower value of temp is significant. Using the p-value, the decision criterion is the same as before: We reject H0 if p < α.
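The critical values and p-values quoted above can also be obtained with R's t-distribution functions, analogous to the Excel formulas in the footnotes; a brief sketch:

```r
# Critical t-values and p-values for 8 degrees of freedom
qt(0.975, df = 8)        # two-tailed critical value, approximately 2.306
qt(0.95,  df = 8)        # one-tailed critical value, approximately 1.86

2 * pt(-3.20, df = 8)    # two-tailed p-value for the price coefficient, approximately 0.0126
```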
For training purposes, the reader may calculate this bias in Model 1.
We summarize: An omitted variable is relevant if
• it has a significant influence on Y
• and is significantly correlated with the independent variables in the model.
An omitted variable does not cause bias if it is not correlated with the independent variable(s) in the model.

Detection of Omitted Variables
If relevant variables are omitted, E(εi) and Corr(εi, xji) will not be equal to zero. To check this, we have to look at the residuals ei = yi − ŷi. The residuals can be analyzed by numerical or graphical methods. In the present case, a numerical analysis of the residuals is problematic. By construction of the OLS method, the mean value of all the residuals is always zero. Also, the correlations between the residuals and the x-variables will all be zero. So, these statistics are of no help.
Thus, we need graphical methods to check the assumptions. Graphical methods are often more powerful and easier to understand. An important kind of plot is the Tukey-Anscombe plot, which involves plotting the residuals against the fitted y-values (on the horizontal x-axis).28 For simple regression, it is equivalent to plotting the residuals against the x-variable since the fitted y-values are linear combinations of the x-values. According to the assumptions of the regression model, the residuals should scatter randomly and evenly around the x-axis, without any structure or systematic pattern. For an impression, Fig. 2.14 shows a residual plot with purely random scatter (for N = 75 observations). Deviations of the residual scatter from this ideal look would indicate that the model is not correctly specified.
In our Model 1, sales = f(advertising), the variables price and promotion are omitted. Figure 2.15 shows the Tukey-Anscombe plot for this model. The scatterplot deviates from the ideal shape in Fig. 2.14, and the difference would become even more pronounced if we had more observations.
28 Anscombe and Tukey (1963) demonstrated the power of graphical techniques in data analysis.
Fig. 2.14 Scatterplot with purely random residuals (N = 75)
Fig. 2.15 Scatterplot of the residuals for Model 1
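Readers working with R can produce such Tukey-Anscombe plots directly from fitted model objects; a minimal sketch using the data of Table 2.3 (object names are ours):

```r
# Residuals vs. fitted values for Model 1 and Model 3
sales <- c(2596, 2709, 2552, 3004, 3076, 2513, 2626, 3120, 2751, 2965, 2818, 3171)
adv   <- c(203, 216, 207, 250, 240, 226, 246, 250, 235, 256, 242, 251)
price <- c(1.42, 1.41, 1.95, 1.99, 1.63, 1.82, 1.69, 1.65, 1.99, 1.53, 1.69, 1.72)
promo <- c(150, 120, 146, 270, 200, 93, 70, 230, 166, 116, 100, 216)

m1 <- lm(sales ~ adv)                   # Model 1
m3 <- lm(sales ~ adv + price + promo)   # Model 3

op <- par(mfrow = c(1, 2))
plot(fitted(m1), resid(m1), main = "Model 1"); abline(h = 0, lty = 2)
plot(fitted(m3), resid(m3), main = "Model 3"); abline(h = 0, lty = 2)
par(op)
```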
For Model 3, which includes the variables price and promotion, we get the scatterplot in Fig. 2.16. Now the suspicious scatter on the right-hand side of Fig. 2.15 has vanished.

Lurking Variables and Confounding
Choosing the variables in a regression model is the most challenging task on the side of the researcher. It requires knowledge of the problem and logical reasoning. From Eq. (2.60) we infer that a bias can
Fig. 2.16 Scatterplot of the residuals for Model 3
• obscure a true effect,
• exaggerate a true effect,
• give the illusion of a positive effect when the true effect is zero or even negative.
Thus, great care has to be taken when concluding causality from a regression coefficient (cf. Freedman 2002). Causality will be evident if we have experimental data.29 But most data are observational data. To conclude causality from an association or a significant correlation can be very misleading. "Correlation is not causation" is a mantra that is repeated again and again in statistics. The same applies to a regression coefficient. If we want to predict the effects of changes in the independent variables on Y, we have to assume that a causal relationship exists. But regression is blind to causality. Mathematically we can regress an effect Y on its cause X (correct) but also a cause X on its effect Y. Data contain no information about causality, so it is the task of the researcher to interpret a regression coefficient as a causal effect.
A danger is posed by the existence of lurking variables that influence both the dependent variable and the independent variable(s), but are not seen or known and thus are omitted in the regression equation. Such variables are also called confounders (see Fig. 2.17). They are confounding (confusing) the relationship between two variables X and Y.
29 In an experiment the researcher actively changes the independent variable X and observes changes of the dependent variable Y. And, as far as possible, he tries to keep out any other influences on Y. For the design of experiments see e.g. Campbell and Stanley (1966); Green et al. (1988).
Fig. 2.17 Causal diagrams of confounding and mediation: (a) confounder Z, (b) confounder Z, (c) mediator Z
Example
A lot of surprise and confusion were created by a study on the relation between chocolate consumption and the number of Nobel Prize winners in various countries (R-square = 63%).30 It is claimed that the flavanols in dark chocolate (also contained in green tea and red wine) have a positive effect on the cognitive functions. But one should not expect to win a Nobel Prize if only one eats enough chocolate. The confounding variable is probably the wealth or the standard of living in the observed countries. ◄ Causal Diagrams Confounding can be illustrated by the causal diagrams (a) and (b) in Fig. 2.17. In diagram (a) there is no causal relationship between X and Y. The correlation between X and Y, caused by the lurking variable Z, is a non-causal or spurious correlation. If the confounding variable Z is omitted, the estimated regression coefficient is equal to the bias in Eq. (2.61). In diagram (b) the correlation between X and Y has a causal and a non-causal part. The regression coefficient of X will be biased by the non-causal part if the confounder Z is omitted. The bias in the regression is given by Eq. (2.61).31 Another frequent problem in causal analysis is mediation, illustrated in diagram (c). Diagrams (b) and (c) look similar and the dataset of (c) might be the same as in (b), but the causal interpretation is completely different. A classic example of mediation is the placebo effect in medicine: a drug can have a biophysical effect on the body of the patient (direct effect), but it can also act via the patient’s belief in its benefits (indirect
30 Switzerland was the top performer in chocolate consumption and number of Nobel Prizes. See Messerli, F. H. (2012). Chocolate consumption, Cognitive Function and Nobel Laureates. The New England Journal of Medicine, 367(16), 1562–1564.
31 For causal inference in regression see Freedman (2012); Pearl and Mackenzie (2018, p. 72). Problems like this one are covered by path analysis, originally developed by Sewall Wright (1889–1988), and structural equation modeling (SEM), cf. e.g. Kline (2016); Hair et al. (2014).
effect). We will give an example of mediation in the case study (Sect. 2.3). Thus, one must clearly distinguish between a confounder and a mediator.32

Inclusion of Irrelevant Variables
In contrast to the omission of relevant variables (underfitting), a model may also contain too many independent variables (overfitting). This may be a consequence of incomplete theoretical knowledge and the resulting uncertainty. In this case, the researcher may include all available variables in the model so as not to overlook any relevant ones. As discussed in Sect. 2.2.3.4, such models are known as "kitchen sink models" and should be avoided. As with many other things, the same applies here: more is not necessarily better.
2.2.5.3 Random Errors in the Independent Variables
A crucial assumption of the linear regression model is the assumption A3: The independent variables are measured without any error. As stated in the beginning, an analysis is worthless if the data are wrong. But with regard to measurements, we have to distinguish between systematic errors (validity) and random errors (reliability). We permitted random errors for Y. They are absorbed in the error term, which plays a central role in regression analysis. In practical applications of regression analysis, we also encounter random errors in the independent variables. Such errors in measurement may be substantial if the variables are collected by sampling and/or surveys, especially in the social sciences. Examples from marketing are constructs like image, attitude, trust, satisfaction, or brand knowledge that can all influence sales. Such variables can never be measured with perfect reliability. Thus, it is important to know something about the consequences of random errors in the independent variables. We will illustrate this by a small simulation. We choose a very simple model:

Y* = X*
which forms a diagonal line. Now we assume that we can observe Y and X with the random errors εx and εy:
Y = Y* + εy   and   X = X* + εx
We assume the errors are normally distributed with means of zero and standard deviations σεx and σεy . Based on these observations of Y and X we estimate as usual:
Ŷ = a + b · X
32 "Mistaking a mediator for a confounder is one of the deadliest sins in causal inference." (Pearl and Mackenzie 2018, p. 276).
Fig. 2.18 Scenarios with different error sizes (N = 300)
What is important now is that the two similar errors εx and εy have quite different effects on the regression line. We will demonstrate this with the following four scenarios that are illustrated in Fig. 2.18:
1) σεx = σεy = 0: No error. All observations are lying on the diagonal, the true model. By regression we correctly get a = 0 and b = 1. The regression line is identical to the diagonal.
2) σεx = 0, σεy = 50: We induce an error in Y. This is the normal case in regression analysis. Despite considerable random scatter of the observations, the estimated regression line (solid line) shows no visible change.
The slope of the SD line (dashed line) has slightly increased because the standard deviation of Y has been increased by the random error in Y.
3) σεx = 50, σεy = 50: We now induce an error in X that is equal to the error in Y. The regression line moves clockwise. The estimated coefficient b < 0.75 is now biased downward (toward zero). The slope of the SD line has also slightly decreased because the standard deviation of X has been increased by the random error in X. The deviation between the SD line and the regression line has increased because the correlation between X and Y has decreased (random regression effect).
4) σεx = 100, σεy = 50: We now double the error in X. The effects are the same as in 3), but stronger. The coefficient b < 0.5 is now less than half of the true value.
Table 2.12 shows the numerical changes in the four different scenarios.

Table 2.12 Effects of error size on standard deviations, correlation, and estimation

s(ex)   s(ey)   s(x)   s(y)   r(x,y)   a       b
0       0       87     87     1.00     0.0     1.000
0       50      87     100    0.87     −0.1    0.999
50      50      103    100    0.74     67.2    0.724
100     50      137    100    0.57     144.5   0.415

The effect of the measurement error in X can be expressed by
b = β · reliability    (2.62)

where β is the true regression coefficient (here β = 1) and reliability expresses the amount of random error in the measurement of X. We can state:

reliability = σ²(X*) / (σ²(X*) + σ²(εx)) ≤ 1    (2.63)
Reliability is 1 if the variance of the random error in X is zero. The greater the random error, the lower the reliability of the measurement. Diminishing reliability affects the correlation coefficient as well as the regression coefficient. But the effect on the regression coefficient is stronger, as the random error in X also increases the standard deviation of X. The effect of biasing the regression coefficient toward zero is called regression to the mean (moving back to the average), whence regression got its name.33 It is important
33 The expression goes back to Francis Galton (1886), who called it "regression towards mediocrity". Galton wrongly interpreted it as a causal effect in human heredity. It is ironic that the first and most important method of multivariate data analysis got its name from something that means the opposite of what regression analysis actually intends to do. Cf. Kahneman (2011, p. 175); Pearl and Mackenzie (2018, p. 53).
Fig. 2.19 Heteroscedasticity
to note that this is a purely random effect. To mistake it for a causal effect is called the regression fallacy (regression trap).34 In practice, it is difficult to quantify this effect, because we usually do not know the error variances.35 But it is important to know about its existence so as to avoid the regression fallacy. If there are considerable measurement errors in X, the regression coefficient tends to be underestimated (attenuated). This causes non-significant p-values and type II errors in hypothesis testing.
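The consequences of random measurement errors in X can easily be retraced with a small simulation. The following sketch is only an illustration: since the exact data-generating process behind Fig. 2.18 is not documented here, it assumes that X* is drawn from a normal distribution with a standard deviation of about 87 (cf. Table 2.12), and it compares the estimated slope with the attenuation predicted by Eqs. (2.62) and (2.63).

```python
import numpy as np

rng = np.random.default_rng(42)
N = 300
x_true = rng.normal(250, 87, N)     # X* (assumed distribution, sd roughly as in Table 2.12)
y_true = x_true                     # true model: Y* = X*, i.e. beta = 1

for sd_ex, sd_ey in [(0, 0), (0, 50), (50, 50), (100, 50)]:
    x = x_true + rng.normal(0, sd_ex, N)       # observed X with random error
    y = y_true + rng.normal(0, sd_ey, N)       # observed Y with random error
    b, a = np.polyfit(x, y, deg=1)             # OLS estimates of slope and intercept
    reliability = x_true.var() / (x_true.var() + sd_ex**2)   # Eq. (2.63)
    print(f"sd(ex)={sd_ex:3}, sd(ey)={sd_ey:3}: a={a:7.1f}, b={b:.3f}, "
          f"beta*reliability={reliability:.3f}")
```

For the error-free scenario the estimated slope is (almost) exactly 1, while for σεx = 100 it drops to roughly 0.4, in line with the values reported in Table 2.12.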
2.2.5.4 Heteroscedasticity
Assumption 3 of the regression model states that the error terms should have a constant variance. This is called homoscedasticity, and non-constant error variance is called heteroscedasticity. Scedasticity means statistical dispersion or variability and can be measured by variance or standard deviation. As the error term cannot be observed, we again have to look at the residuals. Figure 2.19 shows examples of increasing and decreasing dispersion of the residuals in a Tukey-Anscombe plot.
34 Cf. Freedman et al. (2007, p. 169). In econometric analysis this effect is called least squares attenuation or attenuation bias. Cf., e.g., Kmenta (1997, p. 346); Greene (2012, p. 280); Wooldridge (2016, p. 306).
35 In psychology great efforts have been undertaken, beginning with Charles Spearman in 1904, to measure empirically the reliability of measurement methods and thus derive corrections for attenuation. Cf., e.g., Hair et al. (2014, p. 96); Charles (2005).
Heteroscedasticity does not lead to biased estimators, but the precision of least-squares estimation is impaired. Also, the standard errors of the regression coefficients, their p-values, and the estimation of the confidence intervals become inaccurate. To detect heteroscedasticity, a visual inspection of the residuals by plotting them against the predicted (estimated) values of Y is recommended. If heteroscedasticity is present, a triangular pattern is usually obtained, as shown in Fig. 2.19. Numerical testing methods are provided by the Goldfeld-Quandt test and the method of Glesjer.36

Goldfeld-Quandt test
A well-known test to detect heteroscedasticity is the Goldfeld-Quandt test, in which the sample is split into two sub-samples, e.g. the first and second half of a time series, and the respective variances of the residuals are compared. If perfect homoscedasticity exists, the variances must be identical:
s1² = s2²,
i.e. the ratio of the two variances of the subgroups will be 1. The further the ratio deviates from 1, the more uncertain the assumption of equal variance becomes. If the errors are normally distributed and the assumption of homoscedasticity is correct, the ratio of the variances follows an F-distribution and can, therefore, be tested against the null hypothesis of equal variance:

H0: σ1² = σ2²

The F-test statistic is calculated as follows:

Femp = s1² / s2²   with   s1² = Σ(i=1…N1) ei² / (N1 − J − 1)   and   s2² = Σ(i=1…N2) ei² / (N2 − J − 1)    (2.64)
N1 and N2 are the numbers of cases in the two subgroups and J is the number of independent variables in the regression. The groups are to be arranged in such a way that s1² ≥ s2² applies. The empirical F-value is to be tested at a given significance level against the theoretical F-value for (N1 − J − 1, N2 − J − 1) degrees of freedom.

Method of Glesjer
An easier way to detect heteroscedasticity is the method of Glesjer, in which the absolute residuals are regressed on the regressors:
36 An overview of this test and other tests is given by Kmenta (1997, p. 292); Maddala and Lahiri (2009, p. 214).
Fig. 2.20 Positive and negative autocorrelation
|ei| = β0 + Σ(j=1…J) βj · xji    (2.65)
In the case of homoscedasticity, the null hypothesis H0: βj = 0 (j = 1, 2, …, J) applies. If significant non-zero coefficients result, the assumption of homoscedasticity must be rejected.

Coping with Heteroscedasticity
Heteroscedasticity can be an indication of nonlinearity or the omission of some relevant influence. Thus, the test for heteroscedasticity can also be understood as a test for nonlinearity and we should check for this. In the case of nonlinearity, transforming the dependent variable and/or the independent variables (e.g., to logs) will often help.
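Both numerical checks described above can be sketched in a few lines. The following example uses simulated (hypothetical) data whose error variance grows with the regressor, so the variable names and the data-generating process are purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, J = 100, 1
x = np.sort(rng.uniform(1, 10, N))                 # sorted regressor
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x, N)        # error variance grows with x

X = np.column_stack([np.ones(N), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)       # OLS fit
e = y - X @ coef                                   # residuals

# Goldfeld-Quandt test (Eq. 2.64): compare residual variances of the two halves
e1, e2 = e[: N // 2], e[N // 2:]
s1 = (e1**2).sum() / (len(e1) - J - 1)
s2 = (e2**2).sum() / (len(e2) - J - 1)
F_emp = max(s1, s2) / min(s1, s2)
p_value = stats.f.sf(F_emp, len(e1) - J - 1, len(e2) - J - 1)
print(f"Goldfeld-Quandt: F = {F_emp:.2f}, p = {p_value:.4f}")

# Method of Glesjer (Eq. 2.65): regress absolute residuals on the regressor(s)
glesjer_coef, *_ = np.linalg.lstsq(X, np.abs(e), rcond=None)
print("Glesjer coefficients (b0, b1):", np.round(glesjer_coef, 3))
```

In this setting the Goldfeld-Quandt test yields a small p-value and the Glesjer slope is clearly positive, both pointing to heteroscedasticity; in practice, the significance of the Glesjer coefficients would be judged with the usual t-tests of the auxiliary regression.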
2.2.5.5 Autocorrelation
Assumption 4 of the regression model states that the error terms are uncorrelated. If this condition is not met, we speak of autocorrelation. Autocorrelation occurs mainly in time series but can also occur in cross-sectional data (e.g., due to non-linearity). The deviations from the regression line are then no longer random but depend on the deviations of previous values. This dependency can be positive (successive residual values are close to each other) or negative (successive values fluctuate strongly and change sign). This is illustrated by the Tukey-Anscombe plot in Fig. 2.20. Like heteroscedasticity, autocorrelation usually does not lead to biased estimators, but the efficiency of least-squares estimation is diminished. The standard errors of the
regression coefficients, their p-values, and the estimation of the confidence intervals become inaccurate.

Detection of Autocorrelation
To detect autocorrelation, again a visual inspection of the residuals is recommended by plotting them against the predicted (estimated) values of Y. A computational method for testing for autocorrelation is the Durbin-Watson test. The Durbin-Watson test checks the hypothesis H0 that the errors are not autocorrelated:
Cov(εi, εi+r) = 0 with r ≠ 0

To test this hypothesis, a Durbin-Watson statistic DW is calculated from the residuals:

DW = Σ(i=2…N) (ei − ei−1)² / Σ(i=1…N) ei² ≈ 2 · [1 − Cov(εi, εi−1)]    (2.66)
The formula considers only a first-order autoregression. Values of DW close to 0 or close to 4 indicate autocorrelation, whereas values close to 2 indicate that there is no autocorrelation. It applies:
DW → 0 if positive autocorrelation: Cov(ei, ei−1) = 1.
DW → 4 if negative autocorrelation: Cov(ei, ei−1) = −1.
DW → 2 if no autocorrelation: Cov(ei, ei−1) = 0.
For sample sizes around N = 50, the Durbin-Watson statistic should roughly be between 1.5 and 2.5 if there is no autocorrelation. More exact results can be achieved by using the critical values dL (lower limit) and dU (upper limit) from a Durbin-Watson table. The critical values for a given significance level (e.g., α = 5%) vary with the number of regressors J and the number of observations N. Figure 2.21 illustrates this situation. It shows the acceptance region for the null hypothesis (that there is no autocorrelation) and the rejection regions. And it also shows that there are two regions of inconclusiveness.
Decision Rules for the (Two-sided) Durbin-Watson Test (Test of H0: d = 2):
1. Reject H0 if: DW < dL or DW > 4 − dL (autocorrelation).
2. Do not reject H0 if: dU < DW < 4 − dU (no autocorrelation).
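Equation (2.66) is easily computed from the residuals of a fitted regression. The following minimal sketch uses artificially generated residuals (one autocorrelated series, one independent series) merely to show how the statistic reacts; the data are not taken from the case study.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic, Eq. (2.66): values near 2 suggest no first-order autocorrelation."""
    return np.sum(np.diff(e)**2) / np.sum(e**2)

rng = np.random.default_rng(1)
N = 50

# residuals with positive first-order autocorrelation (AR(1)-like)
e_ar = np.empty(N)
e_ar[0] = rng.normal()
for t in range(1, N):
    e_ar[t] = 0.8 * e_ar[t - 1] + rng.normal()

e_iid = rng.normal(size=N)          # independent residuals

print(f"DW (autocorrelated residuals): {durbin_watson(e_ar):.2f}")   # well below 2
print(f"DW (independent residuals):    {durbin_watson(e_iid):.2f}")  # close to 2
```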
“Logit” is a short form for logarithmic odds (also log-odds) and can also be defined as16 : logit(p) ≡ ln odds(p) (5.34)
By transforming the odds into logits, the range is extended to [−∞, +∞] (see Fig. 5.2, right panel). Following Eq. (5.31), the odds increase by the factor e^b if x increases by one unit. Thus, the logits (log-odds) increase by

ln(e^b) = b    (5.35)
So in our Model 3, the logits increase by b = 1.827 units if x increases by one unit (cf. column 2c in Table 5.12). This makes it easy to calculate with logits. The coefficient b represents the marginal effect of x on logits, just as b is the marginal effect of x on Y in linear regression. If we know the logits, we can calculate the corresponding probabilities, e.g. with Eq. (5.10):
p(x) = 1 / (1 + e^(−z(x)))
Thus, logits are usually not computed from probabilities,17 as Eq. (5.33) might suggest. Instead, logits are used for computing probabilities, e.g. with Eq. (5.10). Table 5.13 summarizes the effects described above.

Odds Ratio and Relative Risk
The ratio of the two odds, as in Eq. (5.31), is called the odds ratio (OR). The odds ratio is an important measure in statistics. It is usually formed by the odds of two separate
16 The name "logit" was introduced by Joseph Berkson in 1944, who used it as an abbreviation for "logistic unit", in analogy to the abbreviation "probit" for "probability unit". Berkson contributed strongly to the development and popularization of logistic regression.
17 That is the reason why we used the "equal by definition" sign in Eqs. (5.33) and (5.34).
groups (populations), e.g. men and women (or test group and control group), thus indicating the difference between the two groups. If we calculate the odds ratio for two values of a metric variable (as we have done for Income above), its size depends on the unit of measurement of that variable and is therefore not very meaningful. In the example above, we get OR ≈ 6 for an increase of income x by one unit. The odds ratio is large because the unit of x is [1000 EUR]. The situation is different with binary variables, which can take only the values 0 and 1 and thus have no unit. In Model 4, we included the gender of the persons as a predictor and estimated the following function:
p = 1 / (1 + e^(−(a + b1·x1k + b2·x2k))) = 1 / (1 + e^(−(−5.635 + 2.351·x1k + 1.751·x2k)))
The binary variable Gender indicates two groups, men and women. With an average income of 2 [1000 EUR], the following probabilities may be calculated for men and women:
Men: pm = 1 / (1 + e^(−(−5.635 + 2.351·2 + 1.751·1))) = 0.694
Women: pw = 1 / (1 + e^(−(−5.635 + 2.351·2 + 1.751·0))) = 0.283
This results in the corresponding odds ratios:
ORm = oddsm / oddsw = [pm / (1 − pm)] / [pw / (1 − pw)] = 2.267 / 0.393 = 5.8
ORw = oddsw / oddsm = [pw / (1 − pw)] / [pm / (1 − pm)] = 0.393 / 2.267 = 0.17
A man’s odds for buying are roughly six times higher than a woman’s. A woman’s odds are less than 20% of a man’s odds.18 This seems to be a very large difference between men and women. Another, similar measure for the difference of two groups is the relative risk (RR), which is the ratio of two probabilities.19 Analogously to the odds ratios we obtain here:
RRm = pm / pw = 0.694 / 0.283 = 2.5
RRw = pw / pm = 0.283 / 0.694 = 0.41
18 Alternatively, we may calculate the odds ratios with Eq. (5.32): ORm = e^b2 = e^1.751 = 5.76 and ORw = e^−b2 = e^−1.751 = 0.174.
19 In common language, the term risk is associated with negative events, such as accidents, illness or death. Here the term risk refers to the probability of any uncertain event.
According to this measure, a man is 2.5 times more likely to buy the chocolate than a woman at the given income. The values of RR are significantly smaller (or, more generally, closer to 1) than the values of the odds ratio OR and often come closer to what we would intuitively assume. However, OR can also be used in situations where calculating RR is not possible.20 The odds ratio, therefore, has a broader range of applications than the relative risk.
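The probabilities, odds ratios and relative risks derived above for Model 4 can be verified with a few lines of code. The sketch below simply plugs the estimated coefficients (a = −5.635, b1 = 2.351, b2 = 1.751) into the logistic function of Eq. (5.10); it is a numerical check, not SPSS output.

```python
import math

def prob(income, gender, a=-5.635, b1=2.351, b2=1.751):
    """Predicted purchase probability from the logistic function, Eq. (5.10)."""
    z = a + b1 * income + b2 * gender
    return 1 / (1 + math.exp(-z))

p_m = prob(income=2, gender=1)    # men at an income of 2 [1000 EUR]
p_w = prob(income=2, gender=0)    # women at an income of 2 [1000 EUR]

odds_m = p_m / (1 - p_m)
odds_w = p_w / (1 - p_w)

print(f"p_m = {p_m:.3f}, p_w = {p_w:.3f}")    # 0.694 and 0.283
print(f"OR_m = {odds_m / odds_w:.2f}")         # about 5.8 (= e^1.751)
print(f"RR_m = {p_m / p_w:.2f}")               # about 2.5
```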
5.2.4 Checking the Overall Model

1. Model formulation
2. Estimation of the logistic regression function
3. Interpretation of the regression coefficients
4. Checking the overall model
5. Checking the estimated coefficients
Once we have estimated a logistic model, we need to assess its quality or goodness-of-fit since no one wants to rely on a bad model. We need to know how well our model fits the empirical data and whether it is suitable as a model of reality. For this purpose, we need measures for evaluating the goodness-of-fit. Such measures are:
• likelihood ratio statistic,
• pseudo-R-square statistics,
• hit rate of the classification and the ROC curve.
In linear regression, the coefficient of determination R² indicates the proportion of explained variation of the dependent variable. Thus, it is a measure for the goodness-of-fit that is easy to calculate and to interpret. Unfortunately, such a measure does not exist for logistic regression, since the dependent variable is not metric. In logistic regression there are several measures for the goodness-of-fit, which can be confusing. Since the maximum likelihood method (ML method) was used to estimate the parameters of the logistic regression model, it seems natural to use the value of the maximized likelihood or the log-likelihood LL (see Fig. 5.12) as a basis for assessing the goodness-of-fit. And indeed, this is the basis for various measures of quality.
20 This can be the case in so-called case-control studies where groups are not formed by random sampling. Thus the size of the groups cannot be used for the estimation of probabilities. Such studies are often carried out for the analysis of rare events, e.g. in epidemiology, medicine or biology (cf. Agresti, 2013, pp. 42–43; Hosmer et al., 2013, pp. 229–230).
A simple measure that is used quite often is the value −2LL (= −2 · LL). Since LL is always negative, −2LL is positive. A small value for −2LL thus indicates a good fit of the model for the available data. The factor "2" is used so that a chi-square distributed test statistic is obtained (see Sect. 5.2.4.1). For Model 4 with the systematic component z = a + b1 x1 + b2 x2 we get:
−2LL = 2 · 16.053 = 32.11
(5.36)
The absolute size of this value says little since LL is a sum according to Eq. (5.25). The value of LL and thus −2LL therefore depends on the sample size N. Both values would, therefore, double if the number of observations were doubled, without changing the estimated values. The size of −2LL is comparable to the sum of squared residuals (SSR) in linear regression, which is minimized by the OLS method (ordinary least squares). For a perfect fit, both values have to be zero. The ML estimation can be performed by either maximizing LL or minimizing −2LL. The −2LL statistic can be used to compare a model with other models (for the same data set). For Model 3, i.e., the simple logistic regression with only one predictor (Income), the systematic component is reduced to z = a + b x and we get:
−2LL = 2 · 18.027 = 36.05
So if variable 2 (Gender) is omitted, the value of −2 LL increases from 32.11 to 36.05, and thus the model fit is reduced. An even simpler model results with the systematic component
z = a = 0.134

It yields: −2LL = 2 · 20.728 = 41.46. This primitive model is called the null model (constant-only model, 0-model) and it has no meaning by itself. But it serves to construct the most important statistic for testing the fit of a logistic model, the likelihood ratio statistic.
5.2.4.1 Likelihood Ratio Statistic
To evaluate the overall quality of the model under investigation (the fitted or full model), we can compare its likelihood with the likelihood of the corresponding 0-model. This leads to the likelihood ratio statistic (the logarithm of the likelihood ratio):

LLR = −2 · ln(Likelihood of the 0-model / Likelihood of the fitted model) = −2 · ln(L0 / Lf) = −2 · (LL0 − LLf)    (5.37)

with
LL0: maximized log-likelihood for the 0-model (constant-only model)
LLf: maximized log-likelihood for the fitted model
Fig. 5.14 Log-likelihood values in the LR test (0 = maximum attainable LL-value; LLf = maximum LL-value considering all predictors; LL0 = maximum LL-value of the 0-model for the given data set; the greater the distance between LL0 and LLf, the better the model)
The logarithm of the ratio of likelihoods is thus equal to the difference in the log-likelihoods. With the above values from our example for multiple logistic regression (Model 4) we get:
LLR = −2 · (LL0 − LLf) = −2 · (−20.728 + 16.053) = 9.35

Under the null hypothesis H0: β1 = β2 = … = βJ = 0, the LR statistic is approximately chi-square distributed with J degrees of freedom (df).21 Thus, we can use LLR to test the statistical significance of a fitted model. This is called the likelihood ratio test (LR test), which is comparable to the F-test in linear regression analysis.22 The tabulated chi-square value for α = 0.05 and 2 degrees of freedom is 5.99. Since LLR = 9.35 > 5.99, the null hypothesis can be rejected and the model is considered to be statistically significant. The p-value (empirical significance level) is only 0.009 and the model can be regarded as highly significant.23 Figure 5.14 illustrates the log-likelihood values used in the LR test.

Comparison of Different Models
Modeling should always be concerned with parsimony. The LR test can also be used to check whether a more complex model provides a significant improvement versus a simpler model. In our example, we can examine whether the inclusion of further predictors (e.g. age or weight) is justified because they would yield a better fit of the model. Conversely, we can examine whether Model 4 has led to significant improvements compared to Model 3 by including the variable Gender. To check this, we use the following likelihood ratio statistic:
21 Thus, in SPSS the LLR statistic is denoted as chi-square. For the likelihood ratio test statistic see, e.g., Agresti (2013, p. 11); Fox (2015, pp. 346–348).
22 For a brief summary of the basics of statistical testing see Sect. 1.3.
23 We can calculate the p-value with Excel by using the function CHISQ.DIST.RT(x;df). Here, we get CHISQ.DIST.RT(9.35;2) = 0.009.
LLR = −2 · ln(Lr / Lf) = −2 · (LLr − LLf)    (5.38)

with
LLr: maximized log-likelihood for the reduced model (Model 3)
LLf: maximized log-likelihood for the full model (Model 4)

With the above values we get:
LLR = −2 · (LLr − LLf) = −2 · (−18.027 + 16.053) = 3.949

The LLR statistic is again approximately chi-square distributed, with the degrees of freedom resulting from the difference in the number of parameters between the two models. In this case, with df = 1, we get a p-value of 0.047. Thus, the improvement of Model 4 compared to Model 3 is statistically significant for α = 0.05. A prerequisite for applying the chi-square distribution is that the models are nested, i.e., the variables of one model must be a subset of the variables of the other model.
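Both LR tests can be reproduced directly from the reported log-likelihoods. The following minimal sketch uses SciPy's chi-square distribution; the log-likelihood values are those given in the text.

```python
from scipy.stats import chi2

LL0, LL3, LL4 = -20.728, -18.027, -16.053   # 0-model, Model 3, Model 4

# Model 4 against the 0-model (Eq. 5.37), 2 degrees of freedom
llr_overall = -2 * (LL0 - LL4)
print(f"LLR = {llr_overall:.2f}, p = {chi2.sf(llr_overall, df=2):.3f}")   # 9.35, p = 0.009

# Model 4 against Model 3 (Eq. 5.38), 1 degree of freedom
llr_nested = -2 * (LL3 - LL4)
print(f"LLR = {llr_nested:.3f}, p = {chi2.sf(llr_nested, df=1):.3f}")     # 3.949, p = 0.047
```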
5.2.4.2 Pseudo-R-Square Statistics
There have been many efforts to create a similar measure for goodness-of-fit in logistic regression as the coefficient of determination R² in linear regression. These efforts resulted in the so-called pseudo-R-square statistics. They resemble R² insofar as
• they can only assume values between 0 and 1,
• a higher value means a better fit.
However, the pseudo-R² statistics do not measure a proportion. They are based on the ratio of two probabilities, like the likelihood ratio statistic. There are three different versions of pseudo-R-square statistics.

(a) McFadden's R²

McF-R² = 1 − LLf / LL0 = 1 − (−16.053) / (−20.728) = 0.226    (5.39)

In contrast to the LLR statistic, which uses the logarithm of the ratio of likelihoods, McFadden uses the ratio of the log-likelihoods. In case of a small difference between the two log-likelihoods (of the fitted model and the null model), the ratio will be close to 1, and McF-R² thus close to 0. This means the estimated model is not much better than the 0-model. Or, in other words, the estimated model is of no value. If there is a big difference between the two log-likelihoods, it is exactly the other way round. But with McFadden's R² it is almost impossible to reach values close to 1 with empirical data. For a value of 1 (perfect fit), the likelihood would have to be 1, and thus the log-likelihood, 0. The values are therefore in practice much lower than for R². As a rule of thumb, values from 0.2 to 0.4 can be considered to indicate a good model fit (Louviere et al., 2000, p. 54).
(b) Cox & Snell R²

R²CS = 1 − (L0 / Lf)^(2/N) = 1 − (exp(−20.728) / exp(−16.053))^(2/30) = 0.268    (5.40)
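Both pseudo-R-square values can be checked with a short calculation from the log-likelihoods; N = 30 is the sample size of the application example.

```python
import math

LL0, LLf, N = -20.728, -16.053, 30

mcfadden = 1 - LLf / LL0                                      # Eq. (5.39)
cox_snell = 1 - (math.exp(LL0) / math.exp(LLf)) ** (2 / N)    # Eq. (5.40)

print(f"McFadden R2    = {mcfadden:.3f}")    # 0.226
print(f"Cox & Snell R2 = {cox_snell:.3f}")   # 0.268
```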
The Cox & Snell R² can take only values smaller than 1; thus, even a perfectly fitting model will deliver values below 1.

… different from zero (>|0.09|). But the anti-image of the variable 'melting' is rather large (0.459), indicating that this variable cannot be explained by the other variables and is thus only weakly correlated with the other variables.
Table 7.6 Anti-image covariance matrix in the application example Milky
Melting
Artificial
Fruity
Refreshing
Milky
0.069 −0.019
−0.019
−0.065
−0.010
0.010
Melting
−0.065
−0.027
0.071
0.009
−0.008
0.010
0.025
Artificial Fruity Refreshing
−0.010
0.459
−0.026
−0.026
0.025
0.009
0.026
−0.008
−0.026
−0.026
−0.027
0.027
Table 7.7 Criteria to assess the suitability of the data

Bartlett test
When is the criterion met? The null hypothesis can be rejected, i.e.: R ≠ I
Application example: fulfilled at the 5% significance level

Kaiser–Meyer–Olkin criterion
When is the criterion met? KMO should be larger than 0.5; a value above 0.8 is recommended
Application example: the KMO of 0.576 is only slightly above the critical value

Measure of sampling adequacy (MSA)
When is the criterion met? MSA should be larger than 0.5 for each variable
Application example: the MSA for the variables 'fruity' and 'refreshing' is below the threshold of 0.5

Anti-image covariance matrix
When is the criterion met? The off-diagonal elements (i.e., negative partial covariances) of the anti-image covariance matrix should be close to zero; data are not suited for factor analysis if 25% or more of the off-diagonal elements are different from zero (>|0.09|)
Application example: data meet the requirements
Conclusion
To assess the suitability of the data for factor analysis, different criteria can be used but none of them is superior (Table 7.7). This is due to the fact that all criteria use the same information to assess the suitability of the data. Therefore, we have to carefully evaluate the different criteria to get a good understanding of the data. For our example, it can be concluded that the initial data are only 'moderately' suitable for a factor analysis. We will now continue with the extraction of factors to illustrate the basic idea of factor analysis.
7.2.2 Extracting the Factors and Determining their Number

1. Evaluating the suitability of data
2. Extracting the factors and determining their number
3. Interpreting the factors
4. Determining the factor scores
While the previous considerations referred to the suitability of initial data for a factor analysis, in the following we will explore the question of how factors can actually be extracted from a data set with highly correlated variables. To illustrate the correlations, we first show how correlations between variables can also be visualized graphically by vectors. The graphical interpretation of correlations helps to illustrate the fundamental theorem of factor analysis and thus the basic principle of factor extraction. Building
on these considerations, various mathematical methods for factor extraction are then presented, with an emphasis on principal component analysis and the factor-analytical procedure of principal axis analysis. These considerations then lead to the question of starting points for determining the number of factors to be extracted in a concrete application.
7.2.2.1 Graphical Illustration of Correlations
In general, correlations can also be displayed in a vector diagram where the correlations are represented by the angles between the vectors. Two vectors are called linearly independent if they are orthogonal to each other (angle = 90°). If two vectors are correlated, the correlation is not equal to 0, and thus, the angle is not equal to 90°. For example, a correlation of 0.5 can be represented graphically by an angle of 60° between two vectors. This can be explained as follows: In Fig. 7.5, the vectors AB and AC represent two variables. The length of the vectors is equal to 1 because we use standardized data. Now imagine the correlation between the
Fig. 7.5 Graphical representation of the correlation coefficients
two variables equals 0.5. With an angle of 60°, the length of AD is equal to 0.5 which is the cosine of a 60° angle. The cosine is the quotient of the adjacent leg and the hypotenuse (i.e., AD/AC). Since AC is equal to 1, the correlation coefficient is equal to the distance AD.

Example 2: correlation matrix with three variables
The above relationship is illustrated by a second example with three variables and the following correlation matrix R:

R =
1
0.8660   1
0.1736   0.6428   1

which is equal to (expressed as angles):

0°
30°   0°
80°   50°   0°  ◄

In example 2, we have chosen the correlations in such a way that a graphical illustration in a two-dimensional space is possible. Figure 7.6 graphically illustrates the relationships between the three variables in the present example. Generally we can state: the smaller the angle, the higher the correlation between two variables. The more variables we consider, the more dimensions we need to position the vectors with their corresponding angles to each other.
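The correspondence between correlations and angles can also be checked numerically: the angle between two standardized variable vectors is simply the arccosine of their correlation. A minimal sketch for the correlation matrix of example 2:

```python
import numpy as np

R = np.array([[1.0,    0.8660, 0.1736],
              [0.8660, 1.0,    0.6428],
              [0.1736, 0.6428, 1.0   ]])

# angle (in degrees) between the variable vectors = arccos of the correlation
angles = np.degrees(np.arccos(np.clip(R, -1.0, 1.0)))
print(np.round(angles))   # 0/30/80, 30/0/50, 80/50/0 degrees
```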
Fig. 7.6 Graphical representation of the correlation matrix with three variables
Fig. 7.7 Factor extraction for two variables with a correlation of 0.5
Graphical factor extraction
Factor analysis strives to reproduce the associations between the variables measured by the correlations with the smallest possible number of factors. The number of axes (= dimension) required to reproduce the associations among the variables then indicates the number of factors. The question now is: how are the axes (i.e., factors) determined in their positions with respect to the relevant vectors (i.e., variables)? The best way to do this is to imagine a half-open umbrella. The struts of the umbrella frame, all pointing in a certain direction and representing the variables, can also be represented approximately by the umbrella stick. Figure 7.7 illustrates the case of two variables. The correlation between the two variables is 0.5, which corresponds to an angle of 60° between the two vectors OA and OB. The vector OC is a good representation of the two vectors OA and OB and thus represents the factor (i.e., the resultant or the total when two or more vectors are added). The angles of 30° between OA and OC as well as OB and OC indicate the correlations between the variables and the factor. This correlation is called factor loading and here equals cos 30° = 0.866.
7.2.2.2 Fundamental Theorem of Factor Analysis
For a general explanation of the fundamental theorem of factor analysis we start with the assumption that the initial data have been standardized.6 Factor analysis now assumes that each observed value (zij) of a standardized variable j in person i can be represented as a linear combination of several (unobserved) factors. We can express this idea with the following equation:

zij = aj1 · pi1 + aj2 · pi2 + … + ajQ · piQ = Σ(q=1…Q) ajq · piq    (7.5)

with
zij: standardized value of observation i for variable j
ajq: weight of factor q for variable j (i.e., factor loading of variable j on factor q)
piq: value of factor q for observation i

The factor loadings ajq indicate how strongly a factor is related to an initial variable. Statistically, factor loadings therefore correspond to the correlation between an observed variable and the extracted factor, which was not observed. As such, factor loadings are a measure of the relationship between a variable and a factor. We can express Eq. (7.5) in matrix notation:
Z = P · A′
(7.6)
The matrix of the standardized data Z has the dimension (N × J), where N is the number of observations (cases) and J equals the number of variables. We observe the standardized data matrix Z, while the matrices P and A are unknown and need to be determined. Here, P reflects the matrix of the factor scores and A is the factor loading matrix. In Eq. (7.1) we showed that the correlation matrix R can be derived from the standardized variables. When we substitute Z by Eq. (7.6), we get:
R = (1/(N−1)) · Z′ · Z = (1/(N−1)) · (P · A′)′ · (P · A′) = (1/(N−1)) · A · P′ · P · A′ = A · ((1/(N−1)) · P′ · P) · A′    (7.7)
Since we use standardized data, (1/(N−1)) · P′ · P in Eq. (7.7) is again a correlation matrix. More specifically, it is the correlation matrix of the factors, and we label it C. Thus, we can write:
R = A · C · A′    (7.8)

6 If a variable xj is transformed into a standardized variable zj, the mean value of zj = 0 and the variance of zj = 1. This results in a considerable simplification in the representation of the following relationships. See the explanations on standardization in Sect. 1.2.1.
Table 7.8 Correlation matrix including corresponding angles in example 3

       x1       x2       x3       x4       x5
x1     1        10°      70°      90°      100°
x2     0.985    1        60°      80°      90°
x3     0.342    0.500    1        20°      30°
x4     0.000    0.174    0.940    1        10°
x5     −0.174   0.000    0.866    0.985    1
The relationship expressed in Eq. (7.7) is called the fundamental theorem of factor analysis, which states that the correlation matrix of the initial data can be reproduced by the factor loading matrix A and the correlation matrix of the factors C. Generally, factor analysis assumes that the extracted factors are uncorrelated. Thus, C corresponds to an identity matrix. The multiplication of a matrix with an identity matrix results in the initial matrix, and therefore Eq. (7.7) may be simplified to:
R = A · A′
(7.9)
Assuming independent (uncorrelated) factors, the empirical correlation matrix can be reproduced by the factor loadings matrix A.
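The fundamental theorem can be made tangible with a small simulation. The sketch below is only an illustration (it is not taken from the book's application example): three hypothetical variables are generated from two uncorrelated, standardized factors via Z = P · A′, with loading rows of unit length, so that the empirical correlation matrix approximately equals A · A′.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000                        # large N, so sample correlations are close to A*A'

# loadings of 3 illustrative variables on 2 factors; each row has unit length
angles = np.radians([0, 30, 80])
A = np.column_stack([np.cos(angles), np.sin(angles)])

P = rng.standard_normal((N, 2))    # factor scores: uncorrelated, mean 0, variance 1
Z = P @ A.T                        # Eq. (7.6): Z = P * A'

print(np.round(A @ A.T, 3))            # model-implied correlation matrix, Eq. (7.9)
print(np.round(np.corrcoef(Z.T), 3))   # empirical correlations (approximately equal)
```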
7.2.2.3 Graphical Factor Extraction
In the following, a further example is used to show graphically how factors can be extracted when three or more variables have been collected. Again, we illustrate the correlations graphically and choose the vector representation for the correlations between variables and factors.

Example 3: correlation matrix of five variables
We now observe five variables, and the correlations are chosen in such a way that we can depict the interrelations in two-dimensional space—which will hardly be the case in reality.7 Table 7.8 shows the correlation matrix for the example, with the upper triangular matrix containing the angle specifications belonging to the correlations. ◄ Figure 7.8 visualizes the correlations of example 3 in a two-dimensional space. To extract the first factor, we search for the center of gravity (centroid) of the five vectors which is actually the resultant of the five vectors. If the five vectors represented five ropes tied to a weight in 0 and five persons were pulling with equal strength on each one of the ends of the ropes, the weight would move in a certain direction. This direction is indicated by the dashed line in Fig. 7.9, which is the graphical representation of the first factor (factor 1).
7 Please note that example 3 does not correspond to the application example in Sect. 7.2.1.
Fig. 7.8 Graphical representation of the correlation in example 3
Fig. 7.9 Graphical representation of the center of gravity
We can derive the factor loadings with the help of the angles between the variables and the vector of the first factor. For example, the angle between the first factor and x1 equals 55°12′ (= 45°12′ + 10°), which corresponds to a factor loading of 0.571. Table 7.9 shows the factor loadings for all five variables. Since factor analysis searches for factors that are independent (uncorrelated), a second factor should be orthogonal to the first factor (Fig. 7.9). Table 7.10 shows the factor loadings for the corresponding second factor (factor 2). The negative factor loadings of x1 and x2 indicate that the respective factor 2 is negatively correlated with the corresponding variables.
If the extracted factors fully explained the variance of the observed variables, the sum of the squared factor loadings for each variable would be equal to 1 (so-called unit variance). This relationship can be explained as follows:
1. By standardizing the initial variables, we end up with a mean of 0 and a standard deviation of 1. Since the variance is the squared standard deviation, the variance also equals 1: sj² = 1.
2. The variance of each standardized variable j is the main diagonal element of the correlation matrix (variance-covariance matrix) and it is the correlation of a variable with itself: sj² = 1 = rjj.
3. If the factors completely reproduce the variance of the initial standardized variables, the sum of the squared factor loadings will be 1.
Table 7.9 Factor loadings for the one-factor solution in example 3

Variable    Factor 1
x1          0.571
x2          0.705
x3          0.967
x4          0.821
x5          0.710

Table 7.10 Factor loadings for the two-factor solution in example 3

Variable    Factor 1    Factor 2
x1          0.571       −0.821
x2          0.705       −0.710
x3          0.967       0.255
x4          0.821       0.571
x5          0.710       0.705
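The graphically derived loadings in Tables 7.9 and 7.10 can also be checked numerically. In the sketch below the five variables are placed at the angles implied by Table 7.8 (0°, 10°, 70°, 90°, 100°), the first factor is the direction of the resultant (centroid) of the five unit vectors, and the second factor is orthogonal to it; the loadings are then the cosines of the corresponding angles.

```python
import numpy as np

theta = np.radians([0, 10, 70, 90, 100])           # variable angles from Table 7.8
vectors = np.column_stack([np.cos(theta), np.sin(theta)])

resultant = vectors.sum(axis=0)                     # centroid direction
f1 = resultant / np.linalg.norm(resultant)          # factor 1 (unit vector)
f2 = np.array([-f1[1], f1[0]])                      # factor 2, orthogonal to factor 1

print(round(np.degrees(np.arctan2(f1[1], f1[0])), 1))   # about 55.2 degrees (= 55°12')
print(np.round(vectors @ f1, 3))   # factor 1 loadings: 0.571, 0.705, 0.967, 0.821, 0.710
print(np.round(vectors @ f2, 3))   # factor 2 loadings: -0.821, -0.710, 0.255, 0.571, 0.705
```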
Fig. 7.10 Graphical presentation of the case in which all variances in the variables are explained
To illustrate this, let us take an example where two variables are reproduced by two factors (Fig. 7.10). The factor loadings are the cosine of the angles between the vectors reflecting the variables and factors. For x1, the factor loadings are 0.866 (= cos 30°) for factor 1 and 0.5 (= cos 60°) for factor 2. The sum of the squared factor loadings is 1 (= 0.866² + 0.5²). According to Fig. 7.10, we can express the factor loadings of x1 on the factors 1 and 2 as follows:
OC/OA for factor 1 and OD/OA for factor 2. If the two factors completely reproduce the standardized variance of the initial variables, the following relation has to be true:

(OC/OA)² + (OD/OA)² = 1,
which is actually the case. Thus, the variance of a standardized output variable can also be calculated using factor loadings as follows:

sj² = rjj = aj1² + aj2² + … + ajQ² = Σ(q=1…Q) ajq²    (7.10)

with
ajq: factor loading of variable j on factor q
The factor loadings represent the model parameters of the factor-analytical model which can be used to calculate the so-called model-theoretical (reproduced) correlation matrix (R̂). The parameters (factor loadings ajq) must now be determined in such a way that the difference between the empirical correlation matrix (R) and the model-theoretical correlation matrix (R̂), which is calculated with the derived factor loadings, is as small as possible (cf. Loehlin, 2004, p. 160). The objective function is therefore:

F = R − R̂ → Min.!
7.2.2.4 Mathematical Methods of Factor Extraction
In Sect. 7.1 it was pointed out that factor analysis can be used to pursue two different objectives:
1. to reduce a large number of correlated variables to a smaller set of factors (dimensions).
2. to reveal the causes (factors) responsible for the correlations between variables.
For objective 1 we use principal component analysis (PCA). We look for a small number of factors (principal components) which preserve a maximum of the variance (information) contained in the variables. Of course, this requires a trade-off between the smallest possible number of factors and a minimal loss of information. If we extract all possible components, the fundamental theorem shown in Eq. (7.8) applies:
R = A · A′

with
R: correlation matrix
A: factor-loading matrix
A′: transposed factor-loading matrix

For objective 2, we use factor analysis (FA) defined in a narrower sense. The factors are interpreted as the causes of the observed variables and their correlations. In this case,
it is assumed that the factors do not explain all the variance in the variables. Thus, the correlation matrix cannot be completely reproduced by the factor loadings and the fundamental theorem is transformed to:
R = A · A′ + U
(7.11)
where U is a diagonal matrix that contains unique variances of the variables that cannot be explained by the factors.8 While principal component analysis (objective 1) pursues a more pragmatic purpose (data reduction), factor analysis (objective 2) is used in a more theoretical context (finding and investigating hypotheses). So, many researchers strictly separate between principal component analysis and factor analysis and treat PCA as a procedure independent of FA, and indeed, PCA and FA are based on fundamentally different theoretical models. But both approaches follow the same steps (cf. Fig. 7.3) and use the same mathematical methods. Besides, they usually also provide very similar results. For these reasons, PCA is listed in many statistical programs as the default extraction procedure in the context of factor analysis (as it is in SPSS).

7.2.2.4.1 Principal component analysis (PCA)
The basic principle of principal component analysis (PCA) is illustrated by an example with 300 observations of two variables, which were first standardized. The variance of a variable is a measure of the information that is contained in the variable. If a variable has a variance of zero, it does not contain any information. Otherwise, after standardization, each variable has a variance of 1. Figure 7.11 shows the scatterplot of the two standardized variables Z1 and Z2. Each point represents an observation. Furthermore, the straight line (solid line) represents the first principal component (PC_1) of these two variables. It minimizes the distances between the observations and the straight line. This line accounts for the maximum variance (information) contained in the two variables. The variance of the projections of the observed points on the solid line is equal to s² = 1.596. Since the variance of each standardized variable is 1, the total variance of the data is 2, so the line explains 80% (1.596/2 = 0.80) of the total variance. Figure 7.11 also contains the second principal component (PC_2) (dashed line), which was determined perpendicular (orthogonal) to the first principal component. The second principal component can explain the remaining 20% of the total information (variance). Thus, PC_2 represents a significantly lower share of the total information in the data set compared to PC_1. In favor of a more parsimonious presentation of the data, PC_2 could also be omitted. The example clearly shows that PCA tries to reproduce a large part of the variance in a data set with only one or a few components.
8 Standardized variables with a unit variance of 1 are assumed. For the decomposition of the variance of an output variable, see also the explanations in Sect. 7.2.2.4.2 and especially Fig. 7.13.
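The 80/20 split described for Fig. 7.11 can be reproduced with a short sketch. Since the book's data set itself is not reproduced here, the example below generates 300 hypothetical observations of two standardized variables with a correlation of about 0.6, for which the two eigenvalues of the correlation matrix (i.e., the variances of the two principal components) are roughly 1.6 and 0.4.

```python
import numpy as np

rng = np.random.default_rng(3)
N, r = 300, 0.6
z1 = rng.standard_normal(N)
z2 = r * z1 + np.sqrt(1 - r**2) * rng.standard_normal(N)   # correlation of about 0.6

Z = np.column_stack([z1, z2])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)                    # standardize both variables

R = np.corrcoef(Z.T)                                        # correlation matrix
eigenvalues = np.linalg.eigvalsh(R)[::-1]                   # variances of PC_1 and PC_2

print(np.round(eigenvalues, 3))                       # roughly [1.6, 0.4]
print(np.round(eigenvalues / eigenvalues.sum(), 2))   # explained variance shares, about 0.8 / 0.2
```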
Fig. 7.11 Scatterplot of 300 observations and two principal components
This simple graphical illustration is also reflected in the fundamental theorem of factor analysis, which, according to Eq. (7.8), can reproduce the empirical correlation matrix (R) completely via the factor loading matrix (A) if as many principal components as variables are extracted.

Example
Let us return to our example in Sect. 7.2.1 with the correlation matrix shown in Table 7.3.9 The upper part of Table 7.11 shows the factor loading matrix resulting from these data if all five principal components (as many as there are variables) are extracted with PCA. With these loadings, the correlation matrix can be reproduced according to Eq. (7.8).
9 Remember that we use standardized variables and therefore the variance of each variable is 1 and the total variance in the data set is 5.
The lower part of Table 7.11 presents the squares of the component loadings (factor loadings) (ajq²). If these are summed up over the rows (over the components), we get the variance of a variable that is covered by the extracted components according to Eq. (7.9). As the variance of a standardized variable is 1, and as all possible 5 components were extracted (Q = J), the sum of the squared loadings for each variable equals 1. This sum is called the communality of a variable j:

Communality of variable j: hj² = Σ(q=1…Q) ajq²    (7.12)
The communality is a measure of the variance (information) of a variable j that can be explained by the components. Of course, nothing would be gained if we extracted as many components as there are variables, as our objective is a reduction of the data or dimensions. But the communality tells us how much variance (information) of any variable is explained by the smaller set of components (Q